Teradata SQL 联接与最近邻近似匹配

Teradata SQL join with approximate match by nearest Neighbor

提问人:Ahmed Mohammed Abdel Kader 提问时间:12/6/2022 更新时间:12/7/2022 访问量:100

问:

我有两个表,我需要用不完全匹配的键连接这些表,它们可能在最后 3 位或 4 位数字中有所不同,没有固定的模式,连接将有两个条件:table1table2

  • 第一个条件:必须数值大于,因此 join 将搜索最近的大键table2.Key_2table1.Key_1
  • 第二个条件:可以有两个键与一个键大致匹配,因此联接将考虑最近的大键,另一个返回 Nulltable1table2

|Key_1             |
|------------------|
|111330317223636588|
|121199074223629055|
|47256714523667238 |
|101348062023590858|
|106331320423644206|
|106061712623646625|
|120562195823631381|
|155570817823642550|
|114948476223640334|
|103285939423669298|
|103285939423669910|
|Key_2              |Value|
|-------------------+-----|
|111330317223636610 |Done |
|121199074223629090 |Done |
|47256714523667300  |Done |
|101348062023590900 |Done |
|106331320423644305 |Done |
|106061712623647100 |Done |
|120562195823631399 |Done |
|155570817823642677 |Done |
|114948476223640455 |Done |
|103285939423669988 |Done |

预期结果如下:

|Key_1              |Value|
|-------------------+-----|
|111330317223636588 |Done |
|121199074223629055 |Done |
|47256714523667238  |Done |
|101348062023590858 |Done |
|106331320423644206 |Done |
|106061712623646625 |Done |
|120562195823631381 |Done |
|155570817823642550 |Done |
|114948476223640334 |Done |
|103285939423669298 |NULL | 
|103285939423669910 |Done |
SQL 匹配 Teradata 最近邻 近似

评论


答:

2赞 dnoeth 12/7/2022 #1

下一个值上的此连接很难表示为连接,但使用这种方法很简单:

with cte as
 ( -- combine both key columns into one 
   select key_1, cast(null as varchar(10)) as val from table1
   union all
   select key_2 as key_1, "value" from table2
  -- order by 1
 )
select 
   key_1
   -- return the next rows value
  ,lead(val)
   over (order by key_1) as "value"
  ,val
from cte
-- filter only rows from the first table
qualify val is null

cte 结合了两个表:

 Key_1              val <-- NULLs indicate rows from table_1
 ------------------ ----- 
  47256714523667238 NULL 
  47256714523667300 Done  
 101348062023590858 NULL  
 101348062023590900 Done  
 103285939423669298 NULL
 103285939423669910 NULL  
 103285939423669988 Done  
 106061712623646625 NULL  
 106061712623647100 Done  
 106331320423644206 NULL  
 106331320423644305 Done  
 111330317223636588 NULL  
 111330317223636610 Done  
 114948476223640334 NULL  
 114948476223640455 Done  
 120562195823631381 NULL  
 120562195823631399 Done  
 121199074223629055 NULL  
 121199074223629090 Done  
 155570817823642550 NULL  
 155570817823642677 Done  

LEAD查找下一行的值:

 Key_1              val  value 
 ------------------ ---- ----- 
  47256714523667238 NULL Done 
  47256714523667300 Done NULL 
 101348062023590858 NULL Done 
 101348062023590900 Done NULL 
 103285939423669298 NULL NULL 
 103285939423669910 NULL Done 
 103285939423669988 Done NULL 
 106061712623646625 NULL Done 
 106061712623647100 Done NULL 
 106331320423644206 NULL Done 
 106331320423644305 Done NULL 
 111330317223636588 NULL Done 
 111330317223636610 Done NULL 
 114948476223640334 NULL Done 
 114948476223640455 Done NULL 
 120562195823631381 NULL Done 
 120562195823631399 Done NULL 
 121199074223629055 NULL Done 
 121199074223629090 Done NULL 
 155570817823642550 NULL Done 
 155570817823642677 Done NULL 

最终的 QUALIFY 从第 2 个表中删除所有行

 Key_1              value 
 ------------------ ----- 
  47256714523667238 Done 
 101348062023590858 Done 
 103285939423669298 NULL 
 103285939423669910 Done 
 106061712623646625 Done 
 106331320423644206 Done 
 111330317223636588 Done 
 114948476223640334 Done 
 120562195823631381 Done 
 121199074223629055 Done 
 155570817823642550 Done 

评论

0赞 Ahmed Mohammed Abdel Kader 12/7/2022
提供的解决方案非常适合解释的示例,但在我使用的实际数据中,两个表中都会选择多个列,因此并集效率不高,并且每个表超过 6000 万行。我正在寻找连接技术的解决方案
1赞 dnoeth 12/7/2022
基于联接的解决方案将基于非等联接,因此会产生 60 mio * 60 mio 行的产品联接。虽然我的方法可以很容易地修改为仅使用两个表的 PK 列,然后重新加入基表,但应该更有效率。