提问人:Ahmed Mohammed Abdel Kader 提问时间:12/6/2022 更新时间:12/7/2022 访问量:100
Teradata SQL 联接与最近邻近似匹配
Teradata SQL join with approximate match by nearest Neighbor
问:
我有两个表,我需要用不完全匹配的键连接这些表,它们可能在最后 3 位或 4 位数字中有所不同,没有固定的模式,连接将有两个条件:table1
table2
- 第一个条件:必须数值大于,因此 join 将搜索最近的大键
table2.Key_2
table1.Key_1
- 第二个条件:可以有两个键与一个键大致匹配,因此联接将考虑最近的大键,另一个返回 Null
table1
table2
|Key_1 |
|------------------|
|111330317223636588|
|121199074223629055|
|47256714523667238 |
|101348062023590858|
|106331320423644206|
|106061712623646625|
|120562195823631381|
|155570817823642550|
|114948476223640334|
|103285939423669298|
|103285939423669910|
|Key_2 |Value|
|-------------------+-----|
|111330317223636610 |Done |
|121199074223629090 |Done |
|47256714523667300 |Done |
|101348062023590900 |Done |
|106331320423644305 |Done |
|106061712623647100 |Done |
|120562195823631399 |Done |
|155570817823642677 |Done |
|114948476223640455 |Done |
|103285939423669988 |Done |
预期结果如下:
|Key_1 |Value|
|-------------------+-----|
|111330317223636588 |Done |
|121199074223629055 |Done |
|47256714523667238 |Done |
|101348062023590858 |Done |
|106331320423644206 |Done |
|106061712623646625 |Done |
|120562195823631381 |Done |
|155570817823642550 |Done |
|114948476223640334 |Done |
|103285939423669298 |NULL |
|103285939423669910 |Done |
答:
2赞
dnoeth
12/7/2022
#1
下一个值上的此连接很难表示为连接,但使用这种方法很简单:
with cte as
( -- combine both key columns into one
select key_1, cast(null as varchar(10)) as val from table1
union all
select key_2 as key_1, "value" from table2
-- order by 1
)
select
key_1
-- return the next rows value
,lead(val)
over (order by key_1) as "value"
,val
from cte
-- filter only rows from the first table
qualify val is null
cte 结合了两个表:
Key_1 val <-- NULLs indicate rows from table_1
------------------ -----
47256714523667238 NULL
47256714523667300 Done
101348062023590858 NULL
101348062023590900 Done
103285939423669298 NULL
103285939423669910 NULL
103285939423669988 Done
106061712623646625 NULL
106061712623647100 Done
106331320423644206 NULL
106331320423644305 Done
111330317223636588 NULL
111330317223636610 Done
114948476223640334 NULL
114948476223640455 Done
120562195823631381 NULL
120562195823631399 Done
121199074223629055 NULL
121199074223629090 Done
155570817823642550 NULL
155570817823642677 Done
LEAD查找下一行的值:
Key_1 val value
------------------ ---- -----
47256714523667238 NULL Done
47256714523667300 Done NULL
101348062023590858 NULL Done
101348062023590900 Done NULL
103285939423669298 NULL NULL
103285939423669910 NULL Done
103285939423669988 Done NULL
106061712623646625 NULL Done
106061712623647100 Done NULL
106331320423644206 NULL Done
106331320423644305 Done NULL
111330317223636588 NULL Done
111330317223636610 Done NULL
114948476223640334 NULL Done
114948476223640455 Done NULL
120562195823631381 NULL Done
120562195823631399 Done NULL
121199074223629055 NULL Done
121199074223629090 Done NULL
155570817823642550 NULL Done
155570817823642677 Done NULL
最终的 QUALIFY 从第 2 个表中删除所有行
Key_1 value
------------------ -----
47256714523667238 Done
101348062023590858 Done
103285939423669298 NULL
103285939423669910 Done
106061712623646625 Done
106331320423644206 Done
111330317223636588 Done
114948476223640334 Done
120562195823631381 Done
121199074223629055 Done
155570817823642550 Done
评论
0赞
Ahmed Mohammed Abdel Kader
12/7/2022
提供的解决方案非常适合解释的示例,但在我使用的实际数据中,两个表中都会选择多个列,因此并集效率不高,并且每个表超过 6000 万行。我正在寻找连接技术的解决方案
1赞
dnoeth
12/7/2022
基于联接的解决方案将基于非等联接,因此会产生 60 mio * 60 mio 行的产品联接。虽然我的方法可以很容易地修改为仅使用两个表的 PK 列,然后重新加入基表,但应该更有效率。
评论