提问人:Arthur Spoon 提问时间:8/16/2023 最后编辑:Arthur Spoon 更新时间:8/17/2023 访问量:39
SQL Redshift - 计算类似列表的字符串列中的重复值
SQL Redshift - Count duplicate values in list-like string column
问:
我正在使用一个表来处理 Redshift SQL,该表有一个字符串列,其中包含一个逗号分隔的列表,如下所示(实际数据有 15 个值,而不仅仅是 5 个):
grid_state
------------------
"T0,NA,NA,NA,NA,"
"T3,T3,T3,NA,NA,"
"U0,T4,T3,T4,NA,"
"T1,NA,NA,T1,NA,"
"T3,T3,NA,T2,NA,"
"T2,T2,T0,T1,NA,"
"T3,T3,T3,T2,T1,"
"U0,T4,T3,T4,U0,"
"T2,T2,T2,T1,T1,"
"T3,T3,T3,T0,NA,"
我特别想做的是,忽略值,计算每个字符串中有多少对重复值,所以这里的结果应该是:NA
duplicate_pairs
----------------
0
1
1
1
1
1
1
2
2
1
如果我使用常规编程语言,我会写一个漂亮的小函数来迭代值或更高级的东西,但我对 SQL 非常陌生,虽然我知道有一种使用函数的方法,但我对它们不是很有信心,并且还没有在网上找到一个足够接近我需要使用它的示例。
我目前正在做的基本上是我想在函数中做的事情,但是有非常混乱的重复语句和新的列创建,这并没有完全优化或可扩展:case when
with messy_actions as (
select
replace(grid_state, 'NA,', '') as grid_monsters_0,
case when len(grid_monsters_0) > 0
then regexp_count(grid_monsters_0, left(grid_monsters_0, 3)) / 2 end as duplicate_pairs_0,
case when len(grid_monsters_0) > 0
then replace(grid_monsters_0, left(grid_monsters_0, 3), '') end as grid_monsters_1,
case when len(grid_monsters_1) > 0
then regexp_count(grid_monsters_1, left(grid_monsters_1, 3)) / 2 end as duplicate_pairs_1,
case when len(grid_monsters_1) > 0
then replace(grid_monsters_1, left(grid_monsters_1, 3), '') end as grid_monsters_2,
case when len(grid_monsters_2) > 0
then regexp_count(grid_monsters_2, left(grid_monsters_2, 3)) / 2 end as duplicate_pairs_2,
case when len(grid_monsters_2) > 0
then replace(grid_monsters_2, left(grid_monsters_2, 3), '') end as grid_monsters_3,
case when len(grid_monsters_3) > 0
then regexp_count(grid_monsters_3, left(grid_monsters_3, 3)) / 2 end as duplicate_pairs_3,
case when len(grid_monsters_3) > 0
then replace(grid_monsters_3, left(grid_monsters_3, 3), '') end as grid_monsters_4,
case when len(grid_monsters_4) > 0
then regexp_count(grid_monsters_4, left(grid_monsters_4, 3)) / 2 end as duplicate_pairs_4,
from actions
)
select
duplicate_pairs_0 + duplicate_pairs_1 + duplicate_pairs_2 +
duplicate_pairs_3 + duplicate_pairs_4 as duplicate_pairs
这在某种程度上比检查每个可能值的(甚至更多)暴力解决方案更可取,因为目前有 45 个可能的值,并且列表将来可能会扩展。regexp_count
从我在这个答案中看到的内容来看,解决方案可能是定义一个函数并在语句中使用它,但我也读到这可能也不是最好的,正如我所说,我还不太习惯用 SQL 编写函数(更习惯于 Python 和 R)。select
编辑:
- 第一句话应该更准确地说明我正在处理的表有一个这样的字符串列以及许多其他列。
- 我正在使用的表有后端团队对我施加的限制,即我必须使用一组通用的列,这些列与我们拥有的所有其他项目通用,并巧妙地命名,因此保存规范化的数据不是一种选择。
generic_[int|string|bool|float]_[0-5]
答:
1赞
dfundako
8/16/2023
#1
下面是分步演练,说明如何通过拆分逗号上的每个值,然后计算该值的出现次数来执行此操作:
create temp table sample_table (val varchar(100));
insert into sample_table
values
('T0,NA,NA,NA,NA,'),
('T3,T3,T3,NA,NA,'),
('U0,T4,T3,T4,NA,'),
('T1,NA,NA,T1,NA,'),
('T3,T3,NA,T2,NA,'),
('T2,T2,T0,T1,NA,'),
('T3,T3,T3,T2,T1,'),
('U0,T4,T3,T4,U0,'),
('T2,T2,T2,T1,T1,'),
('T3,T3,T3,T0,NA,');
--take out NA vals and give each row an id
with recursive cte AS (
select row_number() OVER (order by 1) AS id, replace(val, 'NA,', '') AS newval
from sample_table
),
-- setup a list of numbers 1-15 that we'll use for splitting
numbers(counter_) AS (
select 1 AS counter_
union all
select counter_ + 1
from numbers
where counter_ <= 15
),
-- cross join numbers to all vals to use for splitting
split_setup AS (
select *
from cte
cross join numbers
),
-- split the vals from comma lists to rows
split_vals AS (
select id,
split_part(newval,',', counter_) AS splitvals
from split_setup
where splitvals != ''
),
-- count occurrences of each val
final AS (
select id,
splitvals, count(*) AS valcounter
from split_vals
group by 1,2
)
-- check if there are dupes
select id,
sum(case when valcounter > 1 THEN 1 ELSE 0 END) AS duplicate_pairs
from final
group by 1
order by id
评论
0赞
Arthur Spoon
8/16/2023
非常感谢,明天第一件事就是尝试一下,如果一切正常,请接受!
上一个:无法使用 cd 向前移动
评论