SQL Redshift - 计算类似列表的字符串列中的重复值

SQL Redshift - Count duplicate values in list-like string column

提问人:Arthur Spoon 提问时间:8/16/2023 最后编辑:Arthur Spoon 更新时间:8/17/2023 访问量:39

问:

我正在使用一个表来处理 Redshift SQL,该表有一个字符串列,其中包含一个逗号分隔的列表,如下所示(实际数据有 15 个值,而不仅仅是 5 个):

grid_state
------------------
"T0,NA,NA,NA,NA,"
"T3,T3,T3,NA,NA,"
"U0,T4,T3,T4,NA,"
"T1,NA,NA,T1,NA,"
"T3,T3,NA,T2,NA,"
"T2,T2,T0,T1,NA,"
"T3,T3,T3,T2,T1,"
"U0,T4,T3,T4,U0,"
"T2,T2,T2,T1,T1,"
"T3,T3,T3,T0,NA,"

我特别想做的是,忽略值,计算每个字符串中有多少对重复值,所以这里的结果应该是:NA

duplicate_pairs
----------------
0
1
1
1
1
1
1
2
2
1

如果我使用常规编程语言,我会写一个漂亮的小函数来迭代值或更高级的东西,但我对 SQL 非常陌生,虽然我知道有一种使用函数的方法,但我对它们不是很有信心,并且还没有在网上找到一个足够接近我需要使用它的示例。

我目前正在做的基本上是我想在函数中做的事情,但是有非常混乱的重复语句和新的列创建,这并没有完全优化或可扩展:case when

with messy_actions as (
    select
        replace(grid_state, 'NA,', '') as grid_monsters_0,
        case when len(grid_monsters_0) > 0
            then regexp_count(grid_monsters_0, left(grid_monsters_0, 3)) / 2 end as duplicate_pairs_0,
        case when len(grid_monsters_0) > 0
            then replace(grid_monsters_0, left(grid_monsters_0, 3), '') end as grid_monsters_1,
        case when len(grid_monsters_1) > 0
            then regexp_count(grid_monsters_1, left(grid_monsters_1, 3)) / 2 end as duplicate_pairs_1,
        case when len(grid_monsters_1) > 0
            then replace(grid_monsters_1, left(grid_monsters_1, 3), '') end as grid_monsters_2,
        case when len(grid_monsters_2) > 0
            then regexp_count(grid_monsters_2, left(grid_monsters_2, 3)) / 2 end as duplicate_pairs_2,
        case when len(grid_monsters_2) > 0
            then replace(grid_monsters_2, left(grid_monsters_2, 3), '') end as grid_monsters_3,
        case when len(grid_monsters_3) > 0
            then regexp_count(grid_monsters_3, left(grid_monsters_3, 3)) / 2 end as duplicate_pairs_3,
        case when len(grid_monsters_3) > 0
            then replace(grid_monsters_3, left(grid_monsters_3, 3), '') end as grid_monsters_4,
        case when len(grid_monsters_4) > 0
            then regexp_count(grid_monsters_4, left(grid_monsters_4, 3)) / 2 end as duplicate_pairs_4,
    from actions
)
select
    duplicate_pairs_0 + duplicate_pairs_1 + duplicate_pairs_2 +
    duplicate_pairs_3 + duplicate_pairs_4 as duplicate_pairs

这在某种程度上比检查每个可能值的(甚至更多)暴力解决方案更可取,因为目前有 45 个可能的值,并且列表将来可能会扩展。regexp_count

从我在这个答案中看到的内容来看,解决方案可能是定义一个函数并在语句中使用它,但我也读到这可能也不是最好的,正如我所说,我还不太习惯用 SQL 编写函数(更习惯于 Python 和 R)。select

编辑:

  1. 第一句话应该更准确地说明我正在处理的表有一个这样的字符串列以及许多其他列。
  2. 我正在使用的表有后端团队对我施加的限制,即我必须使用一组通用的列,这些列与我们拥有的所有其他项目通用,并巧妙地命名,因此保存规范化的数据不是一种选择。generic_[int|string|bool|float]_[0-5]
SQL 字符串 函数 复制 Amazon-Redshift

评论

0赞 nbk 8/16/2023
保存标准化的数据,从而省去了拆分数据的麻烦,这会花费大量资源。查看 stackoverflow.com/questions/3653462/...
0赞 Arthur Spoon 8/17/2023
感谢您的评论@nbk,通读该问题和答案非常有趣,但是我无法保存在我的情况下标准化的数据,我现在为问题添加了一些上下文:)
0赞 nbk 8/17/2023
然后拆分字符串,你有一个规范化的版本,你需要一个唯一的列 cc,这样你就可以识别你没有提供的组,所以这个问题不能得到很好的回答

答:

1赞 dfundako 8/16/2023 #1

下面是分步演练,说明如何通过拆分逗号上的每个值,然后计算该值的出现次数来执行此操作:

create temp table sample_table (val varchar(100));
insert into sample_table
    values
('T0,NA,NA,NA,NA,'),
('T3,T3,T3,NA,NA,'),
('U0,T4,T3,T4,NA,'),
('T1,NA,NA,T1,NA,'),
('T3,T3,NA,T2,NA,'),
('T2,T2,T0,T1,NA,'),
('T3,T3,T3,T2,T1,'),
('U0,T4,T3,T4,U0,'),
('T2,T2,T2,T1,T1,'),
('T3,T3,T3,T0,NA,');

--take out NA vals and give each row an id
with recursive cte AS (
    select row_number() OVER (order by 1) AS id, replace(val, 'NA,', '') AS newval
    from sample_table
),
-- setup a list of numbers 1-15 that we'll use for splitting
numbers(counter_) AS (
    select 1 AS counter_
    union all
    select counter_ + 1
    from numbers
    where counter_ <= 15
),
-- cross join numbers to all vals to use for splitting
split_setup AS (
    select *
    from cte
    cross join numbers
),
-- split the vals from comma lists to rows
split_vals AS (
    select id,
           split_part(newval,',', counter_) AS splitvals
    from split_setup
    where splitvals != ''
),
-- count occurrences of each val
final AS (
    select id,
           splitvals, count(*) AS valcounter
from split_vals
group by 1,2
)
-- check if there are dupes
select id,
       sum(case when valcounter > 1 THEN 1 ELSE 0 END) AS duplicate_pairs

from final

group by 1

order by id

评论

0赞 Arthur Spoon 8/16/2023
非常感谢,明天第一件事就是尝试一下,如果一切正常,请接受!