提问人:Ekaterina Kochurova 提问时间:10/24/2023 最后编辑:SelVaziEkaterina Kochurova 更新时间:10/24/2023 访问量:30
在子窗口 SQL Presto 中执行具有特定条件的值计数
Perform Value Count with Specific Conditions inside subwindows SQL Presto
问:
假设我有一个包含 3 列的表:id、date_time、color。数据如下所示:
id, date_time, color
1, 2023-10-01 12:15, green
1, 2023-10-01 12:16, yellow
1, 2023-10-01 12:17, yellow
1, 2023-10-01 12:18, red
1, 2023-10-01 12:19, yellow
1, 2023-10-01 12:20, yellow
1, 2023-10-01 12:21, red
1, 2023-10-01 12:22, red
1, 2023-10-01 12:23, green
1, 2023-10-01 12:24, yellow
1, 2023-10-01 12:25, yellow
1, 2023-10-01 12:26, red
2, 2023-10-01 12:27, red
2, 2023-10-01 12:28, green
2, 2023-10-01 12:29, green
2, 2023-10-01 12:30, yellow
我需要计算值“黄色”出现在“颜色”列中的频率,按“id”列分组,按date_time排序。但是,我有特定条件:
- 只有当“黄色”出现在“绿色”之后时,我才想计算它。
- 只有当“黄色”后面跟着第一个“红色”,或者它是“id”定义的组中的最后一个值时,我才想计算“黄色”。
它看起来像组中的子窗口。
我在 AWS Athena 中使用 SQL Presto,我认为我应该使用窗口函数,但我不确定如何指定这些条件。
提前感谢您的提示
因此,预期结果应该是:
For id=1: Count "yellow" = 4
For id=2: Count "yellow" = 1
我试过这个,但缺少在条件下重复黄色的累积计数器。
`with testdata(id, date_time, color) as (
VALUES
(1, cast('2023-10-01 12:15:00' as timestamp), 'green'),
(1, cast('2023-10-01 12:16:00' as timestamp), 'yellow'),
(1, cast('2023-10-01 12:17:00' as timestamp), 'yellow'),
(1, cast('2023-10-01 12:18:00' as timestamp), 'red'),
(1, cast('2023-10-01 12:19:00' as timestamp), 'yellow'),
(1, cast('2023-10-01 12:20:00' as timestamp), 'yellow'),
(1, cast('2023-10-01 12:21:00' as timestamp), 'red'),
(1, cast('2023-10-01 12:22:00' as timestamp), 'red'),
(1, cast('2023-10-01 12:23:00' as timestamp), 'green'),
(1, cast('2023-10-01 12:24:00' as timestamp), 'yellow'),
(1, cast('2023-10-01 12:25:00' as timestamp), 'yellow'),
(1, cast('2023-10-01 12:26:00' as timestamp), 'red'),
(2, cast('2023-10-01 12:27:00' as timestamp), 'red'),
(2, cast('2023-10-01 12:28:00' as timestamp), 'green'),
(2, cast('2023-10-01 12:29:00' as timestamp), 'green'),
(2, cast('2023-10-01 12:30:00' as timestamp), 'yellow')
)
,t1 as (
SELECT id,
date_time,
color,
LAG(color) OVER (
PARTITION BY id
ORDER BY date_time
) AS prev_color,
LEAD(color) OVER (
PARTITION BY id
ORDER BY date_time
) AS next_color
FROM testdata
)
select id,
SUM(
CASE
WHEN color = 'yellow'
AND (
prev_color = 'green'
and (
next_color IS NULL
OR next_color = 'red'
OR next_color = 'yellow'
)
) THEN 1 ELSE 0
END
) AS yellow_count
FROM t1
group by id`
我有价值观
- 对于 id=1:计数“黄色”= 2(不正确)
- 对于 id=2:计数“黄色”= 1(正确)
答:
1赞
SelVazi
10/24/2023
#1
你有一个间隙和孤岛问题,你可以利用两个row_numbers之间的差值来为每组连续的行提供一个唯一的id:
with data (id, date_time, color) as (
VALUES
(1, cast('2023-10-01 12:15:00' as timestamp), 'green'),(1, cast('2023-10-01 12:16:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:17:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:18:00' as timestamp), 'red'),(1, cast('2023-10-01 12:19:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:20:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:21:00' as timestamp), 'red'),(1, cast('2023-10-01 12:22:00' as timestamp), 'red'),(1, cast('2023-10-01 12:23:00' as timestamp), 'green'),(1, cast('2023-10-01 12:24:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:25:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:26:00' as timestamp), 'red'),(2, cast('2023-10-01 12:27:00' as timestamp), 'red'),(2, cast('2023-10-01 12:28:00' as timestamp), 'green'),(2, cast('2023-10-01 12:29:00' as timestamp), 'green'),(2, cast('2023-10-01 12:30:00' as timestamp), 'yellow')
),
cte AS (
SELECT id, date_time, color,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY date_time)
- ROW_NUMBER() OVER (PARTITION BY id, color ORDER BY date_time) AS rn
FROM data
),
cte2 AS (
SELECT id, rn,
MAX(color) AS color,
MAX(date_time) AS date_time,
SUM(CASE WHEN color = 'yellow' THEN 1 END) AS total
FROM cte c1
GROUP BY id, rn
ORDER BY date_time
),
cte3 AS (
SELECT *, LAG(color) OVER ( PARTITION BY id ORDER BY date_time) AS prev_color,
LEAD(color) OVER ( PARTITION BY id ORDER BY date_time) AS next_color
FROM cte2
)
SELECT id, SUM(total) AS total
FROM cte3
WHERE color = 'yellow' AND prev_color = 'green' AND ( next_color = 'red' OR next_color is null)
GROUP BY id
解释:
使用两个row_numbers之间的差值为每个连续的行组提供唯一的 ID。
按 id 和 rn 对数据进行分组,以便轻松获取每种颜色的上一个和下一个值。
然后对 cte3 的结果应用条件以获得预期的数据。
结果:
编号 | 总 |
---|---|
1 | 4 |
2 | 1 |
评论