在子窗口 SQL Presto 中执行具有特定条件的值计数

Perform Value Count with Specific Conditions inside subwindows SQL Presto

提问人:Ekaterina Kochurova 提问时间:10/24/2023 最后编辑:SelVaziEkaterina Kochurova 更新时间:10/24/2023 访问量:30

问:

假设我有一个包含 3 列的表:id、date_time、color。数据如下所示:

id, date_time, color
1, 2023-10-01 12:15, green
1, 2023-10-01 12:16, yellow
1, 2023-10-01 12:17, yellow
1, 2023-10-01 12:18, red
1, 2023-10-01 12:19, yellow
1, 2023-10-01 12:20, yellow
1, 2023-10-01 12:21, red
1, 2023-10-01 12:22, red
1, 2023-10-01 12:23, green
1, 2023-10-01 12:24, yellow
1, 2023-10-01 12:25, yellow
1, 2023-10-01 12:26, red
2, 2023-10-01 12:27, red
2, 2023-10-01 12:28, green
2, 2023-10-01 12:29, green
2, 2023-10-01 12:30, yellow

我需要计算值“黄色”出现在“颜色”列中的频率,按“id”列分组,按date_time排序。但是,我有特定条件:

  1. 只有当“黄色”出现在“绿色”之后时,我才想计算它。
  2. 只有当“黄色”后面跟着第一个“红色”,或者它是“id”定义的组中的最后一个值时,我才想计算“黄色”。

它看起来像组中的子窗口。

我在 AWS Athena 中使用 SQL Presto,我认为我应该使用窗口函数,但我不确定如何指定这些条件。

提前感谢您的提示

因此,预期结果应该是:

For id=1: Count "yellow" = 4
For id=2: Count "yellow" = 1

我试过这个,但缺少在条件下重复黄色的累积计数器。

`with testdata(id, date_time, color) as (
  VALUES 
  (1, cast('2023-10-01 12:15:00' as timestamp), 'green'),
  (1, cast('2023-10-01 12:16:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:17:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:18:00' as timestamp), 'red'),
  (1, cast('2023-10-01 12:19:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:20:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:21:00' as timestamp), 'red'),
  (1, cast('2023-10-01 12:22:00' as timestamp), 'red'),
  (1, cast('2023-10-01 12:23:00' as timestamp), 'green'),
  (1, cast('2023-10-01 12:24:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:25:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:26:00' as timestamp), 'red'),
  (2, cast('2023-10-01 12:27:00' as timestamp), 'red'),
  (2, cast('2023-10-01 12:28:00' as timestamp), 'green'),
  (2, cast('2023-10-01 12:29:00' as timestamp), 'green'),
  (2, cast('2023-10-01 12:30:00' as timestamp), 'yellow')
)
,t1 as (
  SELECT id,
    date_time,
    color,
    LAG(color) OVER (
      PARTITION BY id
      ORDER BY date_time
    ) AS prev_color,
    LEAD(color) OVER (
      PARTITION BY id
      ORDER BY date_time
    ) AS next_color
  FROM testdata
)
select id,
  SUM(
    CASE
      WHEN color = 'yellow'
      AND (
        prev_color = 'green'
        and (
          next_color IS NULL
          OR next_color = 'red'
          OR next_color = 'yellow'
        )
      ) THEN 1 ELSE 0
    END
  )  AS yellow_count
FROM t1
group by id`

我有价值观

  • 对于 id=1:计数“黄色”= 2(不正确)
  • 对于 id=2:计数“黄色”= 1(正确)
SQL 窗口函数 Amazon-Athena Presto 间隙和岛屿

评论


答:

1赞 SelVazi 10/24/2023 #1

你有一个间隙和孤岛问题,你可以利用两个row_numbers之间的差值来为每组连续的行提供一个唯一的id:

with data (id, date_time, color) as (
  VALUES 
  (1, cast('2023-10-01 12:15:00' as timestamp), 'green'),(1, cast('2023-10-01 12:16:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:17:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:18:00' as timestamp), 'red'),(1, cast('2023-10-01 12:19:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:20:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:21:00' as timestamp), 'red'),(1, cast('2023-10-01 12:22:00' as timestamp), 'red'),(1, cast('2023-10-01 12:23:00' as timestamp), 'green'),(1, cast('2023-10-01 12:24:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:25:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:26:00' as timestamp), 'red'),(2, cast('2023-10-01 12:27:00' as timestamp), 'red'),(2, cast('2023-10-01 12:28:00' as timestamp), 'green'),(2, cast('2023-10-01 12:29:00' as timestamp), 'green'),(2, cast('2023-10-01 12:30:00' as timestamp), 'yellow')
),
cte AS (
  SELECT id, date_time, color, 
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY date_time)
         - ROW_NUMBER() OVER (PARTITION BY id, color ORDER BY date_time) AS rn
  FROM data
),
cte2 AS (
  SELECT id, rn, 
         MAX(color) AS color,
         MAX(date_time) AS date_time,
         SUM(CASE WHEN color = 'yellow' THEN 1 END) AS total
  FROM cte c1
  GROUP BY id, rn
  ORDER BY date_time
),
cte3 AS (
  SELECT *, LAG(color) OVER ( PARTITION BY id ORDER BY date_time) AS prev_color,
          LEAD(color) OVER ( PARTITION BY id ORDER BY date_time) AS next_color 
  FROM cte2
)
SELECT id, SUM(total) AS total
FROM cte3
WHERE color = 'yellow' AND prev_color = 'green' AND ( next_color = 'red' OR next_color is null)
GROUP BY id

解释:

  1. 使用两个row_numbers之间的差值为每个连续的行组提供唯一的 ID。

  2. 按 id 和 rn 对数据进行分组,以便轻松获取每种颜色的上一个和下一个值。

  3. 然后对 cte3 的结果应用条件以获得预期的数据。

结果:

编号
1 4
2 1

postgresql 演示