用于提取连续时间戳范围的 SQL

SQL to Extract Continuous Timestamp Ranges

提问人:Net Dawg 提问时间:10/14/2023 最后编辑:Net Dawg 更新时间:10/18/2023 访问量:90

问:

表Device_Status

同上 地位 时间戳
1 积极 2023-01-13T18.00.01.0187528
2 积极 2023-01-13T18.00.01.0187529
1 失败 2023-01-13T18.00.01.0187530
3 积极 2023-01-13T18.00.01.0187531
1 失败 2023-01-13T18.00.01.0187532
1 积极 2023-01-13T18.00.01.0187533
3 积极 2023-01-13T18.00.01.0187534
1 失败 2023-01-13T18.00.01.0187535
4 失败 2023-01-13T18.00.01.0187536
1 积极 2023-01-13T18.00.01.0187537

预期输出(需要SQL查询生成):

编号 Fail_Begin Fail_End
1 2023-01-13T18.00.01.0187530 2023-01-13T18.00.01.0187532
1 2023-01-13T18.00.01.0187535 2023-01-13T18.00.01.0187535
4 2023-01-13T18.00.01.0187536 2023-01-13T18.00.01.0187536

基本上,对于每个 ID,获取 min(时间戳)和 max(时间戳),但要覆盖该 ID 的连续记录。如果只有一条记录,则 min=max,就像样本结果集中的第二条和第三条记录一样。

我已经尝试过这个(以及其中的各种子查询变体)

SELECT Id, min(Timestamp) AS Fail_Begin, max(Timestamp) AS Fail_End
FROM Device_Status
GROUP BY Id

但需要仅按连续发生的记录进行分组,

因此,也许首先添加状态更改的概念会有所帮助?从零开始作为初始状态,然后在每次下一条记录不同时将更改代码递增 1,从而生成这样的中间结果......

表Device_Status_With_Change_Column

同上 地位 改变 时间戳
1 积极 0 2023-01-13T18.00.01.0187528
2 积极 0 2023-01-13T18.00.01.0187529
1 失败 1 2023-01-13T18.00.01.0187530
3 积极 0 2023-01-13T18.00.01.0187531
1 失败 1 2023-01-13T18.00.01.0187532
1 积极 2 2023-01-13T18.00.01.0187533
3 积极 0 2023-01-13T18.00.01.0187534
1 失败 3 2023-01-13T18.00.01.0187535
4 失败 0 2023-01-13T18.00.01.0187536
1 积极 4 2023-01-13T18.00.01.0187537

然后做

SELECT Id, Change, min(Timestamp) AS Fail_Begin, max(Timestamp) AS Fail_End
FROM Device_Status_With_Change_Column
GROUP BY Id, Change

除了在编程语言中循环遍历结果集之外,我还没有看到一个直接的 SQL 语句可以在没有中间表的情况下一举完成此操作,并且我看不到如何计算列 Change(在 SQL 中)。

SQL 循环 联接 子查询 sql-timestamp

评论

3赞 Tim Biegeleisen 10/14/2023
Need SQL query to generate...您应该检查预期的输出,因为它与文本描述不一致。

答:

1赞 Ajax1234 10/14/2023 #1

下面的解决方案首先使用子查询将行号与每条记录相关联,该行号按块分区。然后,子查询被编辑到自身:failedidjoin

with cte as (
   select row_number() over (partition by t.id order by t.timestamp) r, t.* 
   from device_status t where t.status = 'Failed'
)
select c.id, c.timestamp fail_start, coalesce(c1.timestamp, c.timestamp) fail_end
from cte c left join cte c1 on c.id = c1.id and c.r + 1 = c1.r
where c.r % 2 = 1

看小提琴

评论

0赞 Net Dawg 10/15/2023
你能检查一下代码吗?基本上,该错误很容易被发现。在最后一个“活动”之后同时插入几条“失败”记录。SQL 查询结果中应该正好有一条附加记录,而不是有几条记录。(请参阅 dbfiddle.uk/bZBi_ye1)
1赞 Net Dawg 10/14/2023 #2

按用户@Ajax1234的解决方案在原始帖子中生成结果。但是,当使用多个串联“失败”记录进行压力测试时,它会生成对,而不是根据这些附加表条目的需要生成一条附加记录。

DDL/Schema 和初始示例数据,针对 PostGreSQL 版本 16 进行了测试,更改/修剪了解析工作的时间戳,并添加了几个串联的“失败”记录:

CREATE TYPE status AS ENUM ('Active', 'Failed');
DROP TABLE IF EXISTS Device_Status;
CREATE TABLE Device_Status
(
  ID integer,
  Status status,
  temptime text
);
INSERT INTO Device_Status VALUES (1,'Active','2023-01-13 10.00.01.528');
INSERT INTO Device_Status VALUES (2,'Active','2023-01-13 10.00.01.529');
INSERT INTO Device_Status VALUES (1,'Failed','2023-01-13 10.00.01.530');
INSERT INTO Device_Status VALUES (3,'Active','2023-01-13 10.00.01.531');
INSERT INTO Device_Status VALUES (1,'Failed','2023-01-13 10.00.01.532');
INSERT INTO Device_Status VALUES (1,'Active','2023-01-13 10.00.01.533');
INSERT INTO Device_Status VALUES (3,'Active','2023-01-13 10.00.01.534');
INSERT INTO Device_Status VALUES (1,'Failed','2023-01-13 10.00.01.535');
INSERT INTO Device_Status VALUES (4,'Failed','2023-01-13 10.00.01.536');
INSERT INTO Device_Status VALUES (1,'Active','2023-01-13 10.00.01.537');
insert into device_status values (1,'Failed','2023-01-13 10:00:01.538'), 
                                 (1,'Failed','2023-01-13 10:00:01.539'), 
                                 (1,'Failed','2023-01-13 10:00:01.540'), 
                                 (1,'Failed','2023-01-13 10:00:01.541'), 
                                 (1,'Failed','2023-01-13 10:00:01.542'), 
                                 (1,'Failed','2023-01-13 10:00:01.543'), 
                                 (1,'Failed','2023-01-13 10:00:01.544'),
                                 (1,'Failed','2023-01-13 10:00:01.545'), 
                                 (1,'Failed','2023-01-13 10:00:01.546'),
                                 (1,'Failed','2023-01-13 10:00:01.547');
ALTER TABLE Device_Status ADD timestamp TIMESTAMP;
UPDATE Device_Status set timestamp = 
      to_timestamp(temptime,'YYYY-MM-DD HH:MI:SS.MS');
ALTER TABLE Device_Status DROP COLUMN temptime;
WITH Device_Status_With_Change_Column AS
(
  WITH flag AS
  (
    WITH lag AS
    (
      SELECT *, 
             LAG(status) OVER (PARTITION BY Id ORDER BY TimeStamp) Last_Status,
             LAG(Id,1,0) OVER (PARTITION BY Id ORDER BY TimeStamp) Last_Id
      FROM Device_Status
    )
    SELECT *, 
           CASE
           WHEN lag.last_status IS NULL THEN 0
           WHEN lag.last_status IS DISTINCT FROM lag.status 
            AND lag.last_id = lag.id  
            THEN 1
           ELSE 0
         END Change_Flag
     FROM lag
  )
  SELECT *, 
         SUM(change_flag) OVER (PARTITION BY Id ORDER BY TimeStamp) Change
  FROM flag
)
SELECT Id, min(Timestamp) AS Fail_Begin, max(Timestamp) AS Fail_End
FROM Device_Status_With_Change_Column
WHERE status = 'Failed'
GROUP BY Id, Change
ORDER BY Id

结果(小提琴):

编号 fail_start fail_end
1 2023-01-13 10:00:01.53 2023-01-13 10:00:01.532
1 2023-01-13 10:00:01.535 2023-01-13 10:00:01.535
1 2023-01-13 10:00:01.538 2023-01-13 10:00:01.547
4 2023-01-13 10:00:01.536 2023-01-13 10:00:01.536

最里面的 CTE 使用 SQL LAG 函数为以前的条目生成一列(即,对于 id 和 status,恰好是相应列的前一行,如果是 id,则默认为零)。当同一 ID 的先前状态从“活动”更改为“失败”时,这将被投影为输入 1,反之亦然。任何 ID 的第一个实例都将具有无条件零,并且相对于任何设备的先前记录的日志条目的状态没有变化的情况也是如此。要创建 CTE,使用累积 SUM 函数对Change_Flags求和,以记录按时间戳排序时每个设备 ID 的状态更改序列号。“更改”列中的零可确保 SUM 将具有相同更改序列的所有连续失败或活动记录中放置相同的整数。换句话说,要获得结果集的“活动”图像,我们只需要更改 WHERE 子句。因此,只需在 DDL 和查询中添加相应的单词,即可解决任意数量的状态,例如“暂停”、“离线”。lagflagDevice_Status_With_Change_Column

完整的 CTE(小提琴):Device_Status_With_Change_Column

SELECT * 
FROM Device_Status_With_Change_Column

我通过在包含 1300 万条记录的数据库上@Ajax1234来测试基于联接的查询,以生成大约 20,000 条结果集,并且比基于 CTE 的解决方案更快(24 秒对 42 秒)。

SQL比做循环要强大得多!使用 LAG、SUM 和 CASE 语句查看 CTE。对关系乘积的类似联接的运算性能更高,如果阅读和理解有点困难,因此需要测试。

评论

0赞 Net Dawg 10/18/2023
对于非数字 Id,请使用 LAG(Id,1,NULL) 而不是 LAG(Id,1,0) 或根据 RDBMS 正确类型转换默认值。
0赞 Net Dawg 10/18/2023 #3

我自己问题的另一个答案。

小提琴

WITH cte AS -- the common table expression (CTE)
(
   -- table projection with all leading and lagging columns to then inform begin and end of failed states 
    SELECT 
            LAG(id) OVER (PARTITION BY id ORDER BY timestamp) previous_id, 
            LAG(status) OVER (PARTITION BY id ORDER BY timestamp) previous_status, 
            LAG(timestamp) OVER (PARTITION BY id ORDER BY timestamp) previous_timestamp,
            id, 
            status,
            timestamp,
            LEAD(id) OVER (PARTITION BY id ORDER BY timestamp) next_id, 
            LEAD(status) OVER (PARTITION BY id ORDER BY timestamp) next_status, 
            LEAD(timestamp) OVER (PARTITION BY id ORDER BY timestamp) next_timestamp
  
    FROM device_status
),
-- AND WITH (comment for better readability of multiple queries being introduced on the CTE 
-- fail start table being defined as FS, 
fs AS 
(
  -- the order of fail start and, later similarly, failed are preserved in row_number
  SELECT ROW_NUMBER() OVER (ORDER BY id, fail_start) r, x.* FROM 
  ( 
    -- if next _status changed to failed for same id that is the start timestamp.  
    SELECT id, next_timestamp AS fail_start FROM cte
    WHERE next_status = 'Failed' AND next_status IS DISTINCT FROM status AND next_id = id
     UNION
    -- making an exception for when it is the only fail record
    SELECT id, timestamp AS fail_start FROM cte
    WHERE previous_status IS NULL AND status = 'Failed'
   ) x
),
-- AND WITH
-- fail end table FE, mirror image of FS
fe AS 
(
  SELECT ROW_NUMBER() OVER (ORDER BY id, fail_end) r, y.* FROM 
  (   
    SELECT id, previous_timestamp AS fail_end FROM cte
    WHERE previous_status = 'Failed' AND previous_status IS DISTINCT FROM status AND next_id = id

      UNION

    SELECT id, timestamp AS fail_end FROM cte
    WHERE next_status IS NULL AND status = 'Failed'
  ) y
)
-- simple join of row numbers of FS and FE to generate the required result
SELECT fs.id, fs.fail_start, fe.fail_end FROM fs JOIN fe ON fs.r = fe.r

在此解决方案中,我没有尝试计算更改列,而是对称地使用 LEAD 和 LAG 函数,在按 ID 分区并按时间戳排序时,首先将每条记录及其上一条和下一条记录对齐(按 ID 排序的替代解决方案,时间戳被证明用处不大),然后在专门编造的表中搜索状态开始和结束,以保证的方式揭示此模式。最好通过 Fiddle 中组件子查询的中间结果来理解。

上一页-下一页(LAG/LEAD 视图)

然后,我将任何 FAILED 开始时间戳前面都以非失败为前提的逻辑转换为 SQL,但时间戳日志中唯一的故障或首次出现除外。

失败启动

及时显示失败状态结束的镜像将如下所示

失败结束

最终答案只是按行号连接失败开始和失败结束查询,由于自我投影,这保证有效,并且还引入了按 ID 排序,以将结果与原始帖子和接受的答案完全匹配。

具体来说,这里最大的区别在于,这不会像帖子中最初设想的那样生成建议的更改列、求和和分组依据(以便从所需的结果向后工作)。

因此,我将保留先前的答案作为可接受的答案。

结果是相同的。在几个数据集中,性能被标记为“太接近,无法调用”,每个数据集都有数百万条状态记录,并具有多个状态枚举(而不仅仅是玩具数据集中的二进制“失败”和“活动”)。

此外,为了支持公认的答案,我仍然喜欢它,因为查询对我来说更容易阅读,也可能是其他人,因此可以理解、维护为代码。