识别 SQL 表中的最新步骤序列

Identifying latest sequence of steps in SQL table

提问人:idiocache 提问时间:3/8/2023 最后编辑:jarlhidiocache 更新时间:4/4/2023 访问量:70

问:

我有一个 SQL 表,其中包含两个不同配方的时间戳步骤序列。我试图找出一种识别“当前”或“最新”配方的方法。我的源表看起来有点像这样:

食谱 时间戳
1 2023-03-07 21:56:55 step_3
1 2023-03-07 21:56:50 step_3
1 2023-03-07 21:56:45 step_3
1 2023-03-07 21:56:40 step_2
1 2023-03-07 21:56:35 step_2
1 2023-03-07 21:56:30 step_2
1 2023-03-07 21:56:25 step_1
1 2023-03-07 21:56:20 step_1
1 2023-03-07 21:56:15 step_1
2 2023-03-07 21:56:55 step_3
2 2023-03-07 21:56:50 step_3
2 2023-03-07 21:56:45 step_3
2 2023-03-07 21:56:40 step_2
2 2023-03-07 21:56:35 step_2
2 2023-03-07 21:56:30 step_2
2 2023-03-07 21:56:25 step_1
2 2023-03-07 21:56:20 step_1
2 2023-03-07 21:56:15 step_1
1 2023-03-07 21:56:10 step_3
1 2023-03-07 21:56:05 step_3
1 2023-03-07 21:56:00 step_3
1 2023-03-07 21:55:55 step_2
1 2023-03-07 21:55:50 step_2
1 2023-03-07 21:55:45 step_2
1 2023-03-07 21:55:40 step_1
1 2023-03-07 21:55:35 step_1
1 2023-03-07 21:55:30 step_1

我所追求的状态看起来有点像这样:

食谱 时间戳 当前
1 2023-03-07 21:56:55 step_3
1 2023-03-07 21:56:50 step_3
1 2023-03-07 21:56:45 step_3
1 2023-03-07 21:56:40 step_2
1 2023-03-07 21:56:35 step_2
1 2023-03-07 21:56:30 step_2
1 2023-03-07 21:56:25 step_1
1 2023-03-07 21:56:20 step_1
1 2023-03-07 21:56:15 step_1
2 2023-03-07 21:56:55 step_3
2 2023-03-07 21:56:50 step_3
2 2023-03-07 21:56:45 step_3
2 2023-03-07 21:56:40 step_2
2 2023-03-07 21:56:35 step_2
2 2023-03-07 21:56:30 step_2
2 2023-03-07 21:56:25 step_1
2 2023-03-07 21:56:20 step_1
2 2023-03-07 21:56:15 step_1
1 2023-03-07 21:56:10 step_3
1 2023-03-07 21:56:05 step_3
1 2023-03-07 21:56:00 step_3
1 2023-03-07 21:55:55 step_2
1 2023-03-07 21:55:50 step_2
1 2023-03-07 21:55:45 step_2
1 2023-03-07 21:55:40 step_1
1 2023-03-07 21:55:35 step_1
1 2023-03-07 21:55:30 step_1

到目前为止,我已经设法生成了一个新列,该列使用以下代码显示每个步骤之间的秒数差异,按配方/步骤分区:

WITH differences AS (
SELECT
    *,
    TIMESTAMPDIFF(SECOND, timestamp, lag(timestamp) OVER (PARTITION BY recipe, step ORDER BY timestamp DESC)) AS difference
FROM recipe_steps
)

这将导致以下结果(仅显示配方 1 的步骤 1):

食谱 时间戳 差异
1 2023-03-07 21:56:25 step_1
1 2023-03-07 21:56:20 step_1 5
1 2023-03-07 21:56:15 step_1 5
1 2023-03-07 21:55:40 step_1 35
1 2023-03-07 21:55:35 step_1 5
1 2023-03-07 21:55:30 step_1 5

这个想法是,对于特定的配方,配方的当前/最新执行从最新一行之后的行开始,差异大于 5 秒:

食谱 时间戳 差异 当前
1 2023-03-07 21:56:25 step_1
1 2023-03-07 21:56:20 step_1 5
1 2023-03-07 21:56:15 step_1 5
1 2023-03-07 21:55:40 step_1 35
1 2023-03-07 21:55:35 step_1 5
1 2023-03-07 21:55:30 step_1 5

经过几个小时的研究,我对如何从“差异”列推导出“当前”列完全不知所措,这就是这个问题的原因。

我尝试了各种从“差异”列派生“当前”列的方法,包括使用 和 窗口函数,但没有成功。row_number()

SQL Databricks

评论

0赞 Community 3/8/2023
请澄清您的具体问题或提供其他详细信息以准确说明您的需求。正如目前所写的那样,很难确切地说出你在问什么。

答:

0赞 Abdulmajeed 3/8/2023 #1

您需要按配方和步骤对行进行分区,然后使用 ROW_NUMBER() 窗口函数为其分区中的每一行分配一个编号,按时间戳降序排序

  SELECT 
recipe, timestamp, step,
CASE
    WHEN ROW_NUMBER() OVER (PARTITION BY recipe, step ORDER BY timestamp DESC) = 1 THEN 'true'
    ELSE 'false'
END AS current

从 recipe_steps ORDER BY 配方,时间戳 DESC;

评论

0赞 idiocache 3/8/2023
这根本行不通。
0赞 Abdulmajeed 3/8/2023
请立即检查
0赞 Tom Boyd 3/8/2023 #2

你正朝着正确的方向前进。

从您的差异 cte 开始,我将 FALSE 分配给> 5 个时间戳。

SELECT *,
       CASE WHEN difference > 5 THEN 'FALSE' ELSE 'TRUE' END AS firstFalse
  FROM diff

使用自连接,我将 FALSE 分配给“firstFalse”之前的每个日期。这将返回所有为 false 的时间戳。

   SELECT f1.recipe,
          f1.step,
          MIN(f2.t_stamp) firstFalse,
          f1.t_stamp,
          'FALSE' AS Curr
     FROM false1 f1
     JOIN false1 f2 ON f1.recipe = f2.recipe AND f1.step = f2.step AND f2.firstFalse = 'FALSE'
    WHERE f1.t_stamp <= f2.t_stamp
 GROUP BY f1.recipe, f1.step, f1.t_stamp

最后,我用左联接把它们放在一起。

SELECT f1.recipe,
       f1.t_stamp AS [timestamp],
       f1.step,
       COALESCE(f2.curr, 'TRUE') AS [current]  
  FROM false1 f1
  LEFT JOIN false2 f2 ON f1.recipe = f2.recipe AND f1.step = f2.step AND f1.t_stamp = f2.t_stamp 
 ORDER BY [current] desc, f1.recipe asc, f1.step asc, f1.t_stamp asc

小提琴的整个过程

WITH diff AS 
  (SELECT recipe,
          t_stamp,
          step,
          DATEDIFF(SECOND, t_stamp, lag(t_stamp) OVER (PARTITION BY recipe, step ORDER BY t_stamp DESC)) AS difference
     FROM recipes
  )
, false1 AS
  (SELECT *,
          CASE WHEN difference > 5 THEN 'FALSE' ELSE 'TRUE' END AS firstFalse
     FROM diff
  )
, false2 AS
  (
   SELECT f1.recipe,
          f1.step,
          MIN(f2.t_stamp) firstFalse,
          f1.t_stamp,
          'FALSE' AS Curr
     FROM false1 f1
     JOIN false1 f2 ON f1.recipe = f2.recipe AND f1.step = f2.step AND f2.firstFalse = 'FALSE'
    WHERE f1.t_stamp <= f2.t_stamp
 GROUP BY f1.recipe, f1.step, f1.t_stamp
  )
  
SELECT f1.recipe,
       f1.t_stamp AS [timestamp],
       f1.step,
       COALESCE(f2.curr, 'TRUE') AS [current]  
  FROM false1 f1
  LEFT JOIN false2 f2 ON f1.recipe = f2.recipe AND f1.step = f2.step AND f1.t_stamp = f2.t_stamp 
 ORDER BY [current] desc, f1.recipe asc, f1.step asc, f1.t_stamp asc
0赞 Junjie 3/8/2023 #3

您可以使用窗口函数和一些匹配技巧通过几个步骤获得所需的结果:

步骤 1。根据您的定义,使用窗口函数计算配方重新启动时的提取行,这意味着它是配方的第 1 步行,并且它与上一步的时间差大于 5 秒。

第 2 步:使用 windwo 函数并标记recipe_restart点来获得recipe_restart_bucket,因为您想按降序获得结果,因此我们做了一个数学技巧来按降序将 sum(#restart) - sum (#seenrestartsofar) 转换为桶索引。

第 3 步:当 bucket_index=0 时添加 current_flag(最近一个按降序排列)

以下是所有步骤的查询

with recipe_steps_time_diff AS (
select
    recipe,
    timestamp,
    step,
    case when step <> 'step_1' or datediff(SECOND, lag(timestamp) over(partition by recipe,step order by timestamp), timestamp) <=5 then 0 else 1 end as recipe_restart 
from
   recipe_steps
),
recipe_steps_restart_bucket AS (
select
    recipe,
    timestamp,
    step,
    sum(recipe_restart) over (partition by recipe) - sum(recipe_restart) over (partition by recipe order by timestamp) as recipe_restart_bucket
from
    recipe_steps_time_diff
)
select
    recipe,
    timestamp,
    step,
    case when recipe_restart_bucket = 0 then 'True' else 'False' end as current_flag   
from
    recipe_steps_restart_bucket
order by recipe_restart_bucket, recipe, timestamp desc
食谱 时间戳 current_flag
1 2023-03-07T21:56:55Z step_3
1 2023-03-07T21:56:50Z step_3
1 2023-03-07T21:56:45Z step_3
1 2023-03-07T21:56:40Z step_2
1 2023-03-07T21:56:35Z step_2
1 2023-03-07T21:56:30Z step_2
1 2023-03-07T21:56:25Z step_1
1 2023-03-07T21:56:20Z step_1
1 2023-03-07T21:56:15Z step_1
2 2023-03-07T21:56:55Z step_3
2 2023-03-07T21:56:50Z step_3
2 2023-03-07T21:56:45Z step_3
2 2023-03-07T21:56:40Z step_2
2 2023-03-07T21:56:35Z step_2
2 2023-03-07T21:56:30Z step_2
2 2023-03-07T21:56:25Z step_1
2 2023-03-07T21:56:20Z step_1
2 2023-03-07T21:56:15Z step_1
1 2023-03-07T21:56:10Z step_3
1 2023-03-07T21:56:05Z step_3
1 2023-03-07T21:56:00Z step_3
1 2023-03-07T21:55:55Z step_2
1 2023-03-07T21:55:50Z step_2
1 2023-03-07T21:55:45Z step_2
1 2023-03-07T21:55:40Z step_1
1 2023-03-07T21:55:35Z step_1
1 2023-03-07T21:55:30Z step_1

评论

0赞 idiocache 3/8/2023
这太棒了,谢谢。我唯一需要修改的是 NULL 步骤的处理(我在最初的问题中没有指定为可能性)。为此,我只需将 NULL 步骤转换为“no_step”,以便求和逻辑可以处理这些步骤。