How to label a pl.Series using two pl.DataFrame not joined with null values?

Asked by Olivier D'Ancona on 10/29/2023 · Last edited by Olivier D'Ancona · Updated 11/1/2023 · Viewed 93 times

Q:

Problem

I have a pl.DataFrame df_signals

┌──────────────┬──────┬─────────────────────────┬──────────┬──────────┐
│ series_id    ┆ step ┆ timestamp               ┆ sig1     ┆ sig2     │
│ ---          ┆ ---  ┆ ---                     ┆ ---      ┆ ---      │
│ str          ┆ u32  ┆ datetime[μs, UTC]       ┆ f32      ┆ f32      │
╞══════════════╪══════╪═════════════════════════╪══════════╪══════════╡
│ 038441c925bb ┆ 0    ┆ 2018-08-14 19:30:00 UTC ┆ 0.550596 ┆ 0.017739 │
│ f8a8de8bdd00 ┆ 0    ┆ 2018-08-14 19:30:00 UTC ┆ 0.220596 ┆ 0.077739 │
│ …            ┆ …    ┆ …                       ┆ …        ┆ …        │
└──────────────┴──────┴─────────────────────────┴──────────┴──────────┘

and another pl.DataFrame df_events


┌──────────────┬───────┬────────┬───────┬─────────────────────────┬────────────┬─────────────┐
│ series_id    ┆ night ┆ event  ┆ step  ┆ timestamp               ┆ onset_step ┆ wakeup_step │
│ ---          ┆ ---   ┆ ---    ┆ ---   ┆ ---                     ┆ ---        ┆ ---         │
│ str          ┆ i64   ┆ str    ┆ u32   ┆ datetime[μs, UTC]       ┆ u32        ┆ u32         │
╞══════════════╪═══════╪════════╪═══════╪═════════════════════════╪════════════╪═════════════╡
│ 038441c925bb ┆ 1     ┆ onset  ┆ 4879  ┆ 2018-08-15 02:26:00 UTC ┆ 4879       ┆ null        │
│ 038441c925bb ┆ 1     ┆ wakeup ┆ 10932 ┆ 2018-08-15 10:41:00 UTC ┆ null       ┆ 10932       │
│ …            ┆ …     ┆ …      ┆ …     ┆ …                       ┆ …          ┆ …           │
└──────────────┴───────┴────────┴───────┴─────────────────────────┴────────────┴─────────────┘

I want to add a new column "state" which should be:

  • 0 if the series_id is awake
  • 1 if the series_id is asleep

It looks easy, but I can't compute it efficiently.

How can I solve this while minimizing computation, since I have a lot of data?


Reproduction

I made some artificial data:

3 series ['A','B','C']

  • A sleeps from
2022-01-01 02:00:00 -> 2022-01-01 14:00:00 | [3-15[
2022-01-01 22:00:00 -> 2022-01-02 08:00:00 | [23-33[
  • B sleeps from
2022-01-01 03:00:00 -> 2022-01-01 15:00:00 | [4-16[
  • C starts sleeping (but never wakes up... => mismatch)
2022-01-01 04:00:00 | [5-?[

import polars as pl

timestamps = ([f"2022-01-01 {hour:02d}:00:00" for hour in range(24)] +
              ["2022-01-02 00:00:00"] +
              [f"2022-01-02 {hour:02d}:00:00" for hour in range(1, 13)])

df_signals = pl.DataFrame({
    "series_id": ["A"] * 37 + ["B"] * 37 + ["C"] * 37,
    "timestamp": timestamps * 3,
    "step": list(range(1, 38)) * 3
})

# Extended events data
df_events = pl.DataFrame({
    "series_id": ["A", "A", "A", "A", "B", "B", "C"],
    "night": [1, 1, 1, 2, 1, 1, 1],
    "event": ["onset", "wakeup", "onset", "wakeup", "onset", "wakeup", "onset"],
    "timestamp": ["2022-01-01 02:00:00", "2022-01-01 14:00:00","2022-01-01 22:00:00", "2022-01-02 08:00:00", "2022-01-01 03:00:00",
                 "2022-01-01 15:00:00", "2022-01-01 04:00:00"],
    "step": [3, 15, 23, 33, 4, 16, 5]
})

Here is what I tried:

df_events = df_events.with_columns(
    onset_step = pl.when(pl.col('event') == 'onset').then(pl.col('step')),
    wakeup_step = pl.when(pl.col('event') == 'wakeup').then(pl.col('step'))
)
df = df_signals.join(df_events, on=['series_id', 'timestamp', 'step'], how='left')
df = df.sort(['series_id', 'step'])
df = df.with_columns(
    onset_step = pl.col('onset_step').forward_fill(),
    wakeup_step = pl.col('wakeup_step').forward_fill()
)
df.with_columns(
    state = (pl.col('step') >= pl.col('onset_step')) & (pl.col('step') <= pl.col('wakeup_step')).fill_null(False)
)

However, I'm not sure how to handle the edge cases... When I use forward_fill(), it breaks at the beginning (the rows before the first onset stay null), and when I use backward_fill(), it breaks at the end (the rows after the last wakeup)...

Expected result

series_id,timestamp,step,state,event
A,2022-01-01T00:00:00.000000,1,0,
A,2022-01-01T01:00:00.000000,2,0,
A,2022-01-01T02:00:00.000000,3,1,onset
A,2022-01-01T03:00:00.000000,4,1,
A,2022-01-01T04:00:00.000000,5,1,
A,2022-01-01T05:00:00.000000,6,1,
A,2022-01-01T06:00:00.000000,7,1,
A,2022-01-01T07:00:00.000000,8,1,
A,2022-01-01T08:00:00.000000,9,1,
A,2022-01-01T09:00:00.000000,10,1,
A,2022-01-01T10:00:00.000000,11,1,
A,2022-01-01T11:00:00.000000,12,1,
A,2022-01-01T12:00:00.000000,13,1,
A,2022-01-01T13:00:00.000000,14,1,
A,2022-01-01T14:00:00.000000,15,0,wakeup
A,2022-01-01T15:00:00.000000,16,0,
A,2022-01-01T16:00:00.000000,17,0,
A,2022-01-01T17:00:00.000000,18,0,
A,2022-01-01T18:00:00.000000,19,0,
A,2022-01-01T19:00:00.000000,20,0,
A,2022-01-01T20:00:00.000000,21,0,
A,2022-01-01T21:00:00.000000,22,0,
A,2022-01-01T22:00:00.000000,23,1,onset
A,2022-01-01T23:00:00.000000,24,1,
A,2022-01-02T00:00:00.000000,25,1,
A,2022-01-02T01:00:00.000000,26,1,
A,2022-01-02T02:00:00.000000,27,1,
A,2022-01-02T03:00:00.000000,28,1,
A,2022-01-02T04:00:00.000000,29,1,
A,2022-01-02T05:00:00.000000,30,1,
A,2022-01-02T06:00:00.000000,31,1,
A,2022-01-02T07:00:00.000000,32,1,
A,2022-01-02T08:00:00.000000,33,0,wakeup
A,2022-01-02T09:00:00.000000,34,0,
A,2022-01-02T10:00:00.000000,35,0,
A,2022-01-02T11:00:00.000000,36,0,
A,2022-01-02T12:00:00.000000,37,0,
B,2022-01-01T00:00:00.000000,1,0,
B,2022-01-01T01:00:00.000000,2,0,
B,2022-01-01T02:00:00.000000,3,0,
B,2022-01-01T03:00:00.000000,4,1,onset
B,2022-01-01T04:00:00.000000,5,1,
B,2022-01-01T05:00:00.000000,6,1,
B,2022-01-01T06:00:00.000000,7,1,
B,2022-01-01T07:00:00.000000,8,1,
B,2022-01-01T08:00:00.000000,9,1,
B,2022-01-01T09:00:00.000000,10,1,
B,2022-01-01T10:00:00.000000,11,1,
B,2022-01-01T11:00:00.000000,12,1,
B,2022-01-01T12:00:00.000000,13,1,
B,2022-01-01T13:00:00.000000,14,1,
B,2022-01-01T14:00:00.000000,15,1,
B,2022-01-01T15:00:00.000000,16,0,wakeup
B,2022-01-01T16:00:00.000000,17,0,
B,2022-01-01T17:00:00.000000,18,0,
B,2022-01-01T18:00:00.000000,19,0,
B,2022-01-01T19:00:00.000000,20,0,
B,2022-01-01T20:00:00.000000,21,0,
B,2022-01-01T21:00:00.000000,22,0,
B,2022-01-01T22:00:00.000000,23,0,
B,2022-01-01T23:00:00.000000,24,0,
B,2022-01-02T00:00:00.000000,25,0,
B,2022-01-02T01:00:00.000000,26,0,
B,2022-01-02T02:00:00.000000,27,0,
B,2022-01-02T03:00:00.000000,28,0,
B,2022-01-02T04:00:00.000000,29,0,
B,2022-01-02T05:00:00.000000,30,0,
B,2022-01-02T06:00:00.000000,31,0,
B,2022-01-02T07:00:00.000000,32,0,
B,2022-01-02T08:00:00.000000,33,0,
B,2022-01-02T09:00:00.000000,34,0,
B,2022-01-02T10:00:00.000000,35,0,
B,2022-01-02T11:00:00.000000,36,0,
B,2022-01-02T12:00:00.000000,37,0,
C,2022-01-01T00:00:00.000000,1,0,
C,2022-01-01T01:00:00.000000,2,0,
C,2022-01-01T02:00:00.000000,3,0,
C,2022-01-01T03:00:00.000000,4,0,
C,2022-01-01T04:00:00.000000,5,0,onset
C,2022-01-01T05:00:00.000000,6,0,
C,2022-01-01T06:00:00.000000,7,0,
C,2022-01-01T07:00:00.000000,8,0,
C,2022-01-01T08:00:00.000000,9,0,
C,2022-01-01T09:00:00.000000,10,0,
C,2022-01-01T10:00:00.000000,11,0,
C,2022-01-01T11:00:00.000000,12,0,
C,2022-01-01T12:00:00.000000,13,0,
C,2022-01-01T13:00:00.000000,14,0,
C,2022-01-01T14:00:00.000000,15,0,
C,2022-01-01T15:00:00.000000,16,0,
C,2022-01-01T16:00:00.000000,17,0,
C,2022-01-01T17:00:00.000000,18,0,
C,2022-01-01T18:00:00.000000,19,0,
C,2022-01-01T19:00:00.000000,20,0,
C,2022-01-01T20:00:00.000000,21,0,
C,2022-01-01T21:00:00.000000,22,0,
C,2022-01-01T22:00:00.000000,23,0,
C,2022-01-01T23:00:00.000000,24,0,
C,2022-01-02T00:00:00.000000,25,0,
C,2022-01-02T01:00:00.000000,26,0,
C,2022-01-02T02:00:00.000000,27,0,
C,2022-01-02T03:00:00.000000,28,0,
C,2022-01-02T04:00:00.000000,29,0,
C,2022-01-02T05:00:00.000000,30,0,
C,2022-01-02T06:00:00.000000,31,0,
C,2022-01-02T07:00:00.000000,32,0,
C,2022-01-02T08:00:00.000000,33,0,
C,2022-01-02T09:00:00.000000,34,0,
C,2022-01-02T10:00:00.000000,35,0,
C,2022-01-02T11:00:00.000000,36,0,
C,2022-01-02T12:00:00.000000,37,0,

dataframe datetime python-polars

Comments

1 · jqurious · 10/29/2023
Looks like this may be a .join_asof type of problem: stackoverflow.com/questions/73176563 - if you could give a runnable code example, as in that question, and show the actual desired end result, it would be much easier to help.
0 · Hericks · 10/30/2023
It would be helpful to have the expected result (hard-coded) in the MRE.
0 · jqurious · 11/1/2023
Thanks for the update. A sleeps 2022-01-01 02:00:00 -> 2022-01-01 14:00:00, but the output state column is 1 (awake?) - am I misunderstanding the description? The output intervals for A are e.g. [[3, 15], [16, 22], [23, 33]] - is onset -> wakeup "closed" and wakeup -> onset "open"? Are the events guaranteed to be ordered/paired, i.e. if a wakeup exists, does it always come directly after the preceding onset event? Are the steps always contiguous, with no gaps?
0 · Olivier D'Ancona · 11/1/2023
Hi, I updated the expected result to make it more concrete. I had inverted wakeup and sleep and fixed it accordingly. There may be some mismatches... onset only or wakeup only... the simpler approach is to trim them before processing. The rest of the data is consistent. The steps are contiguous, with no gaps.

Answers:

1 · Dean MacGregor · 10/30/2023 · #1

This is indeed the type of problem that join_asof is meant to handle.

You just need to convert the timestamp column to a Datetime. Judging from the forward and backward fills you tried, it doesn't seem like step should be used in the join.

df_signals=df_signals.with_columns(
    timestamp=pl.col('timestamp').str.strptime(pl.Datetime,"%Y-%m-%d %H:%M:%S")
    )
df_events=df_events.with_columns(
    timestamp=pl.col('timestamp').str.strptime(pl.Datetime,"%Y-%m-%d %H:%M:%S")
    )

(
    df_signals
    .join_asof(
        df_events.drop('step'), on='timestamp', by='series_id'
        )
    .with_columns(
        state=(pl.col('event')=='wakeup').cast(pl.Int32)
        )
)

This assumes that "wakeup" means awake and anything else means asleep. You'd have to adjust that boolean expression to cover extra cases. It also assumes there is no default, which is to say that if there are some times in the signals before any event has happened, they will be null. Changing the default is also just a matter of adjusting the bool expression.
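
For instance, to match the question's convention (1 = asleep, 0 = awake, with rows before any event defaulting to awake), that expression could be adjusted roughly as follows - a sketch only, which also assumes the unmatched onset of series C has been trimmed beforehand, as discussed in the comments:

(
    df_signals
    .join_asof(
        df_events.drop('step'), on='timestamp', by='series_id'
        )
    .with_columns(
        # sketch: 1 while asleep (the last event seen was 'onset'), 0 otherwise;
        # rows before any event are null and default to awake (0)
        state=(pl.col('event') == 'onset').fill_null(False).cast(pl.Int32)
        )
)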

1 · jqurious · 11/1/2023 · #2

If you can .concat() the two datasets and .sort() them with nulls_last=True:

df = (
   pl.concat(
       [
          df_signals,
          df_events
       ], 
       how = "diagonal"
   )
   .sort("series_id", "timestamp", "night", nulls_last=True)
)
pl.Config(tbl_rows = 16)
df.head(16)
shape: (16, 5)
┌───────────┬─────────────────────┬──────┬───────┬────────┐
│ series_id ┆ timestamp           ┆ step ┆ night ┆ event  │
│ ---       ┆ ---                 ┆ ---  ┆ ---   ┆ ---    │
│ str       ┆ datetime[μs]        ┆ i64  ┆ i64   ┆ str    │
╞═══════════╪═════════════════════╪══════╪═══════╪════════╡
│ A         ┆ 2022-01-01 00:00:00 ┆ 1    ┆ null  ┆ null   │
│ A         ┆ 2022-01-01 01:00:00 ┆ 2    ┆ null  ┆ null   │
│ A         ┆ 2022-01-01 02:00:00 ┆ 3    ┆ 1     ┆ onset  │
│ A         ┆ 2022-01-01 02:00:00 ┆ 3    ┆ null  ┆ null   │
│ A         ┆ 2022-01-01 03:00:00 ┆ 4    ┆ null  ┆ null   │
│ A         ┆ 2022-01-01 04:00:00 ┆ 5    ┆ null  ┆ null   │
│ A         ┆ 2022-01-01 05:00:00 ┆ 6    ┆ null  ┆ null   │
│ A         ┆ 2022-01-01 06:00:00 ┆ 7    ┆ null  ┆ null   │
│ A         ┆ 2022-01-01 07:00:00 ┆ 8    ┆ null  ┆ null   │
│ A         ┆ 2022-01-01 08:00:00 ┆ 9    ┆ null  ┆ null   │
│ A         ┆ 2022-01-01 09:00:00 ┆ 10   ┆ null  ┆ null   │
│ A         ┆ 2022-01-01 10:00:00 ┆ 11   ┆ null  ┆ null   │
│ A         ┆ 2022-01-01 11:00:00 ┆ 12   ┆ null  ┆ null   │
│ A         ┆ 2022-01-01 12:00:00 ┆ 13   ┆ null  ┆ null   │
│ A         ┆ 2022-01-01 13:00:00 ┆ 14   ┆ null  ┆ null   │
│ A         ┆ 2022-01-01 14:00:00 ┆ 15   ┆ 1     ┆ wakeup │
└───────────┴─────────────────────┴──────┴───────┴────────┘

We can then .backward_fill() the event column and remove the night = 1 rows.

df = (
   df.with_columns(
      pl.col("event").forward_fill(limit=1), # take event from `night = 1` rows
      pl.col("event").backward_fill().over("series_id")
        .map_dict({"onset": 0, "wakeup": 1})
        .fill_null(0)
        .alias("state")
     )
     .filter(night = None)
)

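# "expected" below refers to the question's hard-coded expected result, loaded as a DataFrame (not shown)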
df.select(expected.columns).frame_equal(expected)
# True

Comments

0 · Olivier D'Ancona · 11/9/2023
Thanks, this works on my dataset. Your solution runs in 27.4 seconds, whereas the other one with join_asof takes 5 seconds.
0 · jqurious · 11/9/2023
@OlivierD'Ancona: Thanks for the update. Yes, but the two solutions produce different results. The reason I posted this approach is that it produces the expected output. You would need to add extra steps before comparing their runtimes.
0 · Olivier D'Ancona · 11/9/2023
Indeed, I was able to pre-filter the mismatched events using the previous method.
0 · jqurious · 11/9/2023
Overall, the .join_asof starting point may still lead to a faster approach - I didn't explore it further. I just noticed that when I combined all the data this way, there wasn't much left to do with fills to get the final output.