Asked by Olivier D'Ancona on 10/29/2023 · Last edited by Olivier D'Ancona · Updated 11/1/2023 · Views: 93
How to label a pl.Series using two pl.DataFrame not joined with null values?
Question
I have a pl.DataFrame df_signals:
┌──────────────┬──────┬─────────────────────────┬──────────┬──────────┐
│ series_id ┆ step ┆ timestamp ┆ sig1 ┆ sig2 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ datetime[μs, UTC] ┆ f32 ┆ f32 │
╞══════════════╪══════╪═════════════════════════╪══════════╪══════════╡
│ 038441c925bb ┆ 0 ┆ 2018-08-14 19:30:00 UTC ┆ 0.550596 ┆ 0.017739 │
│ f8a8de8bdd00 ┆ 0 ┆ 2018-08-14 19:30:00 UTC ┆ 0.220596 ┆ 0.077739 │
│ … ┆ … ┆ … ┆ … ┆ … │
└──────────────┴──────┴─────────────────────────┴──────────┴──────────┘
and another pl.DataFrame df_events:
┌──────────────┬───────┬────────┬───────┬─────────────────────────┬────────────┬─────────────┐
│ series_id ┆ night ┆ event ┆ step ┆ timestamp ┆ onset_step ┆ wakeup_step │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ u32 ┆ datetime[μs, UTC] ┆ u32 ┆ u32 │
╞══════════════╪═══════╪════════╪═══════╪═════════════════════════╪════════════╪═════════════╡
│ 038441c925bb ┆ 1 ┆ onset ┆ 4879 ┆ 2018-08-15 02:26:00 UTC ┆ 4879 ┆ null │
│ 038441c925bb ┆ 1 ┆ wakeup ┆ 10932 ┆ 2018-08-15 10:41:00 UTC ┆ null ┆ 10932 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
└──────────────┴───────┴────────┴───────┴─────────────────────────┴────────────┴─────────────┘
I want to add a new column "state" that should be:
- 0 if the series_id is awake
- 1 if the series_id is asleep

This looks easy, but I can't compute it efficiently. How can I solve this while minimizing computation, since I have a lot of data?
Reproduction

I made artificial data: 3 series ['A', 'B', 'C']
- A sleeps from
2022-01-01 02:00:00 -> 2022-01-01 14:00:00 | [3-15[
2022-01-01 22:00:00 -> 2022-01-02 08:00:00 | [23-33[
- B sleeps from
2022-01-01 03:00:00 -> 2022-01-01 15:00:00 | [4-16[
- C starts sleeping (but never wakes up... => unmatched)
2022-01-01 04:00:00 | [5-?[
import polars as pl

timestamps = ([f"2022-01-01 {hour:02d}:00:00" for hour in range(24)] +
["2022-01-02 00:00:00"] +
[f"2022-01-02 {hour:02d}:00:00" for hour in range(1, 13)])
df_signals = pl.DataFrame({
"series_id": ["A"] * 37 + ["B"] * 37 + ["C"] * 37,
"timestamp": timestamps * 3,
"step": list(range(1, 38)) * 3
})
# Extended events data
df_events = pl.DataFrame({
"series_id": ["A", "A", "A", "A", "B", "B", "C"],
"night": [1, 1, 1, 2, 1, 1, 1],
"event": ["onset", "wakeup", "onset", "wakeup", "onset", "wakeup", "onset"],
"timestamp": ["2022-01-01 02:00:00", "2022-01-01 14:00:00","2022-01-01 22:00:00", "2022-01-02 08:00:00", "2022-01-01 03:00:00",
"2022-01-01 15:00:00", "2022-01-01 04:00:00"],
"step": [3, 15, 23, 33, 4, 16, 5]
})
Here is what I tried:
df_events = df_events.with_columns(
onset_step = pl.when(pl.col('event') == 'onset').then(pl.col('step')),
wakeup_step = pl.when(pl.col('event') == 'wakeup').then(pl.col('step'))
)
df = df_signals.join(df_events, on=['series_id', 'timestamp', 'step'], how='left')
df = df.sort(['series_id', 'step'])
df = df.with_columns(
onset_step = pl.col('onset_step').forward_fill(),
wakeup_step = pl.col('wakeup_step').forward_fill()
)
df.with_columns(
state = (pl.col('step') >= pl.col('onset_step')) & (pl.col('step') <= pl.col('wakeup_step')).fill_null(False)
)
However, I'm not sure how to handle the edge cases... When I use forward_fill() it breaks at the start, and when I use backward_fill() it breaks at the end...
Expected result
series_id,timestamp,step,state,event
A,2022-01-01T00:00:00.000000,1,0,
A,2022-01-01T01:00:00.000000,2,0,
A,2022-01-01T02:00:00.000000,3,1,onset
A,2022-01-01T03:00:00.000000,4,1,
A,2022-01-01T04:00:00.000000,5,1,
A,2022-01-01T05:00:00.000000,6,1,
A,2022-01-01T06:00:00.000000,7,1,
A,2022-01-01T07:00:00.000000,8,1,
A,2022-01-01T08:00:00.000000,9,1,
A,2022-01-01T09:00:00.000000,10,1,
A,2022-01-01T10:00:00.000000,11,1,
A,2022-01-01T11:00:00.000000,12,1,
A,2022-01-01T12:00:00.000000,13,1,
A,2022-01-01T13:00:00.000000,14,1,
A,2022-01-01T14:00:00.000000,15,0,wakeup
A,2022-01-01T15:00:00.000000,16,0,
A,2022-01-01T16:00:00.000000,17,0,
A,2022-01-01T17:00:00.000000,18,0,
A,2022-01-01T18:00:00.000000,19,0,
A,2022-01-01T19:00:00.000000,20,0,
A,2022-01-01T20:00:00.000000,21,0,
A,2022-01-01T21:00:00.000000,22,0,
A,2022-01-01T22:00:00.000000,23,1,onset
A,2022-01-01T23:00:00.000000,24,1,
A,2022-01-02T00:00:00.000000,25,1,
A,2022-01-02T01:00:00.000000,26,1,
A,2022-01-02T02:00:00.000000,27,1,
A,2022-01-02T03:00:00.000000,28,1,
A,2022-01-02T04:00:00.000000,29,1,
A,2022-01-02T05:00:00.000000,30,1,
A,2022-01-02T06:00:00.000000,31,1,
A,2022-01-02T07:00:00.000000,32,1,
A,2022-01-02T08:00:00.000000,33,0,wakeup
A,2022-01-02T09:00:00.000000,34,0,
A,2022-01-02T10:00:00.000000,35,0,
A,2022-01-02T11:00:00.000000,36,0,
A,2022-01-02T12:00:00.000000,37,0,
B,2022-01-01T00:00:00.000000,1,0,
B,2022-01-01T01:00:00.000000,2,0,
B,2022-01-01T02:00:00.000000,3,0,
B,2022-01-01T03:00:00.000000,4,1,onset
B,2022-01-01T04:00:00.000000,5,1,
B,2022-01-01T05:00:00.000000,6,1,
B,2022-01-01T06:00:00.000000,7,1,
B,2022-01-01T07:00:00.000000,8,1,
B,2022-01-01T08:00:00.000000,9,1,
B,2022-01-01T09:00:00.000000,10,1,
B,2022-01-01T10:00:00.000000,11,1,
B,2022-01-01T11:00:00.000000,12,1,
B,2022-01-01T12:00:00.000000,13,1,
B,2022-01-01T13:00:00.000000,14,1,
B,2022-01-01T14:00:00.000000,15,1,
B,2022-01-01T15:00:00.000000,16,0,wakeup
B,2022-01-01T16:00:00.000000,17,0,
B,2022-01-01T17:00:00.000000,18,0,
B,2022-01-01T18:00:00.000000,19,0,
B,2022-01-01T19:00:00.000000,20,0,
B,2022-01-01T20:00:00.000000,21,0,
B,2022-01-01T21:00:00.000000,22,0,
B,2022-01-01T22:00:00.000000,23,0,
B,2022-01-01T23:00:00.000000,24,0,
B,2022-01-02T00:00:00.000000,25,0,
B,2022-01-02T01:00:00.000000,26,0,
B,2022-01-02T02:00:00.000000,27,0,
B,2022-01-02T03:00:00.000000,28,0,
B,2022-01-02T04:00:00.000000,29,0,
B,2022-01-02T05:00:00.000000,30,0,
B,2022-01-02T06:00:00.000000,31,0,
B,2022-01-02T07:00:00.000000,32,0,
B,2022-01-02T08:00:00.000000,33,0,
B,2022-01-02T09:00:00.000000,34,0,
B,2022-01-02T10:00:00.000000,35,0,
B,2022-01-02T11:00:00.000000,36,0,
B,2022-01-02T12:00:00.000000,37,0,
C,2022-01-01T00:00:00.000000,1,0,
C,2022-01-01T01:00:00.000000,2,0,
C,2022-01-01T02:00:00.000000,3,0,
C,2022-01-01T03:00:00.000000,4,0,
C,2022-01-01T04:00:00.000000,5,0,onset
C,2022-01-01T05:00:00.000000,6,0,
C,2022-01-01T06:00:00.000000,7,0,
C,2022-01-01T07:00:00.000000,8,0,
C,2022-01-01T08:00:00.000000,9,0,
C,2022-01-01T09:00:00.000000,10,0,
C,2022-01-01T10:00:00.000000,11,0,
C,2022-01-01T11:00:00.000000,12,0,
C,2022-01-01T12:00:00.000000,13,0,
C,2022-01-01T13:00:00.000000,14,0,
C,2022-01-01T14:00:00.000000,15,0,
C,2022-01-01T15:00:00.000000,16,0,
C,2022-01-01T16:00:00.000000,17,0,
C,2022-01-01T17:00:00.000000,18,0,
C,2022-01-01T18:00:00.000000,19,0,
C,2022-01-01T19:00:00.000000,20,0,
C,2022-01-01T20:00:00.000000,21,0,
C,2022-01-01T21:00:00.000000,22,0,
C,2022-01-01T22:00:00.000000,23,0,
C,2022-01-01T23:00:00.000000,24,0,
C,2022-01-02T00:00:00.000000,25,0,
C,2022-01-02T01:00:00.000000,26,0,
C,2022-01-02T02:00:00.000000,27,0,
C,2022-01-02T03:00:00.000000,28,0,
C,2022-01-02T04:00:00.000000,29,0,
C,2022-01-02T05:00:00.000000,30,0,
C,2022-01-02T06:00:00.000000,31,0,
C,2022-01-02T07:00:00.000000,32,0,
C,2022-01-02T08:00:00.000000,33,0,
C,2022-01-02T09:00:00.000000,34,0,
C,2022-01-02T10:00:00.000000,35,0,
C,2022-01-02T11:00:00.000000,36,0,
C,2022-01-02T12:00:00.000000,37,0,
Answers:

1 upvote
Dean MacGregor
10/30/2023
#1
This is really the type of problem join_asof is meant to handle. You just need to convert the timestamp column to a Datetime; judging from the forward and backward fills you tried, it seems step shouldn't be used in the join.
df_signals=df_signals.with_columns(
timestamp=pl.col('timestamp').str.strptime(pl.Datetime,"%Y-%m-%d %H:%M:%S")
)
df_events=df_events.with_columns(
timestamp=pl.col('timestamp').str.strptime(pl.Datetime,"%Y-%m-%d %H:%M:%S")
)
(
df_signals
.join_asof(
df_events.drop('step'), on='timestamp', by='series_id'
)
.with_columns(
state=(pl.col('event')=='wakeup').cast(pl.Int32)
)
)
This assumes that "wakeup" means awake and anything else means asleep. You'd have to adjust that boolean expression for additional cases. It also assumes there is no default, meaning that if there are times in the signals before any event, they will be null. Changing the default is likewise just a matter of adjusting the bool expression.
1 upvote
jqurious
11/1/2023
#2
If it's possible to concat both datasets and .sort() them with nulls_last=True:
df = (
pl.concat(
[
df_signals,
df_events
],
how = "diagonal"
)
.sort("series_id", "timestamp", "night", nulls_last=True)
)
pl.Config(tbl_rows = 16)
df.head(16)
shape: (16, 5)
┌───────────┬─────────────────────┬──────┬───────┬────────┐
│ series_id ┆ timestamp ┆ step ┆ night ┆ event │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ datetime[μs] ┆ i64 ┆ i64 ┆ str │
╞═══════════╪═════════════════════╪══════╪═══════╪════════╡
│ A ┆ 2022-01-01 00:00:00 ┆ 1 ┆ null ┆ null │
│ A ┆ 2022-01-01 01:00:00 ┆ 2 ┆ null ┆ null │
│ A ┆ 2022-01-01 02:00:00 ┆ 3 ┆ 1 ┆ onset │
│ A ┆ 2022-01-01 02:00:00 ┆ 3 ┆ null ┆ null │
│ A ┆ 2022-01-01 03:00:00 ┆ 4 ┆ null ┆ null │
│ A ┆ 2022-01-01 04:00:00 ┆ 5 ┆ null ┆ null │
│ A ┆ 2022-01-01 05:00:00 ┆ 6 ┆ null ┆ null │
│ A ┆ 2022-01-01 06:00:00 ┆ 7 ┆ null ┆ null │
│ A ┆ 2022-01-01 07:00:00 ┆ 8 ┆ null ┆ null │
│ A ┆ 2022-01-01 08:00:00 ┆ 9 ┆ null ┆ null │
│ A ┆ 2022-01-01 09:00:00 ┆ 10 ┆ null ┆ null │
│ A ┆ 2022-01-01 10:00:00 ┆ 11 ┆ null ┆ null │
│ A ┆ 2022-01-01 11:00:00 ┆ 12 ┆ null ┆ null │
│ A ┆ 2022-01-01 12:00:00 ┆ 13 ┆ null ┆ null │
│ A ┆ 2022-01-01 13:00:00 ┆ 14 ┆ null ┆ null │
│ A ┆ 2022-01-01 14:00:00 ┆ 15 ┆ 1 ┆ wakeup │
└───────────┴─────────────────────┴──────┴───────┴────────┘
We can then .backward_fill() the event column and remove the night = 1 (event) rows.
df = (
df.with_columns(
pl.col("event").forward_fill(limit=1), # take event from `night = 1` rows
pl.col("event").backward_fill().over("series_id")
.map_dict({"onset": 0, "wakeup": 1})
.fill_null(0)
.alias("state")
)
.filter(night = None)
)
df.select(expected.columns).frame_equal(expected)
# True
Comments
0 upvotes
Olivier D'Ancona
11/9/2023
Thanks, this works on my dataset. Your solution runs in 27.4 seconds, whereas the other one with join_asof takes 5 seconds.
0 upvotes
jqurious
11/9/2023
@OlivierD'Ancona: Thanks for the update. Yes, but the two solutions produce different results. The reason I posted this approach is that it produces the expected output. You'd need to add the extra steps before comparing their runtimes.
0 upvotes
Olivier D'Ancona
11/9/2023
Indeed, I was able to pre-filter the unmatched events beforehand using the previous approach.
0 upvotes
jqurious
11/9/2023
Overall, the .join_asof starting point could still lead to a faster approach - I didn't explore it further. I just noticed that when I combined all the data this way, there wasn't much left to do with fills to get the final output.