Asked by: Petr Petrov · Asked: 11/16/2023 · Updated: 11/16/2023 · Views: 45
Pandas: count unique users using rolling window
Q:
I have a DataFrame with time and user_id columns:
time user_id
2023-02-20 00:00:20 5008662006351712
2023-02-20 00:01:25 5008662006474892
2023-02-20 00:04:28 5008662006889403
2023-02-20 00:05:33 5008662006351712
2023-02-20 00:07:36 5008662004944382
2023-02-20 00:08:37 5008662006760417
2023-02-20 00:09:38 5008662004941892
2023-02-20 00:11:40 5008662006810617
2023-02-20 00:14:50 5008662006936927
2023-02-20 00:15:52 5008662005514572
2023-02-20 00:16:58 5008662004874462
2023-02-20 00:17:01 5008662006937193
2023-02-20 00:17:05 5008662006914843
2023-02-20 00:18:05 5008662006871041
2023-02-20 00:19:06 5008662006478082
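I want to count the number of unique users in each window of window_size * "5T".
The problem is that I can resample this data to "5T", but I can't use rolling afterwards, because it only works with numeric data: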
window_size = 2
df = df.resample("5T", label='right', on='time').apply(lambda x: list(set(x))).reset_index()
df = df.rolling(window=window_size, min_periods=window_size, center=False).nunique()
After resample, the data looks like this:
time user_id
2023-02-20 00:05:00 [5008662006351712, 5008662006889403, 500866200...
2023-02-20 00:10:00 [5008662004941892, 5008662006760417, 500866200...
2023-02-20 00:15:00 [5008662006810617, 5008662006936927]
2023-02-20 00:20:00 [5008662006871041, 5008662006937193, 500866200...
How should I change my code to get this output?
time user_id
2023-02-20 00:05:00 NaN
2023-02-20 00:10:00 6
2023-02-20 00:15:00 6
2023-02-20 00:20:00 8
Answers:
0 votes
Suraj Shourie
11/16/2023
#1
IIUC, if you just want to compute the rolling count, you don't need the list(set(_)) operation. See the code below:
df2 = df.resample("5T", label='right').apply(lambda x: len(set(x))).reset_index()
df2['count'] = df2.rolling(2)['user_id'].sum()
print(df2)
Output:
time user_id count
0 2023-02-20 00:05:00 3 NaN
1 2023-02-20 00:10:00 4 7.0
2 2023-02-20 00:15:00 2 6.0
3 2023-02-20 00:20:00 6 8.0
Comments
0 votes
piotre10
11/16/2023
It doesn't work: if a user appears in both the first and the second 5-minute window, it counts that user twice.
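To make the double counting concrete, here is a minimal sketch in plain Python that uses the first two 5-minute buckets from the question's sample data. The variable names are made up for illustration. The user ending in 351712 appears in both buckets, so adding the per-bucket counts gives 7, while the union of the two buckets has 6 members, which is the value the question expects for 00:10:00.

```python
# User IDs per 5-minute bucket, copied from the question's sample data.
bucket_0005 = {5008662006351712, 5008662006474892, 5008662006889403}
bucket_0010 = {
    5008662006351712,  # also present in the previous bucket
    5008662004944382,
    5008662006760417,
    5008662004941892,
}

# Summing per-bucket unique counts counts the repeated user twice.
summed = len(bucket_0005) + len(bucket_0010)

# Taking the union first counts every user exactly once.
unique = len(bucket_0005 | bucket_0010)

print(summed, unique)  # 7 6
```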
2 votes
Andrej Kesely
11/16/2023
#2
Try:
df["user_id"] = (
df.reset_index()
.rolling(window=window_size, min_periods=window_size, center=False)["index"]
.apply(lambda i: len(set(df.loc[i, "user_id"].sum())))
)
print(df)
Prints:
time user_id
0 2023-02-20 00:05:00 NaN
1 2023-02-20 00:10:00 6.0
2 2023-02-20 00:15:00 6.0
3 2023-02-20 00:20:00 8.0
Comments
0 votes
Petr Petrov
11/17/2023
It returns an error: TypeError: operands could not be broadcast together with shapes (3,) (4,)
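The shapes (3,) and (4,) match the sizes of the first two buckets, so one plausible cause (an assumption, since the commenter's exact code is not shown) is that the user_id cells hold NumPy arrays rather than Python lists, in which case .sum() attempts element-wise addition instead of concatenation. The sketch below sidesteps that by keeping one Python set per bucket and taking the union over each window. The function name rolling_unique_users is hypothetical, the frequency is written as "5min" (the same interval as "5T"), and the sketch assumes a fixed-width frequency and a datetime time column.

```python
import pandas as pd


def rolling_unique_users(df, freq="5min", window_size=2):
    """Count distinct user_id values over `window_size` consecutive buckets."""
    # Label each row with the right edge of its bucket, e.g. 00:04:28 -> 00:05:00.
    labels = df["time"].dt.floor(freq) + pd.Timedelta(freq)

    # Collect one Python set of user IDs per bucket label.
    per_bucket = {}
    for label, user in zip(labels, df["user_id"]):
        per_bucket.setdefault(label, set()).add(user)

    # Include buckets with no rows so that windows stay contiguous in time.
    index = pd.date_range(labels.min(), labels.max(), freq=freq)
    sets = [per_bucket.get(label, set()) for label in index]

    counts = []
    for end in range(len(sets)):
        start = end - window_size + 1
        if start < 0:
            counts.append(float("nan"))  # not enough buckets yet
        else:
            counts.append(len(set().union(*sets[start:end + 1])))
    return pd.Series(counts, index=index, name="user_id")


# Sample data from the question.
rows = [
    ("2023-02-20 00:00:20", 5008662006351712),
    ("2023-02-20 00:01:25", 5008662006474892),
    ("2023-02-20 00:04:28", 5008662006889403),
    ("2023-02-20 00:05:33", 5008662006351712),
    ("2023-02-20 00:07:36", 5008662004944382),
    ("2023-02-20 00:08:37", 5008662006760417),
    ("2023-02-20 00:09:38", 5008662004941892),
    ("2023-02-20 00:11:40", 5008662006810617),
    ("2023-02-20 00:14:50", 5008662006936927),
    ("2023-02-20 00:15:52", 5008662005514572),
    ("2023-02-20 00:16:58", 5008662004874462),
    ("2023-02-20 00:17:01", 5008662006937193),
    ("2023-02-20 00:17:05", 5008662006914843),
    ("2023-02-20 00:18:05", 5008662006871041),
    ("2023-02-20 00:19:06", 5008662006478082),
]
df = pd.DataFrame(rows, columns=["time", "user_id"])
df["time"] = pd.to_datetime(df["time"])

result = rolling_unique_users(df, freq="5min", window_size=2)
print(result)
```

On the sample data this gives NaN, 6, 6, 8 for the buckets labelled 00:05:00 through 00:20:00, matching the output requested in the question. The explicit loop trades speed for clarity; for large inputs a vectorized approach may be preferable.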