Pandas: count unique users using rolling window

Asked by: Petr Petrov  Asked: 11/16/2023  Updated: 11/16/2023  Views: 45

Q:

I have a dataframe with time and user_id columns:

              time           user_id
2023-02-20 00:00:20  5008662006351712
2023-02-20 00:01:25  5008662006474892
2023-02-20 00:04:28  5008662006889403
2023-02-20 00:05:33  5008662006351712
2023-02-20 00:07:36  5008662004944382
2023-02-20 00:08:37  5008662006760417
2023-02-20 00:09:38  5008662004941892
2023-02-20 00:11:40  5008662006810617
2023-02-20 00:14:50  5008662006936927
2023-02-20 00:15:52  5008662005514572
2023-02-20 00:16:58  5008662004874462
2023-02-20 00:17:01  5008662006937193
2023-02-20 00:17:05  5008662006914843
2023-02-20 00:18:05  5008662006871041
2023-02-20 00:19:06  5008662006478082

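I want to count the unique users in each window of size window_size * "5T". The problem is that I can resample this data to "5T", but I can't use rolling afterwards, because it only works with numeric data: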

window_size = 2
df = df.resample("5T", label='right', on='time').apply(lambda x: list(set(x))).reset_index()
df = df.rolling(window=window_size, min_periods=window_size, center=False).nunique()

After resample, the data looks like this:

             time    user_id
2023-02-20 00:05:00 [5008662006351712, 5008662006889403, 500866200...
2023-02-20 00:10:00 [5008662004941892, 5008662006760417, 500866200...
2023-02-20 00:15:00 [5008662006810617, 5008662006936927]
2023-02-20 00:20:00 [5008662006871041, 5008662006937193, 500866200...

How should I change my code to get this output?

              time  user_id
2023-02-20 00:05:00 NaN
2023-02-20 00:10:00 6
2023-02-20 00:15:00 6
2023-02-20 00:20:00 8
Tags: python, pandas

A:

0 votes Suraj Shourie 11/16/2023 #1

IIUC, if you just want to calculate a rolling count, you don't need the list(set(_)) operation. See the code below:

df2 = df.resample("5T", label='right').apply(lambda x: len(set(x))).reset_index()
df2['count'] = df2.rolling(2)['user_id'].sum()
print(df2)

Output:

                 time  user_id  count
0 2023-02-20 00:05:00        3    NaN
1 2023-02-20 00:10:00        4    7.0
2 2023-02-20 00:15:00        2    6.0
3 2023-02-20 00:20:00        6    8.0

Comments

0 votes piotre10 11/16/2023
It doesn't work, because a user who appears in both the first and the second 5-minute window is counted twice.
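
To illustrate the double counting, here is a minimal sketch using the user ids that fall into the first two 5-minute bins of the question's sample data. The variable names are illustrative only. User 5008662006351712 appears in both bins, so adding the per-bin unique counts gives 7, while the number of distinct users across the two bins is 6.

```python
# User ids in the first two 5-minute bins of the sample data
# (variable names are hypothetical, chosen for this illustration).
first_bin = {5008662006351712, 5008662006474892, 5008662006889403}
second_bin = {
    5008662006351712,  # also present in first_bin
    5008662004944382,
    5008662006760417,
    5008662004941892,
}

# Adding per-bin unique counts counts the repeated user twice.
summed = len(first_bin) + len(second_bin)

# Taking the union first counts each user once.
distinct = len(first_bin | second_bin)

print(summed, distinct)  # 7 6
```
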
2 votes Andrej Kesely 11/16/2023 #2

Try:

df["user_id"] = (
    df.reset_index()
    .rolling(window=window_size, min_periods=window_size, center=False)["index"]
    .apply(lambda i: len(set(df.loc[i, "user_id"].sum())))
)

print(df)

Prints:

                 time  user_id
0 2023-02-20 00:05:00      NaN
1 2023-02-20 00:10:00      6.0
2 2023-02-20 00:15:00      6.0
3 2023-02-20 00:20:00      8.0

Comments

0 votes Petr Petrov 11/17/2023
It returns an error: TypeError: operands could not be broadcast together with shapes (3,) (4,)
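
The broadcast error would be consistent with the resampled cells holding NumPy arrays rather than Python lists, since adding arrays of length 3 and 4 fails. That is an assumption, because the thread does not show the element type. One way to sidestep the issue is to keep a set of user ids per bin and take the union over each window in plain Python. The following is a sketch of that idea rather than a fix confirmed by anyone in the thread. The names per_bin, bins, and rolling_unique are made up for this example, and "5min" is used in place of "5T" because newer pandas versions deprecate the "T" alias.

```python
import math

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "time": pd.to_datetime(
            [
                "2023-02-20 00:00:20", "2023-02-20 00:01:25", "2023-02-20 00:04:28",
                "2023-02-20 00:05:33", "2023-02-20 00:07:36", "2023-02-20 00:08:37",
                "2023-02-20 00:09:38", "2023-02-20 00:11:40", "2023-02-20 00:14:50",
                "2023-02-20 00:15:52", "2023-02-20 00:16:58", "2023-02-20 00:17:01",
                "2023-02-20 00:17:05", "2023-02-20 00:18:05", "2023-02-20 00:19:06",
            ]
        ),
        "user_id": [
            5008662006351712, 5008662006474892, 5008662006889403,
            5008662006351712, 5008662004944382, 5008662006760417,
            5008662004941892, 5008662006810617, 5008662006936927,
            5008662005514572, 5008662004874462, 5008662006937193,
            5008662006914843, 5008662006871041, 5008662006478082,
        ],
    }
)
window_size = 2

# Step 1: collect the distinct user ids seen in each 5-minute bin.
per_bin = df.resample("5min", label="right", on="time")["user_id"].apply(
    lambda s: set(s)
)

# Defensive assumption: anything that is not a set (e.g. NaN for an
# empty bin) is treated as an empty set.
bins = [b if isinstance(b, set) else set() for b in per_bin]

# Step 2: union each bin with the previous window_size - 1 bins and count.
# Positions without a full window get NaN, mirroring min_periods=window_size.
rolling_unique = [
    len(set().union(*bins[i - window_size + 1 : i + 1]))
    if i >= window_size - 1
    else math.nan
    for i in range(len(bins))
]

result = pd.DataFrame({"time": per_bin.index, "user_id": rolling_unique})
print(result)
```

For the sample data this gives NaN, 6, 6, 8 for the four bins, which matches the output requested in the question. The Python-level loop trades speed for clarity, so it may be slow on very large frames.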