Asked by: dingaro  Asked: 11/8/2023  Updated: 11/8/2023  Views: 46
How can I aggregate several DataFrame columns in Python pandas so that a sum over NaN values gives NaN instead of 0?
Q:
I have a DataFrame in Python pandas like the one below.
Input data:
import numpy as np
import pandas as pd

df = pd.DataFrame({
'id' : [999, 999, 999, 185, 185, 185, 999, 999, 999],
'target' : [1, 1, 1, 0, 0, 0, 1, 1, 1],
'event': ['2023-01-01', '2023-01-01', '2023-02-03', '2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03'],
'survey': ['2023-02-02', '2023-02-02', '2023-02-02', '2023-03-10', '2023-03-10', '2023-03-10', '2023-04-22', '2023-04-22', '2023-04-22'],
'event1': [1, 6, 11, 16, np.nan, 22, 74, 109, 52],
'event2': [2, 7, np.nan, 17, 22, np.nan, np.nan, 10, 5],
'event3': [3, 8, 13, 18, 23, np.nan, 2, np.nan, 99],
'event4': [4, 9, np.nan, np.nan, np.nan, 11, 8, np.nan, np.nan],
'event5': [np.nan, np.nan, 15, 20, 25, 1, 1, 3, np.nan]
})
df
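As you can see, in column 'event5' I have NaN twice for id = 999 and event = 2023-01-01.
Requirements:
I need to aggregate this DataFrame so that, for each id, the values in columns event1, event2, event3, event4 and event5 are summed over the rows that share the same date in the 'event' column.
For example, if id = 999 has 2 rows with event = 2023-01-01, I need to sum the values in event1, event2, event3, event4 and event5 so that this id ends up with a single row for that date.
This is my current code in Python pandas: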
column_names = df.columns
df = df.groupby(["id","target", "survey", "event"]).agg({col: 'sum' for col in column_names if col not in ["id","target", "survey", "event"]})
df.reset_index(inplace = True)
df
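However, with this code a sum over NaN values comes out as 0, whereas I want NaN whenever NaN values have to be summed:
Example result:
So I need a result in which a sum of NaN values is NaN rather than 0.
How can I modify my code to achieve this, or do you have another idea?
A:
1 upvote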
mozway
11/8/2023
#1
By default, pandas' sum skips NaNs. You can pass skipna=False to make them propagate:
out = (df.groupby(["id", "target", "survey", "event"], as_index=False)
         .agg(lambda x: x.sum(skipna=False))
      )
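To see the effect of skipna in isolation, here is a minimal sketch on plain Series. The names `s` and `all_nan` are illustrative and not part of the question's data.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

default_sum = s.sum()             # NaN is skipped, so only 1.0 and 3.0 are added
strict_sum = s.sum(skipna=False)  # NaN propagates into the result

# An all-NaN Series sums to 0.0 by default, which is the behaviour
# the question runs into.
all_nan = pd.Series([np.nan, np.nan])
all_nan_sum = all_nan.sum()

print(default_sum, strict_sum, all_nan_sum)
```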
Or use the underlying numpy array, since numpy's sum does not skip NaNs:
out = (df.groupby(["id", "target", "survey", "event"], as_index=False)
         .agg(lambda x: x.to_numpy().sum())
      )
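The numpy behaviour this relies on can be checked on a small array. This is only a sketch, and the array `arr` is an illustrative value rather than data from the question.

```python
import numpy as np

arr = np.array([1.0, np.nan, 3.0])

plain = arr.sum()          # ndarray.sum propagates NaN
ignoring = np.nansum(arr)  # np.nansum treats NaN as zero

print(plain, ignoring)
```

So calling `.to_numpy().sum()` inside `agg` behaves like `sum(skipna=False)` for float columns.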
Output:
    id  target      survey       event  event1  event2  event3  event4  event5
0  185       0  2023-03-10  2023-01-01    16.0    17.0    18.0     NaN    20.0
1  185       0  2023-03-10  2023-01-02     NaN    22.0    23.0     NaN    25.0
2  185       0  2023-03-10  2023-01-03    22.0     NaN     NaN    11.0     1.0
3  999       1  2023-02-02  2023-01-01     7.0     9.0    11.0    13.0     NaN
4  999       1  2023-02-02  2023-02-03    11.0     NaN    13.0     NaN    15.0
5  999       1  2023-04-22  2023-01-01    74.0     NaN     2.0     8.0     1.0
6  999       1  2023-04-22  2023-01-02   109.0    10.0     NaN     NaN     3.0
7  999       1  2023-04-22  2023-01-03    52.0     5.0    99.0     NaN     NaN
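Note that skipna=False returns NaN as soon as a group contains a single NaN. If the goal is instead to get NaN only when every value in the group is NaN, the `min_count` parameter of the groupby sum may be a closer fit. The sketch below is not from the answer above. It uses a hypothetical toy frame with illustrative column names `key` and `x` to contrast the two behaviours.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group "a" has one valid value, group "b" is all NaN.
toy = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "x": [1.0, np.nan, np.nan, np.nan],
})

# min_count=1: at least one non-NaN value is required, otherwise the sum is NaN.
out_min = toy.groupby("key")["x"].sum(min_count=1)

# skipna=False: any NaN in the group makes the sum NaN.
out_strict = toy.groupby("key")["x"].agg(lambda x: x.sum(skipna=False))

print(out_min)
print(out_strict)
```

With `min_count=1`, group "a" keeps its partial sum of 1.0 and only the all-NaN group "b" becomes NaN. With `skipna=False`, both groups become NaN.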
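Output NaN only when there are at least N NaNs
You can also decide dynamically whether the result should be NaN, based on whether the number of NaNs reaches a threshold. For example, ignore a single NaN but not 2 or more: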
N = 2
out = (df.groupby(["id", "target", "survey", "event"], as_index=False)
         .agg(lambda x: x.sum(skipna=x.isna().sum() < N))
      )
Output:
    id  target      survey       event  event1  event2  event3  event4  event5
0  185       0  2023-03-10  2023-01-01    16.0    17.0    18.0     0.0    20.0
1  185       0  2023-03-10  2023-01-02     0.0    22.0    23.0     0.0    25.0
2  185       0  2023-03-10  2023-01-03    22.0     0.0     0.0    11.0     1.0
3  999       1  2023-02-02  2023-01-01     7.0     9.0    11.0    13.0     NaN
4  999       1  2023-02-02  2023-02-03    11.0     0.0    13.0     0.0    15.0
5  999       1  2023-04-22  2023-01-01    74.0     0.0     2.0     8.0     1.0
6  999       1  2023-04-22  2023-01-02   109.0    10.0     0.0     0.0     3.0
7  999       1  2023-04-22  2023-01-03    52.0     5.0    99.0     0.0     0.0
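The threshold rule can also be written as a named function, which may be easier to read and test than the inline lambda. This is a sketch: the helper name `sum_with_nan_threshold`, its parameter `n`, and the toy frame are all hypothetical and do not come from the answer.

```python
import numpy as np
import pandas as pd

def sum_with_nan_threshold(values: pd.Series, n: int = 2) -> float:
    """Sum a Series, skipping NaNs only while there are fewer than n of them."""
    nan_count = values.isna().sum()
    return values.sum(skipna=bool(nan_count < n))

one_nan = sum_with_nan_threshold(pd.Series([1.0, np.nan, 3.0]))     # 1 NaN, skipped
two_nans = sum_with_nan_threshold(pd.Series([np.nan, np.nan, 3.0]))  # 2 NaNs, propagated

# Hypothetical toy data to show the helper inside a groupby.
toy = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "x": [1.0, np.nan, np.nan, np.nan],
})
per_group = toy.groupby("key")["x"].agg(sum_with_nan_threshold)

print(one_nan, two_nans)
print(per_group)
```

As in the output above, a group whose values are all NaN but number fewer than `n` still sums to 0.0, because the NaNs are skipped and nothing is left to add.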