How can I aggregate several DataFrame columns in Python Pandas so that the sum of NaN values is NaN instead of 0?

Asked by: dingaro  Asked: 11/8/2023  Updated: 11/8/2023  Views: 46

Q:

I have a DataFrame in Python Pandas like the one below.

Input data:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id' : [999, 999, 999, 185, 185, 185, 999, 999, 999],
    'target' : [1, 1, 1, 0, 0, 0, 1, 1, 1],
    'event': ['2023-01-01', '2023-01-01', '2023-02-03', '2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03'],
    'survey': ['2023-02-02', '2023-02-02', '2023-02-02', '2023-03-10', '2023-03-10', '2023-03-10', '2023-04-22', '2023-04-22', '2023-04-22'],
    'event1': [1, 6, 11, 16, np.nan, 22, 74, 109, 52],
    'event2': [2, 7, np.nan, 17, 22, np.nan, np.nan, 10, 5],
    'event3': [3, 8, 13, 18, 23, np.nan, 2, np.nan, 99],
    'event4': [4, 9, np.nan, np.nan, np.nan, 11, 8, np.nan, np.nan],
    'event5': [np.nan, np.nan, 15, 20, 25, 1, 1, 3, np.nan]
})

df

As you can see in the "event5" column, id = 999 has two NaN values for event = 2023-01-01.

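Requirements:

I need to aggregate this DataFrame and, for each id and each date in the "event" column, sum all the values in the event1, event2, event3, event4 and event5 columns.

For example, if id = 999 has 2 rows with event = 2023-01-01, I need to sum the values of event1, event2, event3, event4 and event5 across those rows so that this id and date end up in a single row.

I have the following code in Python Pandas: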

column_names = df.columns
df = df.groupby(["id","target", "survey", "event"]).agg({col: 'sum' for col in column_names if col not in ["id","target", "survey", "event"]})
df.reset_index(inplace = True)
df

However, with this code the sum of NaN values comes out as 0, whereas I would like the result to be NaN when the values being summed are NaN.

Example result:

So I need a result in which the sum of NaN values is NaN rather than 0.

How can I modify my code to achieve this, or do you have another idea?

Tags: python pandas dataframe sum aggregate


A:

1 upvote  mozway  11/8/2023  #1

By default, pandas' sum skips NaNs. To make them propagate, pass skipna=False:

out = (df.groupby(["id","target", "survey", "event"], as_index=False)
         .agg(lambda x: x.sum(skipna=False))
       )
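
To illustrate the default behaviour on its own, here is a minimal sketch on a throwaway Series (the variable names are just for illustration). With every value skipped, the sum is taken over nothing and comes out as 0.0, which is where the zeros in the question come from.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])

default_sum = s.sum()             # NaNs are skipped, so the empty sum is 0.0
strict_sum = s.sum(skipna=False)  # NaNs propagate, so the result is NaN

print(default_sum, strict_sum)
```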

Or use the underlying numpy array, since numpy's sum does not skip NaNs:

out = (df.groupby(["id","target", "survey", "event"], as_index=False)
         .agg(lambda x: x.to_numpy().sum())
       )

Output:

    id  target      survey       event  event1  event2  event3  event4  event5
0  185       0  2023-03-10  2023-01-01    16.0    17.0    18.0     NaN    20.0
1  185       0  2023-03-10  2023-01-02     NaN    22.0    23.0     NaN    25.0
2  185       0  2023-03-10  2023-01-03    22.0     NaN     NaN    11.0     1.0
3  999       1  2023-02-02  2023-01-01     7.0     9.0    11.0    13.0     NaN
4  999       1  2023-02-02  2023-02-03    11.0     NaN    13.0     NaN    15.0
5  999       1  2023-04-22  2023-01-01    74.0     NaN     2.0     8.0     1.0
6  999       1  2023-04-22  2023-01-02   109.0    10.0     NaN     NaN     3.0
7  999       1  2023-04-22  2023-01-03    52.0     5.0    99.0     NaN     NaN
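
Note that both variants above return NaN as soon as a group contains at least one NaN. If the goal is instead to get NaN only when every value in a group is NaN, one possible alternative is the min_count parameter of pandas' sum. The sketch below uses a small made-up frame (column names borrowed from the question, values hypothetical) to contrast the two behaviours; it is an illustration under those assumptions, not the only way to do it.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (not the question's full data).
demo = pd.DataFrame({
    "id": [999, 999, 185, 185],
    "event": ["2023-01-01", "2023-01-01", "2023-01-01", "2023-01-01"],
    "event1": [1.0, np.nan, 16.0, 4.0],
    "event5": [np.nan, np.nan, 20.0, np.nan],
})

# min_count=1: a group needs at least one non-NaN value to yield a number,
# so an all-NaN group sums to NaN instead of 0.
out_min_count = demo.groupby(["id", "event"], as_index=False).sum(min_count=1)

# skipna=False: a single NaN in the group is enough to make the sum NaN.
out_strict = (demo.groupby(["id", "event"], as_index=False)
                  .agg(lambda x: x.sum(skipna=False)))

print(out_min_count)
print(out_strict)
```

With min_count=1, a group that mixes numbers and NaNs still sums its numbers; with skipna=False, the same group becomes NaN.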

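Output NaN only when there are at least N NaNs

You can also decide dynamically whether the result should be NaN, based on whether the number of NaNs reaches a threshold. For example, ignore a single NaN but not 2 or more: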

N = 2
out = (df.groupby(["id","target", "survey", "event"], as_index=False)
         .agg(lambda x: x.sum(skipna=x.isna().sum()<N))
       )

Output:

    id  target      survey       event  event1  event2  event3  event4  event5
0  185       0  2023-03-10  2023-01-01    16.0    17.0    18.0     0.0    20.0
1  185       0  2023-03-10  2023-01-02     0.0    22.0    23.0     0.0    25.0
2  185       0  2023-03-10  2023-01-03    22.0     0.0     0.0    11.0     1.0
3  999       1  2023-02-02  2023-01-01     7.0     9.0    11.0    13.0     NaN
4  999       1  2023-02-02  2023-02-03    11.0     0.0    13.0     0.0    15.0
5  999       1  2023-04-22  2023-01-01    74.0     0.0     2.0     8.0     1.0
6  999       1  2023-04-22  2023-01-02   109.0    10.0     0.0     0.0     3.0
7  999       1  2023-04-22  2023-01-03    52.0     5.0    99.0     0.0     0.0
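
The threshold lambda can also be written as a named function, which makes the rule easier to read and to test in isolation. A sketch, assuming a hypothetical helper name sum_with_nan_threshold and made-up sample values:

```python
import numpy as np
import pandas as pd

def sum_with_nan_threshold(x: pd.Series, n: int = 2) -> float:
    """Sum x, skipping NaNs only while x holds fewer than n of them."""
    nan_count = x.isna().sum()
    # bool(...) converts the numpy boolean into a plain Python bool.
    return x.sum(skipna=bool(nan_count < n))

one_nan = pd.Series([1.0, np.nan, 3.0])
two_nans = pd.Series([1.0, np.nan, np.nan])

print(sum_with_nan_threshold(one_nan))   # a single NaN is ignored
print(sum_with_nan_threshold(two_nans))  # two NaNs reach the threshold
```

It can then be passed to the aggregation as .agg(sum_with_nan_threshold), or with a different threshold as .agg(lambda x: sum_with_nan_threshold(x, n=3)).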