Comparing 2 DataFrames where one is a subset of the other, the subset's values change, and the main DataFrame must be updated only for the rows in the subset

Asked by: AGc  Asked: 3/7/2023  Last edited by: Barbara GendronAGc  Updated: 3/8/2023  Views: 56

Q:

I have a main dataframe with 4 columns, let's say:

Wage
Name
Age
Output (default value 0).

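This will be df_main. Then I create another dataframe based on this one, but filtered to the people whose age is below 25. This will be df_subset. Then I want to compare each wage against a number. I want to insert the result of the comparison into Output in df_subset, basically updating the default value it already had. Then I want to update Output in df_main only for the rows in df_subset. I don't work on df_main directly because I don't want to apply the filter above and lose the other entries.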

Is there an optimal way to do this?

Tags: python dataframe comparison

A:

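1 upvote  user19077881  3/7/2023  #1

You can use a mask to form the subset, change the subset, and use the same mask to write the changes back. See below for a very simple example of the method. Note: you must work on a copy of the masked subset.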

import pandas as pd

df = pd.DataFrame({ 'col1': [1, 2, 1, 2, 1, 2],
                    'col2': [1.1, 1.3, 3.4, 4.5, 3.2, 2.6]
                    })


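mask = df['col1'] == 2                 # boolean mask that selects the subset
dfx = df[mask].copy()                  # work on a copy of the masked subset
dfx['col2'] = dfx['col2'].where(dfx['col2'] < 2.0, 9.9)  # keep values < 2.0, set the rest to 9.9
df.loc[mask, 'col2'] = dfx['col2']     # write changes back to df with the same mask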
print(df)
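
As a small illustration of the note about working on a copy, here is a minimal sketch. The column values and the name subset are made up for the example. It shows that edits to the copy leave the original frame untouched until they are explicitly written back with the same mask.

```python
import pandas as pd

# Toy data (hypothetical values, chosen only for illustration)
df = pd.DataFrame({"col1": [1, 2, 1, 2], "col2": [10, 20, 30, 40]})

mask = df["col1"] == 2
subset = df.loc[mask].copy()          # independent copy of the masked rows
subset["col2"] = subset["col2"] + 100  # change the copy only

# The original frame has not changed yet
unchanged_before_write_back = df["col2"].tolist()

# Write the changed values back, only for the masked rows
df.loc[mask, "col2"] = subset["col2"]
after_write_back = df["col2"].tolist()

print(unchanged_before_write_back)
print(after_write_back)
```

Using .loc with the mask for the write-back avoids chained indexing such as df['col2'][mask] = ..., which pandas may apply to a temporary object instead of the original frame.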

1 upvote  Corralien  3/7/2023  #2

You can use:

# Create a boolean mask
m = df_main['Age'] < 25
df_subset = df_main.loc[m].copy()

# Compare each Wage against a number
number = 3000  # A number for what? Compute difference?
df_subset['Output'] = df_subset['Wage'] - number

# Returns to df_main without overriding existing values
df_main.loc[m, 'Output'] = df_subset['Output']

Output:

>>> df_main
    Wage  Age  Output
0   4658   50       0
1   4758   23    1758
2   4940   29       0
3   4184   37       0
4   4648   48       0
..   ...  ...     ...
95  1634   63       0
96  3446   23     446
97  2173   53       0
98  1225   44       0
99  3498   25       0

[100 rows x 3 columns]

>>> df_subset
    Wage  Age  Output
1   4758   23    1758
16  2063   21    -937
19  2191   22    -809
30  4552   21    1552
34  1920   23   -1080
42  1389   20   -1611
45  4640   24    1640
64  3065   24      65
76  3966   20     966
81  1033   20   -1967
82  1033   24   -1967
86  4655   22    1655
96  3446   23     446
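
The write-back step above works because pandas aligns the assigned Series on index labels rather than on position. The sketch below uses a small made-up frame (the values and the name shuffled are assumptions for illustration) and reorders the subset before writing it back, to show that each value still lands on its own row.

```python
import pandas as pd

# Hypothetical toy data standing in for df_main
df_main = pd.DataFrame({"Wage": [4000, 2500, 3500, 1200],
                        "Age": [50, 23, 21, 44],
                        "Output": 0})
number = 3000  # assumed comparison value, as in the answer

m = df_main["Age"] < 25
df_subset = df_main.loc[m].copy()
df_subset["Output"] = df_subset["Wage"] - number

# Reorder the subset: the row order no longer matches df_main
shuffled = df_subset.sort_values("Wage", ascending=False)

# Assignment aligns on the index labels (1 and 2), not on position
df_main.loc[m, "Output"] = shuffled["Output"]
print(df_main)
```

This is also why df_subset must keep the original index: if the index were reset, the labels would no longer match the rows in df_main.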

A more direct approach is:

df_main['Output'] = np.where(df_main['Age'] < 25,
                             df_main['Wage'] - number,  # Age < 25
                             df_main['Output'])  # Other
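
The question does not say what kind of comparison is wanted, so the difference above is one interpretation. As a hedged variant, the sketch below assumes the comparison is a greater-than test stored as 1 or 0 (the toy data and the names under_25 and flag are made up). It also checks that the np.where form and a masked .loc assignment give the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_main
df_main = pd.DataFrame({"Wage": [4000, 2500, 3500, 1200],
                        "Age": [50, 23, 21, 44],
                        "Output": 0})
number = 3000  # assumed comparison value

under_25 = df_main["Age"] < 25
# Assumption: "compare" means Wage > number, stored as 1 (True) or 0 (False)
flag = (df_main["Wage"] > number).astype(int)

# Variant 1: np.where keeps the existing Output for rows outside the mask
via_where = df_main.copy()
via_where["Output"] = np.where(under_25, flag, via_where["Output"])

# Variant 2: masked assignment with .loc touches only the selected rows
via_loc = df_main.copy()
via_loc.loc[under_25, "Output"] = flag[under_25]

print(via_where)
```

Both variants avoid building df_subset at all. Keeping a separate subset is mainly useful when it is needed for other processing.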

Input dataframe:

import pandas as pd
import numpy as np

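rng = np.random.default_rng(2023)  # seeded generator
# Draw from rng so the seed takes effect (np.random.randint would ignore it)
df_main = pd.DataFrame({'Wage': rng.integers(1000, 5000, 100),
                        'Age': rng.integers(20, 65, 100),
                        'Output': 0})
print(df_main)

# Example output (exact values depend on the random draws); the Name column is omitted because it is not significant here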
    Wage  Age  Output
0   2893   34       0
1   1352   21       0
2   1629   23       0
3   1881   49       0
4   2193   34       0
..   ...  ...     ...
95  4404   31       0
96  2694   50       0
97  4371   33       0
98  3263   34       0
99  2719   45       0

[100 rows x 3 columns]
