Replace values in a pandas dataframe

Asked by: Hemant Sain · Asked: 6/13/2022 · Last edited by: Cesar Lopes, Hemant Sain · Updated: 6/13/2022 · Views: 386

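Q:

I have a pandas DataFrame that is generated from events. Each event has a unique ID, and it produces repeated rows in the DataFrame.

The problem is that some of these repeated rows contain random values, so the rows differ from one another.

I need to replace the values in the columns (Name, Age, Occupation) with the most frequent value per event_id.

Also, the Salary column needs its trailing hyphen removed (in the sample data it shows up as an underscore, as in 1975.42_).

Thanks in advance.

Input data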



print(df)

ID  event_id   Month    Name    Age Occupation Salary  
1   1_a        Jan      andrew  23             13414.12
2   1_a        Feb              NaN teacher    13414.12
3   1_a        Mar       ___                   13414.12
4   1_a        Apr      andrew  23  teacher    13414.12
5   1_a        May      andrew  24  principle  25000
6   1_b        Jan      Ash     45  scientist  1975.42_
7   1_b        Feb      #$%6        scientist  1975.42
8   1_b        Mar      Ash     45  ^#3a2g4    1975.42
9   1_b        Apr      Ash     45  scientist  1975.42

Expected output:

print(df)

ID  event_id   Month    Name    Age Occupation Salary
1   1_a        Jan      andrew  24  principle  25000
2   1_a        Feb      andrew  24  principle  25000
3   1_a        Mar      andrew  24  principle  25000
4   1_a        Apr      andrew  24  principle  25000
5   1_a        May      andrew  24  principle  25000
6   1_b        Jan      Ash     45  scientist  1975.42
7   1_b        Feb      Ash     45  scientist  1975.42
8   1_b        Mar      Ash     45  scientist  1975.42
9   1_b        Apr      Ash     45  scientist  1975.42
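python pandas manipulation data cleaning EDA

Comments

0 upvotes Cesar Lopes 6/13/2022
Could you share the raw data from which you build the DataFrame?
0 upvotes Hemant Sain 6/13/2022
@CesarLopes I don't follow. The raw data is what I gave as the input.
1 upvote Cesar Lopes 6/13/2022
I was referring to the whole logic up to the point where you get this final df. Sorry, I could have explained that better.
0 upvotes mozway 6/13/2022
How do you define "garbage"?
0 upvotes Hemant Sain 6/13/2022
@mozway Garbage as in nonsensical values. I want to replace them with the most common value within their respective event_id.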

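A:

1 upvote Cesar Lopes 6/13/2022 #1

First, I had to create the DataFrame. Unfortunately, I couldn't split the values out of the raw string where cells were blank, so I filled those cells with placeholders. With your actual DataFrame this shouldn't be a problem.

OK, now the logic:

The code collects the unique event values, then, for each event, iterates over the columns. Using collections, I get a dictionary that counts how often each value occurs in the column filtered by event, and I overwrite every value in that column with the most frequent one.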
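The counting step on its own can be sketched as follows. This is only an illustration: the list mirrors the Name column for event 1_a in the sample data, and the variable names are hypothetical.

```python
import collections

# Values of the Name column for a single event_id (copied from the sample rows).
names = ["andrew", "-", "___", "andrew", "andrew"]

# Counter maps each distinct value to the number of times it occurs.
counter = collections.Counter(names)

# The key with the highest count is the most frequent value.
most_frequent = max(counter, key=counter.get)

print(counter)
print(most_frequent)
```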

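This only breaks down when a column has more repeated garbage than good values. For example, if a column filtered by event contains 30 distinct garbage values and the good value appears just twice, the good value still wins and becomes the replacement.

If the column contains 30 garbage values and the good value appears only once, every count is tied, so a garbage value can become the replacement (max over the counter returns the first value it encountered with the highest count).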
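A small sketch of these two cases, using made-up values rather than real data. The helper name most_frequent is hypothetical. It relies on Counter keeping insertion order (Python 3.7+) and on max returning the first maximal key.

```python
import collections


def most_frequent(values):
    """Return the most common value; ties go to the value seen first."""
    counter = collections.Counter(values)
    return max(counter, key=counter.get)


# The good value "Ash" occurs twice among distinct garbage values: it wins.
col_a = ["#$%6", "Ash", "^#3a", "Ash", "zz9"]

# The good value occurs once: all counts are 1, so the first value seen wins.
col_b = ["#$%6", "Ash", "^#3a", "zz9"]

print(most_frequent(col_a))
print(most_frequent(col_b))
```

If ties like the second case matter for your data, you would need an explicit rule for what counts as garbage before counting.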

The code is as follows:

import collections

import pandas as pd

# Sample data. Blank cells are filled with placeholders ('-', 'NA', 'z')
# so that a plain whitespace split yields seven fields per row.
data =   """ID  event_id   Month    Name    Age Occupation Salary
            1   1_a        Jan      andrew  23     -       13414.12
            2   1_a        Feb        -     NA  teacher    13414.12
            3   1_a        Mar       ___     -     z       13414.12
            4   1_a        Apr      andrew  23  teacher    13414.12
            5   1_a        May      andrew  24  principle  25000
            6   1_b        Jan      Ash     45  scientist  1975.42_
            7   1_b        Feb      #$%6     -  scientist  1975.42
            8   1_b        Mar      Ash     45  ^#3a2g4    1975.42
            9   1_b        Apr      Ash     45  scientist  1975.42"""

# Skip the header line and split each remaining row on whitespace.
rows = [line.split() for line in data.split('\n')[1:]]

df = pd.DataFrame(rows, columns=['ID', 'event_id', 'Month', 'Name', 'Age', 'Occupation', 'Salary'])

columns = ['Name', 'Age', 'Occupation', 'Salary']
for event in df['event_id'].unique():
    # Boolean mask selecting the rows that belong to this event.
    mask = df['event_id'] == event
    for column in columns:
        # Count how often each value occurs in this column for this event.
        counter = collections.Counter(df.loc[mask, column])
        # Most frequent value; ties go to the value seen first.
        new_value = max(counter, key=counter.get)
        # Assign through .loc; df[column][i] = ... is chained assignment
        # and may not update the original DataFrame.
        df.loc[mask, column] = new_value

print(df)

Output:

  ID event_id Month    Name Age Occupation    Salary
0  1      1_a   Jan  andrew  23    teacher  13414.12
1  2      1_a   Feb  andrew  23    teacher  13414.12
2  3      1_a   Mar  andrew  23    teacher  13414.12
3  4      1_a   Apr  andrew  23    teacher  13414.12
4  5      1_a   May  andrew  23    teacher  13414.12
5  6      1_b   Jan     Ash  45  scientist   1975.42
6  7      1_b   Feb     Ash  45  scientist   1975.42
7  8      1_b   Mar     Ash  45  scientist   1975.42
8  9      1_b   Apr     Ash  45  scientist   1975.42
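As a possible alternative to the explicit loop, the same idea can be sketched with groupby and transform. This is not the code from the answer; it assumes blank cells are read as missing values (None here), strips the trailing underscore from Salary before counting, and uses a hypothetical helper named most_frequent. Series.mode skips missing values by default and returns its results sorted, so ties are resolved by sort order rather than by first occurrence.

```python
import pandas as pd

# Sample rows from the question; blank cells are represented as None (assumption).
df = pd.DataFrame({
    "event_id": ["1_a"] * 5 + ["1_b"] * 4,
    "Name": ["andrew", None, "___", "andrew", "andrew",
             "Ash", "#$%6", "Ash", "Ash"],
    "Age": ["23", None, None, "23", "24", "45", None, "45", "45"],
    "Occupation": [None, "teacher", None, "teacher", "principle",
                   "scientist", "scientist", "^#3a2g4", "scientist"],
    "Salary": ["13414.12"] * 4 + ["25000",
               "1975.42_", "1975.42", "1975.42", "1975.42"],
})

# Remove trailing underscores so "1975.42_" and "1975.42" count as one value.
df["Salary"] = df["Salary"].str.rstrip("_")


def most_frequent(series):
    """Return the most common non-missing value of a group, or None if empty."""
    modes = series.mode()  # missing values are dropped by default
    return modes.iloc[0] if not modes.empty else None


# Broadcast each group's most frequent value back onto every row of the group.
for col in ["Name", "Age", "Occupation", "Salary"]:
    df[col] = df.groupby("event_id")[col].transform(most_frequent)

print(df)
```

Like the loop in the answer, this picks 23, teacher and 13414.12 for event 1_a, because those are the most frequent values in the sample.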