Asked by: Hemant Sain | Asked: 6/13/2022 | Last edited by: Cesar Lopes, Hemant Sain | Updated: 6/13/2022 | Views: 386
Replace values in a pandas dataframe
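Q:
I have a pandas DataFrame that is generated from events. Each event has a unique ID, and each event produces duplicate rows in the DataFrame.
The problem is that some of these duplicate rows contain random values, so they differ from one another.
I need to replace the values in the columns (Name, Age, Occupation) with the most frequent value per event_id.
The Salary column also needs its trailing hyphen removed (see the value `1975.42_` in the sample).
Thanks in advance.
Input data: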
print(df)
ID event_id Month Name Age Occupation Salary
1 1_a Jan andrew 23 13414.12
2 1_a Feb NaN teacher 13414.12
3 1_a Mar ___ 13414.12
4 1_a Apr andrew 23 teacher 13414.12
5 1_a May andrew 24 principle 25000
6 1_b Jan Ash 45 scientist 1975.42_
7 1_b Feb #$%6 scientist 1975.42
8 1_b Mar Ash 45 ^#3a2g4 1975.42
9 1_b Apr Ash 45 scientist 1975.42
Expected output:
print(df)
ID event_id Month Name Age Occupation Salary
1 1_a Jan andrew 24 principle 25000
2 1_a Feb andrew 24 principle 25000
3 1_a Mar andrew 24 principle 25000
4 1_a Apr andrew 24 principle 25000
5 1_a May andrew 24 principle 25000
6 1_b Jan Ash 45 scientist 1975.42
7 1_b Feb Ash 45 scientist 1975.42
8 1_b Mar Ash 45 scientist 1975.42
9 1_b Apr Ash 45 scientist 1975.42
A:
1 upvote
Cesar Lopes
6/13/2022
#1
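First I had to create the DataFrame. Unfortunately I could not split the values out of the raw string where cells were blank, but with your actual DataFrame that should not be a problem.
OK, now the logic:
The code builds the set of unique event IDs, then iterates over the columns for each event. Using collections, I get a dictionary that counts how often each value occurs in the column filtered by event, and the most frequent value overwrites all the others.
This only breaks down when, within an event, the junk outnumbers the good value. For example, if a column filtered by event has 30 different junk values and the good value is repeated just 2 times, the good value still wins and becomes the replacement.
If the column has 30 junk values and the good value appears only once, every value is tied and an arbitrary junk value can become the replacement.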
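The two cases above can be checked directly with `collections.Counter`. This is a minimal sketch using made-up values: the `junk_*` strings, the value `good`, and the helper name `most_frequent` are illustrative assumptions, not taken from the question's data. The selection itself is the same `max(counter, key=counter.get)` call used in the answer.

```python
import collections


def most_frequent(values):
    """Return the most common value; on a tie, the first one seen wins."""
    counter = collections.Counter(values)
    return max(counter, key=counter.get)


# Hypothetical data: 30 distinct junk strings, each occurring once.
junk = [f"junk_{i}" for i in range(30)]

# Case 1: the good value appears twice, so it beats every junk value.
case_twice = most_frequent(junk + ["good", "good"])

# Case 2: the good value appears once, so all 31 values are tied and the
# first value encountered (a junk value here) is returned.
case_once = most_frequent(junk + ["good"])

print(case_twice)  # good
print(case_once)   # junk_0
```

Because a tie goes to whichever value comes first, the result in the second case depends on row order rather than on which value is actually correct.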
Here is the full code:
import collections

import pandas as pd

# Sample data from the question. Placeholder tokens such as "-" fill the
# blank cells so that a plain whitespace split gives seven fields per row.
data = """ID event_id Month Name Age Occupation Salary
1 1_a Jan andrew 23 - 13414.12
2 1_a Feb - NA teacher 13414.12
3 1_a Mar ___ - z 13414.12
4 1_a Apr andrew 23 teacher 13414.12
5 1_a May andrew 24 principle 25000
6 1_b Jan Ash 45 scientist 1975.42_
7 1_b Feb #$%6 - scientist 1975.42
8 1_b Mar Ash 45 ^#3a2g4 1975.42
9 1_b Apr Ash 45 scientist 1975.42"""
# Skip the header line and split every remaining row on whitespace.
rows = [line.split() for line in data.split('\n')[1:]]
df = pd.DataFrame(rows, columns=['ID', 'event_id', 'Month', 'Name', 'Age', 'Occupation', 'Salary'])
print(df)
print('\n')

events = set(df['event_id'])
columns = ['Name', 'Age', 'Occupation', 'Salary']
for event in events:
    # Boolean mask selecting the rows that belong to this event.
    mask = df['event_id'] == event
    print(df.loc[mask])
    for column in columns:
        # Count how often each value occurs in this event's column.
        counter = collections.Counter(df.loc[mask, column])
        print(df.loc[mask, column])
        print()
        # Most frequent value; on a tie, the value seen first wins.
        new_value = max(counter, key=counter.get)
        # Write through .loc rather than chained df[column][i] assignment,
        # which pandas may apply to a temporary copy.
        df.loc[mask, column] = new_value
print(df)
Output:
ID event_id Month Name Age Occupation Salary
0 1 1_a Jan andrew 23 teacher 13414.12
1 2 1_a Feb andrew 23 teacher 13414.12
2 3 1_a Mar andrew 23 teacher 13414.12
3 4 1_a Apr andrew 23 teacher 13414.12
4 5 1_a May andrew 23 teacher 13414.12
5 6 1_b Jan Ash 45 scientist 1975.42
6 7 1_b Feb Ash 45 scientist 1975.42
7 8 1_b Mar Ash 45 scientist 1975.42
8 9 1_b Apr Ash 45 scientist 1975.42
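As a possible alternative to the explicit loops, the same per-event replacement can be sketched with `groupby(...).transform(...)`. This is an illustration, not part of the original answer: the helper name `most_frequent`, the reduced sample frame, and the decision to strip trailing `_` and `-` characters from Salary before counting are all assumptions. Ties are resolved by taking the first entry of `Series.mode()`, so the result can differ from the loop version when two values are equally frequent.

```python
import pandas as pd

# Reduced sample in the spirit of the question's data (values are illustrative).
df = pd.DataFrame({
    "event_id": ["1_a", "1_a", "1_a", "1_b", "1_b", "1_b"],
    "Name": ["andrew", "___", "andrew", "Ash", "#$%6", "Ash"],
    "Age": ["23", "-", "23", "45", "-", "45"],
    "Salary": ["13414.12", "13414.12", "13414.12",
               "1975.42_", "1975.42", "1975.42"],
})

# Remove trailing underscores/hyphens from Salary before counting values.
df["Salary"] = df["Salary"].str.rstrip("_-")


def most_frequent(series):
    """Return the most common value in a group (first mode on a tie)."""
    return series.mode().iloc[0]


# For each column, transform broadcasts the group's most frequent value
# back onto every row of that event.
for column in ["Name", "Age", "Salary"]:
    df[column] = df.groupby("event_id")[column].transform(most_frequent)

print(df)
```

If Salary is needed as a number afterwards, a call such as `pd.to_numeric(df["Salary"])` could follow the cleanup step.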