提问人:Jammy Wang 提问时间:11/9/2023 最后编辑:Jammy Wang 更新时间:11/9/2023 访问量:62
如何将一列中的多个字典与数据帧中不同行的相同键组合在一起
How to combine the multiple dictionaries in a column with the same key for different rows in dataframe
问:
数据框如下:
ID area data_info
a001 NY [{color: 'Yellow', 'count': 3, 'weight': 5}, {color: 'Blue', 'count': 2, 'weight': 11} , {color: 'Red', 'count': 7, 'weight': 3}]
a002 SF [{color: 'Green', 'count': 1, 'weight': 14},{color: 'Yellow', 'count': 9, 'weight': 2}]
a003 NY [{color: 'Blue', 'count': 5, 'weight': 6}, {color: 'Black', 'count': 2, 'weight': 15}]
具有相同区域的行将data_info中的信息合并,并根据相同的颜色键值聚合计数和权重。
预期成果:
ID a001 和 a003 都具有相同的面积,NY,因此 data_info 的值将合并并聚合为一行。由于 a001 和 a003 在字典中的颜色为蓝色,因此它将聚合计数和重量的总和。
area data_info
NY [{color: 'Yellow', 'count': 3, 'weight': 5}, {color: '**Blue**', 'count': **7**, 'weight': **17**} , {color: 'Red', 'count': 7, 'weight': 3}, {color: 'Black', 'count': 2, 'weight': 15}]
SF [{color: 'Green', 'count': 1, 'weight': 14},{color: 'Yellow', 'count': 9, 'weight': 2}]
答:
1赞
Corralien
11/9/2023
#1
假设您在 data_info
中有一个有效的数据结构(字典列表)。
您可以使用从列中提取数据:pd.json_normalize
data_info
# Create an index to allow merging the area column
idx = df.index.repeat(df['data_info'].str.len())
# Extract data and set the area column
dat = (pd.json_normalize(df['data_info'].explode())
.assign(area=df.loc[idx, 'area'].values))
# Compute the sum
dat = dat.groupby(['color', 'area']).sum().reset_index('color')
# Create the expected output
out = (pd.DataFrame({'area': dat.index, 'data_info': dat.to_dict('records')})
.groupby('area', as_index=False).agg(list))
输出:
>>> out
area data_info
0 NY [{'color': 'Black', 'count': 2, 'weight': 15}, {'color': 'Blue', 'count': 7, 'weight': 17}, {'color': 'Red', 'count': 7, 'weight': 3}, {'color': 'Yellow', 'count': 3, 'weight': 5}]
1 SF [{'color': 'Green', 'count': 1, 'weight': 14}, {'color': 'Yellow', 'count': 9, 'weight': 2}]
中间步骤
>>> (pd.json_normalize(df['data_info'].explode())
.assign(area=df.loc[idx, 'area'].values))
color count weight area
0 Yellow 3 5 NY
1 Blue 2 11 NY
2 Red 7 3 NY
3 Green 1 14 SF
4 Yellow 9 2 SF
5 Blue 5 6 NY
6 Black 2 15 NY
>>> dat.groupby(['color', 'area']).sum().reset_index('color')
color count weight
area
NY Black 2 15
NY Blue 7 17
SF Green 1 14
NY Red 7 3
NY Yellow 3 5
SF Yellow 9 2
0赞
Ömer Sezer
11/9/2023
#2
带有 for 循环的经典方式。它会在最后创建一个文件(result.txt)以更好地查看结果。
法典:
import pandas as pd
data = {
'ID': ['a001', 'a002', 'a003'],
'area': ['NY', 'SF', 'NY'],
'data_info': [
[{'color': 'Yellow', 'count': 3, 'weight': 5}, {'color': 'Blue', 'count': 2, 'weight': 11}, {'color': 'Red', 'count': 7, 'weight': 3}],
[{'color': 'Green', 'count': 1, 'weight': 14}, {'color': 'Yellow', 'count': 9, 'weight': 2}],
[{'color': 'Blue', 'count': 5, 'weight': 6}, {'color': 'Black', 'count': 2, 'weight': 15}]
]
}
df = pd.DataFrame(data)
def combine_data_info(group):
combined_data = {}
for _, row in group.iterrows():
for entry in row['data_info']:
color = entry['color']
if color not in combined_data:
combined_data[color] = {'color': color, 'count': 0, 'weight': 0}
combined_data[color]['count'] += entry['count']
combined_data[color]['weight'] += entry['weight']
return list(combined_data.values())
result = df.groupby('area').apply(combine_data_info).reset_index(name='data_info')
result.to_csv('result.txt', sep='\t', index=False)
输出:
area data_info
NY [{'color': 'Yellow', 'count': 3, 'weight': 5}, {'color': 'Blue', 'count': 7, 'weight': 17}, {'color': 'Red', 'count': 7, 'weight': 3}, {'color': 'Black', 'count': 22, 'weight': 165}]
SF [{'color': 'Green', 'count': 1, 'weight': 14}, {'color': 'Yellow', 'count': 9, 'weight': 2}]
0赞
mozway
11/9/2023
#3
使用自定义函数和 groupby.agg
:
from itertools import chain
def agg_dict(s):
return (pd.DataFrame(chain.from_iterable(s))
.groupby('color', as_index=False).sum()
.to_dict('records')
)
out = df.groupby('area', as_index=False)['data_info'].agg(agg_dict)
输出:
area data_info
0 NY [{'color': 'Black', 'count': 2, 'weight': 15}, {'color': 'Blue', 'count': 7, 'weight': 17}, {'color': 'Red', 'count': 7, 'weight': 3}, {'color': 'Yellow', 'count': 3, 'weight': 5}]
1 SF [{'color': 'Green', 'count': 1, 'weight': 14}, {'color': 'Yellow', 'count': 9, 'weight': 2}]
评论
data_info