提问人:Ravindra S 提问时间:7/1/2022 最后编辑:Ravindra S 更新时间:7/1/2022 访问量:161
Pandas:查找一列的值比率,然后按另一列的分组
Pandas: Find ratio of values for a column and then groupby on another column
问:
我正在使用人口普查收入数据集。您可以在“数据文件夹”中找到数据集文件 adult.data。
若要快速重现问题,请按以下步骤加载它:
training_df = pd.read_csv('adult.data', header = None, skipinitialspace = True)
columns = ['age','workclass','fnlwgt','education','education-num','marital-status',
'occupation','relationship','race','sex','capital-gain','capital-loss',
'hours-per-week','native-country','income']
training_df.columns = columns
我试图使用一个简单的比率找出每个native_country的收入不平衡:
收入不平衡=收入<=50K的人口/收入>50K的人口
这是我最幼稚、非 Python 和非 Pandas 的方法:
def native_country_income_imbalance():
income_dict = {}
for index, data in training_df.iterrows():
native_country = data['native-country']
income = data['income']
# 1st number will store count of >50K and second <=50K
income_count = [0,0]
if not income_dict.get(native_country, False):
if income == '>50K':
income_count[0] += 1
income_dict[native_country] = income_count
else:
income_count[1] += 1
income_dict[native_country] = income_count
else:
if income == '>50K':
income_dict[native_country][0] += 1
else:
income_dict[native_country][1] += 1
for country, incomes in income_dict.items():
# For a native_country where there is no one with >50K
# income, we'll make proportion 0 as a special case
if incomes[0] != 0:
proportion = round(incomes[1] / incomes[0], 2)
else:
proportion = 0
income_dict[country] = proportion
income_dict = dict(sorted(income_dict.items(), key=lambda item: item[1],reverse=True))
return income_dict
调用函数
native_country_income_imbalance()
将输出正确返回为
{'Dominican-Republic': 34.0,
'Columbia': 28.5,
'Guatemala': 20.33,
'Mexico': 18.48,
'Honduras': 12.0,
.
.
'Taiwan': 1.55,
'India': 1.5,
'France': 1.42,
'Iran': 1.39,
'Outlying-US(Guam-USVI-etc)': 0,
'Holand-Netherlands': 0}
这显然是冗长的,而不是利用 Pandas 的真正力量(矢量化 + groupby + transform)。我该如何改进?
注意:请随时改进问题标题。
答:
2赞
Shubham Sharma
7/1/2022
#1
Pandas 解决方案
使用 创建一个频率表,然后屏蔽 count 为零的值,然后使用 eval 除以列来计算比率crosstab
<=50K
>50K
s = pd.crosstab(df['native-country'], df['income'])
result = s[s != 0].eval('`<=50K` / `>50K`').fillna(0)
result = result.round(decimals = 2)
result = result.sort_values(ascending=False)
result
native-country
Dominican-Republic 34.00
Columbia 28.50
Guatemala 20.33
Mexico 18.48
Nicaragua 16.00
.
.
India 1.50
France 1.42
Iran 1.39
Outlying-US(Guam-USVI-etc) 0.00
Holand-Netherlands 0.00
dtype: float64
评论
0赞
Ravindra S
7/1/2022
这真是太棒了。给熊猫更多力量!
评论