Pandas:查找一列的值比率,然后按另一列的分组

Pandas: Find ratio of values for a column and then groupby on another column

提问人:Ravindra S 提问时间:7/1/2022 最后编辑:Ravindra S 更新时间:7/1/2022 访问量:161

问:

我正在使用人口普查收入数据集。您可以在“数据文件夹”中找到数据集文件 adult.data。

若要快速重现问题,请按以下步骤加载它:

training_df = pd.read_csv('adult.data', header = None, skipinitialspace = True)
columns = ['age','workclass','fnlwgt','education','education-num','marital-status',
           'occupation','relationship','race','sex','capital-gain','capital-loss',
           'hours-per-week','native-country','income']
training_df.columns = columns

我试图使用一个简单的比率找出每个native_country的收入不平衡:

收入不平衡=收入<=50K的人口/收入>50K的人口

这是我最幼稚、非 Python 和非 Pandas 的方法:

def native_country_income_imbalance():
    income_dict = {}
    for index, data in training_df.iterrows():
        native_country = data['native-country']
        income = data['income']

        # 1st number will store count of >50K and second <=50K
        income_count = [0,0]
        if not income_dict.get(native_country, False):
            if income == '>50K':
                income_count[0] += 1
                income_dict[native_country] = income_count
            else:
                income_count[1] += 1
                income_dict[native_country] = income_count
        else: 
            if income == '>50K':
                income_dict[native_country][0] += 1
            else:
                income_dict[native_country][1] += 1
    
    for country, incomes in income_dict.items():
            # For a native_country where there is no one with >50K
            # income, we'll make proportion 0 as a special case
            if incomes[0] != 0:
                proportion = round(incomes[1] / incomes[0], 2)
            else:
                proportion = 0
            income_dict[country] = proportion
    income_dict = dict(sorted(income_dict.items(), key=lambda item: item[1],reverse=True))
    
    return income_dict

调用函数

native_country_income_imbalance()

将输出正确返回为

{'Dominican-Republic': 34.0,
 'Columbia': 28.5,
 'Guatemala': 20.33,
 'Mexico': 18.48,
 'Honduras': 12.0,
 .
 .
 'Taiwan': 1.55,
 'India': 1.5,
 'France': 1.42,
 'Iran': 1.39,
 'Outlying-US(Guam-USVI-etc)': 0,
 'Holand-Netherlands': 0}

这显然是冗长的,而不是利用 Pandas 的真正力量(矢量化 + groupby + transform)。我该如何改进?

注意:请随时改进问题标题。

python pandas numpy pandas-groupby 数据操作

评论


答:

2赞 Shubham Sharma 7/1/2022 #1

Pandas 解决方案

使用 创建一个频率表,然后屏蔽 count 为零的值,然后使用 eval 除以列来计算比率crosstab<=50K>50K

s = pd.crosstab(df['native-country'], df['income'])
result = s[s != 0].eval('`<=50K` / `>50K`').fillna(0)
result = result.round(decimals = 2)
result = result.sort_values(ascending=False)
result

native-country
Dominican-Republic            34.00
Columbia                      28.50
Guatemala                     20.33
Mexico                        18.48
Nicaragua                     16.00
.
.
India                          1.50
France                         1.42
Iran                           1.39
Outlying-US(Guam-USVI-etc)     0.00
Holand-Netherlands             0.00
dtype: float64

评论

0赞 Ravindra S 7/1/2022
这真是太棒了。给熊猫更多力量!