用于过滤数据帧的动态 for 循环

Dynamic for loop for filtering dataframe

提问人:user22681865 提问时间:10/4/2023 最后编辑:OCauser22681865 更新时间:10/5/2023 访问量:82

问:

我正在寻找一种无需手动添加for循环即可过滤数据的方法。

我有以下代码,而我只想指定我希望过滤数据的列(在名为“col_list”的变量中)。有没有一种更简单、更通用的方法来实现这一目标?

import pandas as pd

df = pd.DataFrame({
    'Col1':['A','A','A','B','B','B'],
    'Col2':['AA','AB','AC','BA','BB','BC'],
    'Val':[1,2,3,1,2,3]
})

# Insert all Cols that should be looped through
col_list = ['Col1','Col2']

# Loop through Col1
for i in ['total'] + list(df[col_list[0]].unique()):
    d1 = df[(df[col_list[0]] == i)] if i != 'total' else df.copy()
    
    # Loop through Col2
    for j in ['total'] + list(d1[col_list[1]].unique()):
        d2 = d1[(d1[col_list[1]] == j)]  if j != 'total' else d1.copy()
        print(f"{i} | {j} | {sum(d2['Val'])}")

输出:

total | total | 12
total | AA | 1
total | AB | 2
total | AC | 3
total | BA | 1
total | BB | 2
total | BC | 3
A | total | 6
A | AA | 1
A | AB | 2
A | AC | 3
B | total | 6
B | BA | 1
B | BB | 2
B | BC | 3

所需方法:

col_list = ['Col1','Col2']

get_data(df,col_list):
    '''Formula to return the data'''
Python Pandas 循环 for 循环 矢量化

评论

4赞 Panda Kim 10/4/2023
在示例中,您期望的输出是什么?
0赞 OCa 10/9/2023
有没有适合您的解决方案?

答:

0赞 Umar 10/4/2023 #1

这样可以吗

import pandas as pd

def get_data(df, col_list):
    filtered_df = df.copy()
    for col in col_list:
        unique_values = df[col].unique()
        if 'total' in unique_values:
            filtered_df = filtered_df[(filtered_df[col] == 'total') | (filtered_df[col].isin(unique_values))]
        else:
            filtered_df = filtered_df[filtered_df[col].isin(unique_values)]
    return filtered_df

df = pd.DataFrame({
    'Col1':['A','A','A','B','B','B'],
    'Col2':['AA','AB','AC','BA','BB','BC'],
    'Val':[1,2,3,1,2,3]
})

col_list = ['Col1','Col2']
get_data(df, col_list)
0赞 OCa 10/4/2023 #2

根据代码的打印输出,您试图计算两个(动态指定的)列中每个元素组合的值总和 ('Val')。

下面重现了打印输出,而不对列进行硬编码,并将结果收集到单个数据帧中:

import pandas as pd
import numpy as np

def get_data(df, col_list):
    '''Operate sum() over all single and two-column groupings'''

    # 1. Grand total
    Total = pd.DataFrame({'group_1' : 'Total',
                          'Sum'     : [df['Val'].sum()]})
    
    # 2. Sums by one value
    Sums_by_1_col = pd.concat([df[[c, 'Val']].groupby(by=c).sum()
                                             .reset_index()
                                             .set_axis(['group_1','Sum'], axis=1) 
                               for c in col_list], axis=0)
    
    # 3.Sums grouping by values in 2 columns at a time
    Sums_by_2_col = df.groupby(by=col_list).sum(
                                          ).reset_index(drop=False
                                          ).set_axis(['group_1','group_2','Sum'], axis=1)
    
    # Assemble 1, 2 and 3
    return pd.concat([Sums_by_2_col,
                      Sums_by_1_col,
                      Total         ], axis=0  
                    ).replace([np.nan], [None] # pd.concat places NaN by default. None better fits a non numerical column.
                    ).reset_index(drop=True)

col_list = ['Col1','Col2']

get_data(df, col_list)

输出:

   group_1 group_2  Sum
0        A      AA    1
1        A      AB    2
2        A      AC    3
3        B      BA    1
4        B      BB    2
5        B      BC    3
6        A    None    6
7        B    None    6
8       AA    None    1
9       AB    None    2
10      AC    None    3
11      BA    None    1
12      BB    None    2
13      BC    None    3
14   Total    None   12

在您的问题中,您专门询问了动态迭代列的问题,但根本没有提到 sum 函数。当然,可以实现另一个聚合函数。

引用