Broadcast groupby result weighted sum(a)/sum(b) as new column in original DataFrame

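Question:

I am trying to create a new column 'ab_weighted' in a Pandas DataFrame, based on two columns 'a' and 'b' of the DataFrame, grouped by 'c'.

Specifically, I am trying to replicate the output of this R code: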

library(data.table)

df = data.table(a = 1:6, 
            b = 7:12,
            c = c('q', 'q', 'q', 'q', 'w', 'w')
            )

df[, ab_weighted := sum(a)/sum(b), by = "c"]
df[, c('c', 'a', 'b', 'ab_weighted')]

Output:

So far, I have tried the following in Python:

import pandas as pd

df = pd.DataFrame({'a':[1,2,3,4,5,6],
               'b':[7,8,9,10,11,12],
               'c':['q', 'q', 'q', 'q', 'w', 'w']
              })

df.groupby(['c'])[['a', 'b']].apply(lambda x: sum(x['a'])/sum(x['b']))

Output:

When I change apply to transform in the code above, I get an error: TypeError: an integer is required

transform() works fine if I use only one column:

import pandas as pd

df = pd.DataFrame({'a':[1,2,3,4,5,6],
               'b':[7,8,9,10,11,12],
               'c':['q', 'q', 'q', 'q', 'w', 'w']
              })

df.groupby(['c'])[['a', 'b']].transform(lambda x: sum(x))

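But obviously, this is not the same answer:

Is there a way to get the result of my R data.table code in Pandas without having to generate intermediate columns (i.e. using pandas transform to directly generate the final column, ab_weighted = sum(a)/sum(b))?

Tags: python, pandas, dataframe, group-by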

df['ab_weighted'] = \
df.groupby('c', group_keys = False)[['a', 'b']].apply(
    lambda x: pd.Series(x.a.sum()/x.b.sum(), 
                        index = x.index).to_frame()
).iloc[:,0]
print(df)

# output 
#    a   b  c  ab_weighted
# 0  1   7  q     0.294118
# 1  2   8  q     0.294118
# 2  3   9  q     0.294118
# 3  4  10  q     0.294118
# 4  5  11  w     0.478261
# 5  6  12  w     0.478261
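
The same column can also be built without apply. The following is a minimal sketch, assuming the toy DataFrame from the question: the built-in 'sum' transform returns per-group sums aligned to the original rows, so the ratio can be taken column-wise afterwards. The name group_sums is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [7, 8, 9, 10, 11, 12],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w']})

# transform('sum') keeps the original index: every row receives
# the sum of its own group for each selected column.
group_sums = df.groupby('c')[['a', 'b']].transform('sum')

# Divide the two broadcast columns to get sum(a)/sum(b) per group.
df['ab_weighted'] = group_sums['a'] / group_sums['b']
print(df)
```

Because the division happens after the broadcast, no per-group Python lambda is needed, which tends to matter on larger frames.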

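-1 votes Contango 3/28/2021 #4

Update 2021-03-28: I do not recommend this answer; I would recommend my other one, as it is cleaner and more efficient.

Try @BENY's answer. If it does not work, it may be due to different indexes.

The solution below is ugly and more complex, but it should give enough clues to make this work with any DataFrame, not just toy ones. This is an area of pandas where the API is admittedly clunky and error-prone, and sometimes there is simply no clean way to get a valid result without jumping through a lot of hoops.

The trick is to ensure that a common index is available and has the same name.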

import pandas as pd

df = pd.DataFrame({'a':[1,2,3,4,5,6],
               'b':[7,8,9,10,11,12],
               'c':['q', 'q', 'q', 'q', 'w', 'w']
              })

df.reset_index(drop=True, inplace=True)

values = df.groupby(['c']).apply(lambda x: sum(x['a'])/sum(x['b']))
# Convert result to dataframe.
df_to_join = values.to_frame()

# Ensure indexes have common names.
df_to_join.index.set_names(["index"], inplace=True)
df.set_index("c", inplace=True)
df.index.set_names(["index"], inplace=True)

# Set column name of result we want.
df_to_join.rename(columns={0: "ab_weighted"}, inplace=True, errors='raise')

# Join result of groupby to original dataframe.
df_result = df.merge(df_to_join, on=["index"])
print(df_result)

# output 
       a   b  ab_weighted
index                    
q      1   7     0.294118
q      2   8     0.294118
q      3   9     0.294118
q      4  10     0.294118
w      5  11     0.478261
w      6  12     0.478261

And to convert the index back to column c:

df_result.reset_index(inplace=True)
df_result.rename(columns={"index": "c"}, inplace=True)
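
The idea of joining a per-group result back onto the rows can be sketched more compactly. This is an illustration under the assumption that the grouping key is a single column: the per-group ratio is a Series indexed by the values of c, so Series.map can look it up for every row. The name ratio_per_group is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [7, 8, 9, 10, 11, 12],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w']})

grouped = df.groupby('c')

# One value per group, indexed by the group key ('q', 'w').
ratio_per_group = grouped['a'].sum() / grouped['b'].sum()

# Look up each row's group key in that Series; the original
# index and row order of df are left untouched.
df['ab_weighted'] = df['c'].map(ratio_per_group)
print(df)
```

This avoids renaming indexes on both sides, at the cost of only working directly for a single key column.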

import numpy as np
import pandas as pd

df = pd.DataFrame({'a':[1,2,3,4,5,6],
               'b':[7,8,9,10,11,12],
               'c':['q', 'q', 'q', 'q', 'w', 'w']
              })

def groupby_transform(df: pd.DataFrame, group_by_column: str, lambda_to_apply) -> np.ndarray:
    """
    Groupby and transform. Returns a column for the original dataframe.
    :param df: Dataframe.
    :param group_by_column: Column(s) to group by.
    :param lambda_to_apply: Lambda.
    :return: Column to append to original dataframe.
    """
    df = df.reset_index(drop=True)  # Dataframe index is now strictly in order of the rows in the original dataframe.
    values = df.groupby(group_by_column).apply(lambda_to_apply)
    values.sort_index(level=1, inplace=True)  # Sorts result into order of original rows in dataframe (as groupby will undo that order when it groups).
    result = np.array(values)  # Convert to a NumPy array; rows are already in the order of the original dataframe.
    if result.shape[0] == 1:  # e.g. if shape is (1,1003), make it (1003,).
        result = result[0]
    return result  # Return column.


df["result"] = groupby_transform(df, "c", lambda x: x["a"].shift(1) + x["b"].shift(1))

Output:

   a   b  c  result
0  1   7  q     NaN
1  2   8  q     8.0
2  3   9  q    10.0
3  4  10  q    12.0
4  5  11  w     NaN
5  6  12  w    16.0

The same as above, as a Pandas extension:

@pd.api.extensions.register_dataframe_accessor("ex")
class GroupbyTransform:
    """
    Groupby and transform. Returns a column for the original dataframe.
    """
    def __init__(self, pandas_obj):
        self._validate(pandas_obj)
        self._obj = pandas_obj

    @staticmethod
    def _validate(obj):
        # TODO: Check that dataframe is sorted, throw if not.
        pass

    def groupby_transform(self, group_by_column: str, lambda_to_apply):
        """
        Groupby and transform. Returns a column for the original dataframe.
        :param df: Dataframe.
        :param group_by_column: Column(s) to group by.
        :param lambda_to_apply: Lambda.
        :return: Column to append to original dataframe.
        """
        df = self._obj.reset_index(drop=True)  # Dataframe index is now strictly in order of the rows in the original dataframe.
        values = df.groupby(group_by_column).apply(lambda_to_apply)
        values.sort_index(level=1, inplace=True)  # Sorts result into order of original rows in dataframe (as groupby will undo that order when it groups).
        result = np.array(values)
        if result.shape[0] == 1:  # e.g. if shape is (1,1003), make it (1003,).
            result = result[0]
        return result

This gives the same output as before:

df["result"] = df.ex.groupby_transform("c", lambda x: x["a"].shift(1) + x["b"].shift(1))
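
For the specific shift example used with groupby_transform, a shorter route may be available. The sketch below assumes only the toy DataFrame from the question and relies on the groupby shift method, which returns a result aligned to the original index; the name shifted is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [7, 8, 9, 10, 11, 12],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w']})

# Shift 'a' and 'b' by one row within each group of 'c'.
# The result keeps the original index, so no re-sorting is needed.
shifted = df.groupby('c')[['a', 'b']].shift(1)

# Add the shifted columns; the first row of each group is NaN.
df['result'] = shifted['a'] + shifted['b']
print(df)
```

The general helper is still useful for lambdas with no built-in groupby equivalent.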

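Unlike R's data.table expressions, pandas' groupby transform() acts on only a single column at a time. So it can sum column 'a' and then column 'b' separately, but it cannot see the sum of 'a' and the sum of 'b' at the same time, since that involves reading across the columns of a row.
Why don't we do df.groupby('c')[['a','b']].sum() as you tried? Because .sum() then collapses (aggregates) each whole group into one row and discards the index, which is not what we want if we want to assign back to the source DataFrame. So we do .transform(pd.Series.sum, axis=0) instead.
Then we do .apply(..., axis=1) or .agg() to compute the ratio a/b.
Finally, we can assign back to the new column df['ab_weighted'], since we preserved the original index.
We could use .assign() instead: df.groupby('c').transform(pd.Series.sum, axis=0).assign(ab_weighted = lambda x: x.a/x.b), but this is annoying because 'c' gets dropped (pandas currently has an ongoing issue with groupby(..., as_index=False) not working properly).
Another pandas trick for complex expressions used to be an intermediate result, say df_abw = df.groupby('c').apply(pd.Series.sum, axis=0).apply(lambda x: x.a/x.b, axis=1).rename('ab_weighted'), followed by df.merge(df_abw, on='c').
Here is a short one-line solution, which does not preserve the index: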

df.groupby('c')[['a','b']].sum().assign(ab_weighted = lambda x: x.a/x.b)

    a   b  ab_weighted
c                     
q  10  34     0.294118
w  11  23     0.478261
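
If the per-row column is still wanted after collapsing, the aggregated table can be joined back on the group key. This is a sketch under the assumption that 'c' is an ordinary column of df; per_group and df_with_ratio are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [7, 8, 9, 10, 11, 12],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w']})

# One row per group, indexed by 'c', with the ratio as an extra column.
per_group = (df.groupby('c')[['a', 'b']].sum()
               .assign(ab_weighted=lambda x: x.a / x.b))

# Join only the ratio column back: df's column 'c' is matched
# against per_group's index, and df's own index is preserved.
df_with_ratio = df.join(per_group['ab_weighted'], on='c')
print(df_with_ratio)
```

Joining only the ratio column avoids name clashes with the existing 'a' and 'b' columns.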
