Find duplicates in a dataframe and keep only the highest ones

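Question:

I'm trying to find the higher duplicates per group in a dataframe, so that I can later remove them from another dataframe by index, leaving the main dataframe with no duplicates and only the lowest value.

Basically, suppose we have this dataframe: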

index   group   value
  1       1      402
  2       1      396
  3       2      406
  4       2      416
  5       2      407
  6       2      406
  7       1      200
  8       2      350
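
For anyone who wants to reproduce this, the table above can be rebuilt roughly as follows. This is only a sketch: it assumes pandas is available and that `index` is an ordinary column rather than the frame's index, which is what the first answer's output suggests.

```python
import pandas as pd

# Sample data copied from the table above.
# Assumption: 'index' is a regular column, not the DataFrame index.
df = pd.DataFrame({
    "index": [1, 2, 3, 4, 5, 6, 7, 8],
    "group": [1, 1, 2, 2, 2, 2, 1, 2],
    "value": [402, 396, 406, 416, 407, 406, 200, 350],
})
print(df)
```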

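What I need is to keep, within each run of consecutive rows belonging to the same group, only the rows with the higher values, and to drop the row with the lowest value. The group is either 1 or 2, but the same group can appear in several separate consecutive runs. The resulting dataframe would therefore be: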

index   group   value
  1       1      402
  4       2      416
  5       2      407

Speed is important too, and the solution cannot look ahead.

Python Pandas DataFrame CSV Data manipulation

# map each run of consecutive rows with the same group to a different integer
group_labels = (df.group != df.group.shift()).cumsum()

# find the minimum value within each consecutive run
group_min_val = df.groupby(group_labels)['value'].transform('min')

# keep only the rows whose value is higher than their run's minimum
# (a run with a single row is its own minimum, so it is dropped entirely)
res = df[df.value != group_min_val]

>>> res

   index  group  value
0      1      1    402
3      4      2    416
4      5      2    407

Intermediate results


>>> group_labels

0    1
1    1
2    2
3    2
4    2
5    2
6    3
7    4
Name: group, dtype: int64

>>> group_min_val

0    396
1    396
2    406
3    406
4    406
5    406
6    200
7    350
Name: value, dtype: int64

>>> df.value != group_min_val

0     True
1    False
2    False
3     True
4     True
5    False
6    False
7    False
Name: value, dtype: bool
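
Putting those steps together, a minimal self-contained sketch might look like the following. The function name `rows_above_run_min` and the second frame `main` are hypothetical, made up here for illustration. The last few lines show one possible way to use the result to drop those rows from another frame by index, which is what the question says it eventually wants.

```python
import pandas as pd


def rows_above_run_min(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows whose value is above the minimum of their consecutive run."""
    # A new run starts wherever 'group' differs from the previous row.
    run_id = (df["group"] != df["group"].shift()).cumsum()
    # Broadcast each run's minimum back onto its rows.
    run_min = df.groupby(run_id)["value"].transform("min")
    return df[df["value"] != run_min]


df = pd.DataFrame({
    "index": [1, 2, 3, 4, 5, 6, 7, 8],
    "group": [1, 1, 2, 2, 2, 2, 1, 2],
    "value": [402, 396, 406, 416, 407, 406, 200, 350],
})

res = rows_above_run_min(df)
print(res["index"].tolist())  # [1, 4, 5]

# Hypothetical "other" frame that is keyed by the same index values.
main = df.set_index("index")
main_without_higher = main.drop(index=res["index"].tolist())
print(main_without_higher.index.tolist())  # [2, 3, 6, 7, 8]
```

Note that when the minimum occurs more than once in a run (the two 406 rows here), every occurrence is treated as the minimum and none of them appears in `res`.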

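import pandas as pd

df = pd.DataFrame({'index': [1, 2, 3, 4, 5, 6, 7], 'group': [1, 1, 2, 2, 2, 1, 2],
                   'value': [402, 396, 406, 416, 407, 200, 350]}).set_index('index')
print('Source df:\n', df)
# Label consecutive runs of the same group, rank the values within each run,
# and keep every row except the lowest-ranked one.
df = df[df.groupby(df.group.diff().ne(0).cumsum())['value'].rank(method='first').gt(1)]
print('\nResult df:\n', df)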

Output:

Source df:
        group  value
index              
1          1    402
2          1    396
3          2    406
4          2    416
5          2    407
6          1    200
7          2    350

Result df:
        group  value
index              
1          1    402
4          2    416
5          2    407
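
One thing worth checking: the sample data in this rank-based snippet leaves out the second 406 row from the question's table. The sketch below, which assumes the full eight-row table and pandas imported as `pd`, suggests that `rank(method='first')` keeps one of the tied minimum rows, whereas `rank(method='min')` and the min-comparison approach both give the result the question asks for. The variable names are illustrative only.

```python
import pandas as pd

# Full sample from the question, including the repeated 406 in the second run.
df = pd.DataFrame({
    "index": [1, 2, 3, 4, 5, 6, 7, 8],
    "group": [1, 1, 2, 2, 2, 2, 1, 2],
    "value": [402, 396, 406, 416, 407, 406, 200, 350],
}).set_index("index")

# Label consecutive runs of the same group.
run_id = df["group"].diff().ne(0).cumsum()
values_by_run = df.groupby(run_id)["value"]

# method='first' breaks ties by order of appearance, so only the first 406 gets rank 1.
kept_first = df[values_by_run.rank(method="first").gt(1)].index.tolist()
# method='min' gives every tied minimum rank 1, so all of them are dropped.
kept_min_rank = df[values_by_run.rank(method="min").gt(1)].index.tolist()
# Comparing against the run minimum also drops every tied minimum.
kept_min_cmp = df[df["value"] != values_by_run.transform("min")].index.tolist()

print(kept_first)     # [1, 4, 5, 6]
print(kept_min_rank)  # [1, 4, 5]
print(kept_min_cmp)   # [1, 4, 5]
```

If ties at the minimum can occur in the real data, `method='min'` looks like the safer choice for matching the expected output in the question.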
