Removing duplicates based on 3-4 columns (dplyr)

Asked by: maldini1990 Asked: 10/20/2023 Updated: 10/20/2023 Views: 45

Q:

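I realize this question may have been asked before, but I am struggling to correctly remove the duplicates in my df. I used the approach recommended here, but it did not remove all the duplicates.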

# Install packages

#Loading packages
library(tidyverse)
library(readxl)
library(writexl)
library(stringr)
library(textclean)
library(lubridate)

Here is my data:

dput(df[1:10,c(1,2,3,4,5,6,7)])

Data output:

structure(list(username = c("Engineeer", "ftpofmpo", "sagood",
"ishtarsg", "Ohayo!", "Engineeer"), post = c("Engineers are si ginnas who recently graduated from Universities. No one stays as an Engineer like forever.\nEngineering is harder than Business but more fulfilling in the long run.\nEngineer > Manager > Director > Chief Technology Officer > Chief Executive Officer\n\tzero to sixty times",
"\n\t\n\t\t\n\t\t\t\n\t\t\t\tEngineeer said:\n\t\t\t\n\t\t\n\t\n\t\n\t\t\n\t\t\n\t\t\tThen pick up Engineering. Its harder but more fulfilling in the long run. No one stays as an Engineer like forever.\nEngineer > Manager > Director > Chief Technical Officer > Chief Executive\n\t\t\n\t\tClick to expand...\n\t\n\nhave you seen the past list of president scholars?\nif minister salary pegg to engineer pay jialat liao... check out lky statement on y salary must be high",
"i thought engineering ish dominated by ceca?????", "Always opt to be a priest.",
"after CEO beome mayor then minister?", "\n\t\n\t\t\n\t\t\t\n\t\t\t\tsagood said:\n\t\t\t\n\t\t\n\t\n\t\n\t\t\n\t\t\n\t\t\ti thought engineering ish dominated by ceca?????\n\t\t\n\t\tClick to expand...\n\t\nIf you fret Engineering its fine. Donate these good paying jobs to CECAs."
), date = structure(c(1622851200, 1622851200, 1622851200, 1622851200,
1622851200, 1622851200), tzone = "UTC", class = c("POSIXct",
"POSIXt")), user_status = c("Supremacy Member", "Banned", "Member",
"Arch-Supremacy Member", "Great Supremacy Member", "Supremacy Member"
), treatment_implementation = c(0, 0, 0, 0, 0, 0), month_year = c(2021.41666666667,
2021.41666666667, 2021.41666666667, 2021.41666666667, 2021.41666666667,
2021.41666666667), id = c(255, 296, 747, 389, 634, 255)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))

To remove the duplicated rows, I am dropping them based on the following three columns:

# Drop duplicate observations
df <-
df %>%
filter(duplicated(cbind(username, post, date)))

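After running the code above, I still see duplicated rows when I inspect the data manually. In addition, when I run the same code again after the first attempt, it keeps removing more rows. This is confusing, because I expected all duplicated rows to be removed in a single pass (i.e., when the code is run only once).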

r dplyr tidyr lubridate stringr

Comments

0 votes bdedu 10/20/2023
Since you want to remove the duplicates, shouldn't you negate the condition by adding a "!"? That is: df <- df %>% filter(!duplicated(cbind(username, post, date)))
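
A minimal sketch of why the missing "!" produces the behaviour described in the question. The tibble toy and its values are made up for illustration and are not the asker's data. duplicated() returns TRUE for the second and later occurrences of a row, so filter(duplicated(...)) keeps only those repeats and throws away everything else, and each rerun shrinks the result further.

```r
library(dplyr)

# Hypothetical toy data: row 3 repeats row 1, rows 4 and 5 repeat row 2
toy <- tibble(
  username = c("a", "b", "a", "b", "b"),
  post     = c("x", "y", "x", "y", "y")
)

# duplicated() flags only the second and later occurrences of each row
dup_flag <- duplicated(toy)

# Without "!": keeps ONLY the repeats and drops every first occurrence
kept_repeats <- toy %>%
  filter(duplicated(cbind(username, post)))

# Running the same line again shrinks the result further
kept_again <- kept_repeats %>%
  filter(duplicated(cbind(username, post)))

# With "!": keeps the first occurrence of each username/post combination
deduped <- toy %>%
  filter(!duplicated(cbind(username, post)))
```

Comparing nrow(kept_repeats), nrow(kept_again) and nrow(deduped) shows the difference: only the negated version leaves one row per combination after a single run.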

A:

2 votes Lucca Nielsen 10/20/2023 #1

You can use the distinct function from the dplyr package to filter out duplicates based on specific columns.

df <- df %>%
  distinct(username, post, date, .keep_all = TRUE)
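
To check that a single pass is enough, one option is to count the key combinations afterwards. The sketch below uses a made-up tibble (toy, with hypothetical values) in place of the asker's df. It also illustrates that rows sharing username and date but differing in post are not duplicates under this three-column key, which is the case for the two "Engineeer" rows in the posted sample.

```r
library(dplyr)

# Hypothetical toy data standing in for the asker's df
toy <- tibble(
  username = c("a", "b", "a", "a"),
  post     = c("x", "y", "x", "z"),
  date     = as.POSIXct(rep("2021-06-05", 4), tz = "UTC"),
  id       = c(1, 2, 1, 1)
)

# Keep the first row for each username/post/date combination;
# .keep_all = TRUE retains the other columns (here: id)
deduped <- toy %>%
  distinct(username, post, date, .keep_all = TRUE)

# Check: any key combination that still occurs more than once
leftover <- deduped %>%
  count(username, post, date) %>%
  filter(n > 1)

# Same user and date but a different post is kept as a separate row
same_user_date <- deduped %>%
  filter(username == "a")
```

If leftover has zero rows, no duplicates remain on the chosen key. Rows that still look alike on manual inspection then differ in at least one of the three columns.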