Asked by: Peter.2055 Asked: 6/16/2023 Last edited by: Peter.2055 Updated: 6/16/2023 Views: 67
Manipulate dataframe and summarise
Q:
I have a large dataframe that I need to filter and then compute some summary statistics on. The df looks like this:
ID<-c("KMT1","KMT1","KMT1","KMT2","KMT2","KMT2","KMT3","KMT3","KMT3")
Date<-c("01-03-2015","01-03-2015","01-03-2015",
"04-06-2014","04-06-2014","04-06-2014",
"07-01-2019","07-01-2019","07-01-2019")
TimeUTC<-c("10:22:05","10:22:05","10:22:05",
"10:25:05","10:25:05","10:25:05",
"10:23:05","10:23:05","10:23:05")
V1<-c(0.01,0.003,0.04,0.03,0.02,0.05,0.03,0.1,0.02)
V2<-c(0.02,0.002,0.02,0.003,0.002,0.09,0.01,0.05,0.023)
V3<-c(0.04,0.008,0.06,0.09,0.004,0.05,0.01,0.003,0.04)
V4<-c(0.08,0.009,0.08,0.09,0.004,0.05,0.05,0.03,0.1)
Flag1<-c(0,0,0,1,0,1,0,0,1)
Flag2<-c(0,0,0,0,0,1,1,0,0)
Flag3<-c(0,0,0,0,0,0,1,1,0)
df1<-data.frame(ID,Date,TimeUTC,V1,V2,V3,V4,Flag1,Flag2,Flag3)
df1
ID Date TimeUTC V1 V2 V3 V4 Flag1 Flag2 Flag3
1 KMT1 01-03-2015 10:22:05 0.010 0.020 0.040 0.080 0 0 0
2 KMT1 01-03-2015 10:22:05 0.003 0.002 0.008 0.009 0 0 0
3 KMT1 01-03-2015 10:22:05 0.040 0.020 0.060 0.080 0 0 0
4 KMT2 04-06-2014 10:25:05 0.030 0.003 0.090 0.090 1 0 0
5 KMT2 04-06-2014 10:25:05 0.020 0.002 0.004 0.004 0 0 0
6 KMT2 04-06-2014 10:25:05 0.050 0.090 0.050 0.050 1 1 0
7 KMT3 07-01-2019 10:23:05 0.030 0.010 0.010 0.050 0 1 1
8 KMT3 07-01-2019 10:23:05 0.100 0.050 0.003 0.030 0 0 1
9 KMT3 07-01-2019 10:23:05 0.020 0.023 0.040 0.100 1 0 0
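I would like to filter the df by the ID column so that I can assess whether, and how many, flags (0/1) are active for an ID and, if so, which flags. Then, if a flag is active, I need to remove the whole ID group, or, if the flag is not important (for example, only Flag3 is active), remove just one of the entries.
After filtering the data I would like to compute the median for each ID, so that the df has 2 rows. The desired output would look like this:
(A) ID (KMT3) removed because Flag3 is active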
ID Date TimeUTC V1.med V2.med V3.med V4.med
1 KMT1 01-03-2015 10:22:05 0.010 0.020 0.040 0.080
2 KMT2 04-06-2014 10:25:05 0.025 0.025 0.047 0.047
(B) IDs (KMT2 and KMT3) removed because flags are active
ID Date TimeUTC V1.med V2.med V3.med V4.med
1 KMT1 01-03-2015 10:22:05 0.01 0.02 0.04 0.08
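I am new to R and not sure of the best way to approach this. I tried filter() from the dplyr package, which removes all rows containing an active flag (e.g. Flag1 = 1); however, I need to assess which flags are active and decide how to remove the IDs.
I managed to compute the median of each column using the following: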
DT1<-df2 %>%
select("ID",V1:V4)%>%
group_by(ID)%>%
data.table()%>%
na.omit()
setnames(DT1[, sapply(.SD, function(x) list(median(x))), by=ID], c("ID", sapply(names(DT1)[-1], paste0, c(".median"))))
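However, doing this loses the "Date" and "Time" columns, which are important for later processing.
Any help would be greatly appreciated.
A:
0 votes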
Gregor Thomas
6/16/2023
#1
Here is some code that gets you 1 row per ID, with the total number of flags, the number of flags of each type, and the medians of the V* columns:
library(dplyr)
df1 |>
summarize(
across(starts_with("Flag"), sum, .names = "n_{.col}"),
across(V1:V4, median, .names = "med_{.col}"),
.by = c(ID, Date, TimeUTC)
) |>
mutate(
n_total_flags = rowSums(across(starts_with("n_"))),
)
# ID Date TimeUTC n_Flag1 n_Flag2 n_Flag3 med_V1 med_V2 med_V3 med_V4 n_total_flags
# 1 KMT1 01-03-2015 10:22:05 0 0 0 0.01 0.020 0.04 0.08 0
# 2 KMT2 04-06-2014 10:25:05 2 1 0 0.03 0.003 0.05 0.05 3
# 3 KMT3 07-01-2019 10:23:05 1 1 2 0.03 0.023 0.01 0.05 4
You can use this result to apply whatever logic you need, then join it back to the original data to filter.
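To make the "join back to filter" step concrete, here is a minimal sketch. It assumes dplyr 1.1.0 or later (for `.by`), and the two drop rules are only my reading of the question: rule (B) drops every ID with any active flag, and the Flag3 rule drops every ID where Flag3 is active at least once. The names `flag_counts`, `res_a` and `res_b` are made up for illustration.

```r
library(dplyr)

# Example data copied from the question
df1 <- data.frame(
  ID      = rep(c("KMT1", "KMT2", "KMT3"), each = 3),
  Date    = rep(c("01-03-2015", "04-06-2014", "07-01-2019"), each = 3),
  TimeUTC = rep(c("10:22:05", "10:25:05", "10:23:05"), each = 3),
  V1 = c(0.01, 0.003, 0.04, 0.03, 0.02, 0.05, 0.03, 0.1, 0.02),
  V2 = c(0.02, 0.002, 0.02, 0.003, 0.002, 0.09, 0.01, 0.05, 0.023),
  V3 = c(0.04, 0.008, 0.06, 0.09, 0.004, 0.05, 0.01, 0.003, 0.04),
  V4 = c(0.08, 0.009, 0.08, 0.09, 0.004, 0.05, 0.05, 0.03, 0.1),
  Flag1 = c(0, 0, 0, 1, 0, 1, 0, 0, 1),
  Flag2 = c(0, 0, 0, 0, 0, 1, 1, 0, 0),
  Flag3 = c(0, 0, 0, 0, 0, 0, 1, 1, 0)
)

# One row per ID with the number of active flags of each type
flag_counts <- df1 |>
  summarize(across(starts_with("Flag"), sum, .names = "n_{.col}"), .by = ID)

# Rule (B), as assumed here: drop every ID that has any active flag.
# anti_join() keeps only the rows of df1 whose ID is NOT in the filtered summary.
res_b <- df1 |>
  anti_join(filter(flag_counts, n_Flag1 + n_Flag2 + n_Flag3 > 0), by = "ID") |>
  summarize(across(V1:V4, median, .names = "{.col}.med"),
            .by = c(ID, Date, TimeUTC))

# Flag3 rule, as assumed here: drop only the IDs where Flag3 is active
res_a <- df1 |>
  anti_join(filter(flag_counts, n_Flag3 > 0), by = "ID") |>
  summarize(across(V1:V4, median, .names = "{.col}.med"),
            .by = c(ID, Date, TimeUTC))
```

With the sample data, `res_b` keeps only KMT1, and `res_a` keeps KMT1 and KMT2. Note that this sketch never removes individual rows inside an ID that is kept, so the KMT2 medians in `res_a` are taken over all three of its rows. If you also need to drop single flagged rows, add a row-level `filter()` before the final `summarize()`.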
Comments
0 votes
Peter.2055
6/16/2023
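Hi @GregorThomas, thanks! This is great, I did not know about across(). I have now edited the question: it should say "compute the median for each ID" rather than for each column. Can this still be done with your code above?
0 votes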
Gregor Thomas
6/17/2023
Do you mean you want a single median across the V1:V4 columns, rather than a median for each column?
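For reference, the "single median" reading could be sketched as below. It pools all the V1:V4 values of an ID into one vector before taking the median. This is only one interpretation of the question; the column name `V.med` is made up for illustration, and `.by` again assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Only the columns needed for this illustration, copied from the question
df1 <- data.frame(
  ID = rep(c("KMT1", "KMT2", "KMT3"), each = 3),
  V1 = c(0.01, 0.003, 0.04, 0.03, 0.02, 0.05, 0.03, 0.1, 0.02),
  V2 = c(0.02, 0.002, 0.02, 0.003, 0.002, 0.09, 0.01, 0.05, 0.023),
  V3 = c(0.04, 0.008, 0.06, 0.09, 0.004, 0.05, 0.01, 0.003, 0.04),
  V4 = c(0.08, 0.009, 0.08, 0.09, 0.004, 0.05, 0.05, 0.03, 0.1)
)

# Pool the 12 values (3 rows x 4 columns) of each ID, then take one median
pooled <- df1 |>
  summarize(V.med = median(c(V1, V2, V3, V4)), .by = ID)
```

This returns one row and one median per ID, whereas the code in the answer returns one median per V column for each ID.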
Comments
df2 |> mutate(across(V1:V4, median, .names = "{.col}.med"), .by = ID)