Asked by: user8229029 Asked: 11/13/2023 Last edited by: neilfws, user8229029 Updated: 11/15/2023 Views: 78
How to get tidy fill function to use groups when filling data rows
Q:
I have rows of data that share a date and record number, but in each record different columns contain the data, as in the following example:
my_data <- structure(list(Station_Number = c("0045", "0045", "0045", "0045"),
Station_Name = c("Scottsbluff 2NW", "Scottsbluff 2NW", "Scottsbluff 2NW", "Scottsbluff
2NW"), NWSLI = c("SBMN1", "SBMN1", "SBMN1", "SBMN1"), Station_Legacy_Name =
c("Scottsbluff", "Scottsbluff", "Scottsbluff", "Scottsbluff"), Lat = c(41.89175,
41.89175, 41.89175, 41.89175), Lon = c(-103.6814722, -103.6814722, -103.6814722,
-103.6814722), Elev_m = c("1198.54", "1198.54", "1198.54", "1198.54"), TIMESTAMP =
c("2017-06-18 04:00:00","2017-06-18 04:00:00", "2017-06-18 04:01:00", "2017-06-18 04:01:00"), RECORD = c(214037L, 214037L, 214038L, 214038L), Ta_2m_Avg = c(NA_real_,
NA_real_, NA_real_, NA_real_), TaMax_2m = c(NA_real_, NA_real_, NA_real_, NA_real_),
LithBatt_Min = c(NA_real_, NA_real_, NA_real_, NA_real_), MaintMode = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_), WndSpd5s_3m_Avg = c(1.373, NA, 1.786, NA),
WndMaxSpd5s_3m_Avg = c(NA, 1.373, NA, 1.786), source =
c("E:\\Data_Collected_withDups/Scottsbluff/NE-Scottsbluff-CR1000-Table018-dat",
"E:\\Data_Collected_withDups/Scottsbluff/NE-Scottsbluff-CR1000-Table018-dat.1.backup",
"E:\\Data_Collected_withDups/Scottsbluff/NE-Scottsbluff-CR1000-Table018-dat",
"E:\\Data_Collected_withDups/Scottsbluff/NE-Scottsbluff-CR1000-Table018-dat.1.backup"
)), row.names = c(2989579L, 5089877L, 2989580L, 5089878L), class = "data.frame")
Ignoring the last column, I'd like to group by TIMESTAMP and fill everything. I've tried various versions of
my_data[,1:15] |> group_by(TIMESTAMP) |> tidyr::fill(everything(), .direction = 'up') |>
head()
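Shouldn't this produce a data frame with two rows, with the 'WndSpd5s_3m_Avg' and 'WndMaxSpd5s_3m_Avg' variables filled in? I've tried many versions with no luck. I also don't know how, or whether, to use the head() part of the code in this situation. Any help would be great.
A:
Is this what you are trying to do?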
library(dplyr)
my_data |>
group_by(TIMESTAMP) |>
mutate(WndSpd5s_3m_Avg = max(WndSpd5s_3m_Avg, na.rm = TRUE),
WndMaxSpd5s_3m_Avg = max(WndMaxSpd5s_3m_Avg, na.rm = TRUE)) |>
slice_head(n = 1) |>
select(TIMESTAMP, WndSpd5s_3m_Avg, WndMaxSpd5s_3m_Avg)
# A tibble: 2 × 3
# Groups: TIMESTAMP [2]
TIMESTAMP WndSpd5s_3m_Avg WndMaxSpd5s_3m_Avg
<chr> <dbl> <dbl>
1 2017-06-18 04:00:00 1.37 1.37
2 2017-06-18 04:01:00 1.79 1.79
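I used select() to keep only those 3 columns; if you remove that line, all columns are returned. Also remember to assign the result to a variable if you want to keep it.
Assuming you want the (single) non-missing entry of each WndSpd column for every timestamp, you can simply do the following: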
library(tidyverse)
my_data %>%
group_by(TIMESTAMP) %>%
summarize(across(c('WndSpd5s_3m_Avg', 'WndMaxSpd5s_3m_Avg'), na.omit))
#> # A tibble: 2 × 3
#> TIMESTAMP WndSpd5s_3m_Avg WndMaxSpd5s_3m_Avg
#> <chr> <dbl> <dbl>
#> 1 2017-06-18 04:00:00 1.37 1.37
#> 2 2017-06-18 04:01:00 1.79 1.79
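Created on 2023-11-13 with reprex v2.0.2
Using fill() does not change the number of rows in the dataset (just like mutate()). summarize(), by contrast, ends up with one row per group (here, the number of unique timestamps).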
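The row-count difference between fill() and summarize() can be checked on a small toy data frame. This is only a minimal sketch: the `toy` tibble, its column names `x` and `y`, and the timestamp labels are made up for illustration and are not part of the question's data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two timestamps, two rows each, checkerboard NAs
toy <- tibble(
  TIMESTAMP = c("t1", "t1", "t2", "t2"),
  x = c(1, NA, 2, NA),
  y = c(NA, 1, NA, 2)
)

# fill() keeps every row, just like mutate() would
filled <- toy |>
  group_by(TIMESTAMP) |>
  fill(x, y, .direction = "up") |>
  ungroup()
nrow(filled)     # 4

# summarise() collapses each group to a single row;
# here each column keeps its first non-missing value per group
collapsed <- toy |>
  group_by(TIMESTAMP) |>
  summarise(across(c(x, y), ~ .x[!is.na(.x)][1]))
nrow(collapsed)  # 2
```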
Comments
summarise(my_data, across(-starts_with("Wnd"), first), across(starts_with("Wnd"), na.omit), .by = "TIMESTAMP")
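If you are only looking for a way to get a data frame with two rows, the other answers are great. However, I wanted to answer your actual question and try to explain why your code doesn't work the way you expect.
The fill() function fills missing (NA) values from the previous or the next row, depending on how you set its direction. It does not remove (or add) rows.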
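To make the two basic directions concrete, here is a small sketch on an ungrouped, single-column data frame. The data frame `df` and its column `v` are hypothetical and only serve to illustrate the behaviour described above.

```r
library(tidyr)

# Hypothetical single-column example
df <- data.frame(v = c(NA, 1, NA, 2, NA))

# "down": each NA takes the last non-missing value above it
down <- fill(df, v, .direction = "down")
# "up": each NA takes the next non-missing value below it
up <- fill(df, v, .direction = "up")

down$v  # NA  1  1  2  2
up$v    #  1  1  2  2 NA
```

In both cases a leading or trailing NA that has no neighbour in the fill direction stays NA, which is the same effect that shows up within each TIMESTAMP group.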
This is what the data looks like originally:
# A tibble: 4 × 15
Station_Number Station_Name NWSLI Station_Legacy_Name Lat Lon
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 0045 "Scottsbluff 2NW" SBMN1 Scottsbluff 41.9 -104.
2 0045 "Scottsbluff 2NW" SBMN1 Scottsbluff 41.9 -104.
3 0045 "Scottsbluff 2NW" SBMN1 Scottsbluff 41.9 -104.
4 0045 "Scottsbluff \n2NW" SBMN1 Scottsbluff 41.9 -104.
Elev_m TIMESTAMP RECORD Ta_2m_Avg TaMax_2m LithBatt_Min MaintMode
<chr> <chr> <int> <dbl> <dbl> <dbl> <int>
1 1198.54 2017-06-18 04:00:00 214037 NA NA NA NA
2 1198.54 2017-06-18 04:00:00 214037 NA NA NA NA
3 1198.54 2017-06-18 04:01:00 214038 NA NA NA NA
4 1198.54 2017-06-18 04:01:00 214038 NA NA NA NA
WndSpd5s_3m_Avg WndMaxSpd5s_3m_Avg
<dbl> <dbl>
1 1.37 NA
2 NA 1.37
3 1.79 NA
4 NA 1.79
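There are two timestamps, each with two rows. The missing values in the wind speed columns form a checkerboard pattern.
When your code runs (with .direction = 'up'), the wind speed columns change to the following values: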
WndSpd5s_3m_Avg WndMaxSpd5s_3m_Avg
<dbl> <dbl>
1 1.37 1.37
2 NA 1.37
3 1.79 1.79
4 NA 1.79
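As you can see, the WndMaxSpd5s_3m_Avg values have been filled in, but the WndSpd5s_3m_Avg values have not. If you switch 'up' to 'down', the opposite happens.
The solution is to change the direction to 'downup' (that is, first down and then up) or 'updown' (first up and then down). Which of the two you pick could matter if you had more than one non-NA value per timestamp, but since you don't, it makes no difference here.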
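A minimal sketch of that fix, using a cut-down copy of the question's data: the `wind` tibble below contains only TIMESTAMP and the two wind columns, with values copied from my_data. The final distinct() step is an added assumption about how to get down to two rows, since fill() by itself keeps all four.

```r
library(dplyr)
library(tidyr)

# Cut-down copy of the question's data (wind columns only)
wind <- tibble(
  TIMESTAMP = c("2017-06-18 04:00:00", "2017-06-18 04:00:00",
                "2017-06-18 04:01:00", "2017-06-18 04:01:00"),
  WndSpd5s_3m_Avg    = c(1.373, NA, 1.786, NA),
  WndMaxSpd5s_3m_Avg = c(NA, 1.373, NA, 1.786)
)

result <- wind |>
  group_by(TIMESTAMP) |>
  # fill down, then up, within each timestamp
  fill(starts_with("Wnd"), .direction = "downup") |>
  ungroup() |>
  # the rows within each timestamp are now identical, so keep one of each
  distinct()

result
```

On the full my_data, columns that differ between the duplicated rows (such as source) would have to be dropped or excluded before distinct() could collapse them.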
Comments
head()