How to get tidy fill function to use groups when filling data rows

Asked by: user8229029 Asked: 11/13/2023 Last edited by: neilfwsuser8229029 Updated: 11/15/2023 Views: 78

Q:

I have rows of data that share a date and record number, but each record holds its data in different columns, as in the following data:

my_data <- structure(list(Station_Number = c("0045", "0045", "0045", "0045"), 
Station_Name = c("Scottsbluff 2NW", "Scottsbluff 2NW", "Scottsbluff 2NW", "Scottsbluff 
2NW"), NWSLI = c("SBMN1", "SBMN1", "SBMN1", "SBMN1"), Station_Legacy_Name = 
c("Scottsbluff", "Scottsbluff", "Scottsbluff", "Scottsbluff"), Lat = c(41.89175, 
41.89175, 41.89175, 41.89175), Lon = c(-103.6814722, -103.6814722, -103.6814722, 
-103.6814722), Elev_m = c("1198.54", "1198.54", "1198.54", "1198.54"), TIMESTAMP = 
c("2017-06-18 04:00:00","2017-06-18 04:00:00", "2017-06-18 04:01:00", "2017-06-18 04:01:00"), RECORD = c(214037L, 214037L, 214038L, 214038L), Ta_2m_Avg = c(NA_real_, 
NA_real_, NA_real_, NA_real_), TaMax_2m = c(NA_real_, NA_real_, NA_real_, NA_real_), 
LithBatt_Min = c(NA_real_, NA_real_, NA_real_, NA_real_), MaintMode = c(NA_integer_, 
NA_integer_, NA_integer_, NA_integer_), WndSpd5s_3m_Avg = c(1.373, NA, 1.786, NA), 
WndMaxSpd5s_3m_Avg = c(NA, 1.373, NA, 1.786), source = 
c("E:\\Data_Collected_withDups/Scottsbluff/NE-Scottsbluff-CR1000-Table018-dat", 
"E:\\Data_Collected_withDups/Scottsbluff/NE-Scottsbluff-CR1000-Table018-dat.1.backup", 
"E:\\Data_Collected_withDups/Scottsbluff/NE-Scottsbluff-CR1000-Table018-dat", 
"E:\\Data_Collected_withDups/Scottsbluff/NE-Scottsbluff-CR1000-Table018-dat.1.backup"
)), row.names = c(2989579L, 5089877L, 2989580L, 5089878L), class = "data.frame")

Ignoring the last column, I want to group by TIMESTAMP and fill everything. I have tried various versions of

my_data[,1:15] |> group_by(TIMESTAMP) |> tidyr::fill(everything(), .direction = 'up') |> 
head()

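Shouldn't this produce a data frame with two rows, with the 'WndSpd5s_3m_Avg' and 'WndMaxSpd5s_3m_Avg' variables filled in? I have tried many versions with no luck. I am also not sure how, or whether, to use the head() part of the code in this case. Any help would be great.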

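r dplyr tidyverse fill

Comments

0 votes Jon Spring 11/13/2023
I would try slice_head(n = 1) in place of head() to get the first row of each group.
0 votes Mark 11/15/2023
@JonSpring for an alternative that handles the missing values, see my answer :-)
0 votes Mark 11/15/2023
Re: using head(): OP, you don't need it here, but it is helpful if you are using dput().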

A:

1 vote seansteele 11/13/2023 #1

Is this what you are trying to do?

library(dplyr)

my_data |>
  group_by(TIMESTAMP) |>
  mutate(WndSpd5s_3m_Avg = max(WndSpd5s_3m_Avg, na.rm = TRUE),
         WndMaxSpd5s_3m_Avg = max(WndMaxSpd5s_3m_Avg, na.rm = TRUE)) |>
  slice_head(n = 1) |>
  select(TIMESTAMP, WndSpd5s_3m_Avg, WndMaxSpd5s_3m_Avg)

# A tibble: 2 × 3
# Groups:   TIMESTAMP [2]
  TIMESTAMP           WndSpd5s_3m_Avg WndMaxSpd5s_3m_Avg
  <chr>                         <dbl>              <dbl>
1 2017-06-18 04:00:00            1.37               1.37
2 2017-06-18 04:01:00            1.79               1.79

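I used select() to keep only these 3 columns; if you remove that line, all columns are returned. Also remember to assign the result to a variable if you want to keep it.

2 votes maike 11/13/2023 #2

Assuming you want the (single) non-missing entry of each WndSpd column for every timestamp, you can simply do the following: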

library(tidyverse)

my_data %>% 
  group_by(TIMESTAMP) %>%  
  summarize(across(c('WndSpd5s_3m_Avg', 'WndMaxSpd5s_3m_Avg'), na.omit))
#> # A tibble: 2 × 3
#>   TIMESTAMP           WndSpd5s_3m_Avg WndMaxSpd5s_3m_Avg
#>   <chr>                         <dbl>              <dbl>
#> 1 2017-06-18 04:00:00            1.37               1.37
#> 2 2017-06-18 04:01:00            1.79               1.79

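Created on 2023-11-13 with reprex v2.0.2

Using fill() does not change the number of rows in the dataset (just like mutate()). summarize(), in contrast, ends up with one row per group (here, the number of unique timestamps).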
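
To make the row-count difference concrete, here is a minimal sketch on a small toy data frame. The toy data and the column names wnd_avg and wnd_max are made up for illustration; they only mimic the NA pattern in the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two timestamps, two rows each, NAs alternating
toy <- data.frame(
  TIMESTAMP = c("04:00", "04:00", "04:01", "04:01"),
  wnd_avg   = c(1.373, NA, 1.786, NA),
  wnd_max   = c(NA, 1.373, NA, 1.786)
)

# fill() only replaces NAs within each group; every row is kept
filled <- toy |>
  group_by(TIMESTAMP) |>
  fill(wnd_avg, wnd_max, .direction = "up") |>
  ungroup()

# summarise() collapses each group to one row
# (here: the first non-NA value of each column)
summarised <- toy |>
  group_by(TIMESTAMP) |>
  summarise(across(c(wnd_avg, wnd_max), ~ .x[!is.na(.x)][1]))

nrow(filled)      # 4, same as nrow(toy)
nrow(summarised)  # 2, one row per unique TIMESTAMP
```

Taking the first non-NA value is just one way to collapse a group; it is used here instead of na.omit so that the sketch still returns one row when a group has several non-missing values.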

Comments

0 votes Mark 11/15/2023
Side note: if you want to keep the other columns, you can use summarise(my_data, across(-starts_with("Wnd"), first), across(starts_with("Wnd"), na.omit), .by = "TIMESTAMP")

0 votes Mark 11/15/2023 #3

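If you are just looking for a way to get a data frame with two rows, the other answers are great. However, I would like to answer your actual question and try to explain why your code does not work the way you want.

The fill() function fills in missing (NA) values from the previous or next row, depending on how you set .direction. It does not remove (or add) rows.

Here is what the data looks like initially: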

# A tibble: 4 × 15
  Station_Number Station_Name        NWSLI Station_Legacy_Name   Lat   Lon
  <chr>          <chr>               <chr> <chr>               <dbl> <dbl>
1 0045           "Scottsbluff 2NW"   SBMN1 Scottsbluff          41.9 -104.
2 0045           "Scottsbluff 2NW"   SBMN1 Scottsbluff          41.9 -104.
3 0045           "Scottsbluff 2NW"   SBMN1 Scottsbluff          41.9 -104.
4 0045           "Scottsbluff \n2NW" SBMN1 Scottsbluff          41.9 -104.
  Elev_m  TIMESTAMP           RECORD Ta_2m_Avg TaMax_2m LithBatt_Min MaintMode
  <chr>   <chr>                <int>     <dbl>    <dbl>        <dbl>     <int>
1 1198.54 2017-06-18 04:00:00 214037        NA       NA           NA        NA
2 1198.54 2017-06-18 04:00:00 214037        NA       NA           NA        NA
3 1198.54 2017-06-18 04:01:00 214038        NA       NA           NA        NA
4 1198.54 2017-06-18 04:01:00 214038        NA       NA           NA        NA
  WndSpd5s_3m_Avg WndMaxSpd5s_3m_Avg
            <dbl>              <dbl>
1            1.37              NA   
2           NA                  1.37
3            1.79              NA   
4           NA                  1.79

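There are two timestamps, with two rows each. The missing values in the wind speed columns form a checkerboard pattern.

When your code runs (with .direction = 'up'), the wind speed columns change to these values: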

  WndSpd5s_3m_Avg WndMaxSpd5s_3m_Avg
            <dbl>              <dbl>
1            1.37               1.37
2           NA                  1.37
3            1.79               1.79
4           NA                  1.79

As you can see, the WndMaxSpd5s_3m_Avg values have been filled in, but the WndSpd5s_3m_Avg values have not. If you switch 'up' to 'down', the opposite happens.
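
A minimal sketch of that asymmetry, using a small toy data frame rather than the full data. The toy data, the column names wnd_avg and wnd_max, and the helper fill_by_timestamp() are all made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with the same checkerboard NA pattern
toy <- data.frame(
  TIMESTAMP = c("04:00", "04:00", "04:01", "04:01"),
  wnd_avg   = c(1.373, NA, 1.786, NA),
  wnd_max   = c(NA, 1.373, NA, 1.786)
)

# Hypothetical helper: grouped fill in a single direction
fill_by_timestamp <- function(data, direction) {
  data |>
    group_by(TIMESTAMP) |>
    fill(wnd_avg, wnd_max, .direction = direction) |>
    ungroup()
}

up   <- fill_by_timestamp(toy, "up")
down <- fill_by_timestamp(toy, "down")

# "up" copies values from the row below, so only wnd_max gets completed
anyNA(up$wnd_avg)    # TRUE
anyNA(up$wnd_max)    # FALSE

# "down" copies values from the row above, so only wnd_avg gets completed
anyNA(down$wnd_avg)  # FALSE
anyNA(down$wnd_max)  # TRUE
```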

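The solution is to change the direction to 'downup' (i.e. first down and then up) or 'updown' (first up and then down). Which one you pick may matter if a timestamp has more than one non-NA value, but since yours do not, it makes no difference here.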
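
Put together, it could look something like the sketch below. It uses a small toy data frame (made up for illustration, as are the column names wnd_avg and wnd_max) and adds slice_head(n = 1), as suggested in the other answers, to go from filled rows to one row per timestamp.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with the same checkerboard NA pattern
toy <- data.frame(
  TIMESTAMP = c("04:00", "04:00", "04:01", "04:01"),
  wnd_avg   = c(1.373, NA, 1.786, NA),
  wnd_max   = c(NA, 1.373, NA, 1.786)
)

result <- toy |>
  group_by(TIMESTAMP) |>
  # fill down first, then up, so every NA in a group gets a value
  fill(wnd_avg, wnd_max, .direction = "downup") |>
  # fill() keeps all rows, so keep just the first row of each group
  slice_head(n = 1) |>
  ungroup()

nrow(result)   # 2
anyNA(result)  # FALSE
```

On the real data you would fill the actual column names (or a selection such as starts_with("Wnd")) instead of the toy ones.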