在具有相同后缀的列中,如果存在字符串匹配,则求和

Across columns with same suffix, sum if a there is a string match

提问人:CHW 提问时间:11/11/2023 最后编辑:r2evansCHW 更新时间:11/15/2023 访问量:36

问:

我有一个湖泊调查数据集:

structure(list(station_id = c("SB_L004", "SB_L005", "SB_P005", 
"SB_P007", "SB_P008", "SB_P009", "SB_P015"), zone_a_bin1 = c("Shrub, Wetland, Mixed forest", 
"Wetland, Deciduous forest", "Shrub, Wetland, Coniferous forest", 
"Meadow, Wetland, Deciduous forest", "Grass/Lawn", "Grass/Lawn, Wetland", 
"Grass/Lawn, Meadow"), zone_b_bin1 = c("Shrub, Wetland, Mixed forest", 
"Wetland, Deciduous forest", NA, "Mixed forest", NA, NA, "Meadow, Mixed forest"
), zone_c_bin1 = c("Meadow, Wetland, Mixed forest", "Shrub, Wetland, Deciduous forest", 
NA, NA, NA, NA, NA), zone_d_bin1 = c("Meadow, Shrub, Wetland, Coniferous forest", 
NA, NA, NA, NA, NA, NA), zone_a_prop = c(0.1369, 0.2096, 1, 0.898, 
1, 1, 0.5327), zone_b_prop = c(0.2433, 0.4, 0, 0.102, 0, 0, 0.4673
), zone_c_prop = c(0.1109, 0.3904, 0, 0, 0, 0, 0), zone_d_prop = c(0.5089, 
0, 0, 0, 0, 0, 0)), row.names = c(NA, -7L), class = c("tbl_df", 
"tbl", "data.frame"))

下面是一个子集:

station_id zone_a_bin1 zone_b_bin1 zone_a_prop zone_b_prop
SB_P007 草甸, 湿地, 落叶林 混交林 0.898 0.102
SB_P015 草地/草坪, 草地 草甸, 混交林 0.533 0.467

湖泊被划分为不同长度的海岸线区域,因此每个区域对应于总海岸线的一部分(以“prop”结尾的列)。该数据集的一部分包括记录为逗号分隔值的分类栖息地变量,其中列名表示区域和距离海岸的距离。

对于每个距离条柱,我正在尝试计算包含每种栖息地类型的海岸线比例,并针对区域比例进行校正。使用上面的子集,我正在寻找如下输出:

station_id meadow_bin1
SB_P007 0.898
SB_P015 1.0

对于每个栖息地类型-距离图格组合。 我用 、 和 制作了新列:mutateif_elsestr_detect

dat <- dat %>% 
  mutate(
    meadow_bin1 =
      (if_else(str_detect(zone_a_bin1, regex("Meadow", ignore_case = TRUE)), zone_a_prop, 0, missing = 0) + 
      if_else(str_detect(zone_b_bin1, regex("Meadow", ignore_case = TRUE)), zone_b_prop, 0, missing = 0) + 
      if_else(str_detect(zone_c_bin1, regex("Meadow", ignore_case = TRUE)), zone_c_prop, 0, missing = 0) +
      if_else(str_detect(zone_d_bin1, regex("Meadow", ignore_case = TRUE)), zone_d_prop, 0, missing = 0) 
           )
  )

但是最多有 8 个区域、至少 8 种栖息地类型和三个距离箱,因此上述计划很快就会变得笨拙。必须有更好的方法,也许在距离箱后缀匹配的列中匹配感兴趣的刺痛(上面示例中的“_bin1草地”),如果存在字符串,则将区域比例相加?

r 字符串 if-statement dplyr

评论


答:

0赞 Adrian Maxwell 11/11/2023 #1

如果你将数据集的结构从“短&宽”改为“长&窄”,问题就会变得简单得多。即,多做一行,少做列,通过这种方式转置数据,你的计算变得更容易。

# Load these libraries
library(tidyverse)
library(stringr)

# Replace 'df' with your actual dataframe

# Transform dataframe to long format
df_long <- df %>%
  pivot_longer(
    cols = -station_id,
    names_to = c(".value", "zone", "bin"),
    names_pattern = "(zone_[a-z])_(bin[0-9])_(.*)",
    names_sep = "_"
  )

# Split habitat column into multiple rows
df_long <- df_long %>%
  separate_rows(habitat, sep = ", ")

# Calculate proportion for each habitat within each station_id and bin combination
df_final <- df_long %>%
  group_by(station_id, bin, habitat) %>%
  summarise(proportion = sum(prop), .groups = 'drop')

为了帮助可视化,如果您打印“df_long”(见上文),它应该如下所示:

   station_id zone   bin   prop habitat          
   <chr>      <chr>  <chr> <dbl> <chr>            
 1 SB_L004    zone_a bin1  0.137 Shrub            
 2 SB_L004    zone_a bin1  0.137 Wetland          
 3 SB_L004    zone_a bin1  0.137 Mixed forest     
 ... more rows

修订我在 https://rdrr.io/snippets/ 尝试过这个,它运行没有错误,但我不确定最终分组是否正确:

# Load libraries
library(tidyverse)
library(stringr)

# Assign the structure to df
df <- structure(list(station_id = c("SB_L004", "SB_L005", "SB_P005", 
"SB_P007", "SB_P008", "SB_P009", "SB_P015"), zone_a_bin1 = c("Shrub, Wetland, Mixed forest", 
"Wetland, Deciduous forest", "Shrub, Wetland, Coniferous forest", 
"Meadow, Wetland, Deciduous forest", "Grass/Lawn", "Grass/Lawn, Wetland", 
"Grass/Lawn, Meadow"), zone_b_bin1 = c("Shrub, Wetland, Mixed forest", 
"Wetland, Deciduous forest", NA, "Mixed forest", NA, NA, "Meadow, Mixed forest"
), zone_c_bin1 = c("Meadow, Wetland, Mixed forest", "Shrub, Wetland, Deciduous forest", 
NA, NA, NA, NA, NA), zone_d_bin1 = c("Meadow, Shrub, Wetland, Coniferous forest", 
NA, NA, NA, NA, NA, NA), zone_a_prop = c(0.1369, 0.2096, 1, 0.898, 
1, 1, 0.5327), zone_b_prop = c(0.2433, 0.4, 0, 0.102, 0, 0, 0.4673
), zone_c_prop = c(0.1109, 0.3904, 0, 0, 0, 0, 0), zone_d_prop = c(0.5089, 
0, 0, 0, 0, 0, 0)), row.names = c(NA, -7L), class = c("tbl_df", 
"tbl", "data.frame"))

# Load libraries
library(tidyverse)
library(stringr)

# Assign the structure to df
# ... (same as before)

# Transform dataframe to long format - step 1 for "bin1" columns
df_long_bin1 <- df %>%
  pivot_longer(
    cols = ends_with("bin1"),
    names_to = "zone",
    names_prefix = "zone_",
    values_to = "habitat"
  ) %>%
  mutate(bin = "bin1")

# Transform dataframe to long format - step 2 for "prop" columns
df_long_prop <- df %>%
  pivot_longer(
    cols = ends_with("prop"),
    names_to = "zone",
    names_prefix = "zone_",
    values_to = "prop"
  ) %>%
  mutate(bin = "prop")

# Join the two dataframes together
df_long <- full_join(df_long_bin1, df_long_prop, by = c("station_id", "zone", "bin"))

# Split habitat column into multiple rows
df_long <- df_long %>%
  separate_rows(habitat, sep = ", ")

# Filter out rows where bin is "prop" and habitat is NA
df_long <- df_long %>%
  filter(!(bin == "prop" & is.na(habitat)))

# Calculate proportion for each habitat within each station_id and zone combination
df_final <- df_long %>%
  group_by(station_id, zone, habitat) %>%
  summarise(proportion = sum(prop, na.rm = TRUE), .groups = 'drop')

# Print the result
print(df_final)

结果看起来像这样,所以你可能需要调整上面的内容。

   station_id zone   habitat           proportion
   <chr>      <chr>  <chr>                  <dbl>
 1 SB_L004    a_bin1 Mixed forest               0
 2 SB_L004    a_bin1 Shrub                      0
 3 SB_L004    a_bin1 Wetland                    0
 4 SB_L004    b_bin1 Mixed forest               0
 5 SB_L004    b_bin1 Shrub                      0
 6 SB_L004    b_bin1 Wetland                    0
 7 SB_L004    c_bin1 Meadow                     0
 8 SB_L004    c_bin1 Mixed forest               0
 9 SB_L004    c_bin1 Wetland                    0
10 SB_L004    d_bin1 Coniferous forest          0
# ... with 38 more rows

底线和重要的信息是,为了避免难以忍受的复杂代码,您需要将 df 从宽/短结构转换为窄/长结构。一旦数据被“规范化”,对数字进行分组/聚合的代码将简单得多。

评论

0赞 CHW 11/15/2023
感谢您的回复,尽管我无法让您的“pivot_longer”代码正常工作:“如果您提供多个名称,您还必须提供其中一个或”。为了尝试完成这项工作,我删除了最后一行('names_sep')并编辑了示例数据以具有相似的列名(现在格式为'(zone_[a-z}_(bin[0-9]))_(habitat or prop),错误是:“无法将<字符>和<double>组合在一起。此外,以“_prop”结尾的列适用于同一区域中的每个箱,尽管我可以单独处理每个箱。names_tonames_sepnames_patternzone_a_bin1_habitatzone_a_bin1_prop
0赞 Adrian Maxwell 11/15/2023
还行。我有一个修改后的方法,但请注意,我不是在尝试编写最终代码,只是想指出,一旦你正确地布局了转换,就需要将你的数据从宽到窄(透视到非透视),聚合应该很简单。
1赞 CHW 11/16/2023
非常感谢 - 通过使用这个例子,我能够让一切正常。我通过重命名“区域”列,使它们在“_prop”和“_bin1”透视数据之间匹配,以及其他一些小的编辑,从而防止了全零的最终表。再次感谢!
0赞 Adrian Maxwell 11/16/2023
很高兴听到你得到了一个结果。现在,您将知道要避免聚合的宽布局。:-)