在具有相同后缀的列中，如果存在字符串匹配，则求和-解网

问：

我有一个湖泊调查数据集：

structure(list(station_id = c("SB_L004", "SB_L005", "SB_P005", 
"SB_P007", "SB_P008", "SB_P009", "SB_P015"), zone_a_bin1 = c("Shrub, Wetland, Mixed forest", 
"Wetland, Deciduous forest", "Shrub, Wetland, Coniferous forest", 
"Meadow, Wetland, Deciduous forest", "Grass/Lawn", "Grass/Lawn, Wetland", 
"Grass/Lawn, Meadow"), zone_b_bin1 = c("Shrub, Wetland, Mixed forest", 
"Wetland, Deciduous forest", NA, "Mixed forest", NA, NA, "Meadow, Mixed forest"
), zone_c_bin1 = c("Meadow, Wetland, Mixed forest", "Shrub, Wetland, Deciduous forest", 
NA, NA, NA, NA, NA), zone_d_bin1 = c("Meadow, Shrub, Wetland, Coniferous forest", 
NA, NA, NA, NA, NA, NA), zone_a_prop = c(0.1369, 0.2096, 1, 0.898, 
1, 1, 0.5327), zone_b_prop = c(0.2433, 0.4, 0, 0.102, 0, 0, 0.4673
), zone_c_prop = c(0.1109, 0.3904, 0, 0, 0, 0, 0), zone_d_prop = c(0.5089, 
0, 0, 0, 0, 0, 0)), row.names = c(NA, -7L), class = c("tbl_df", 
"tbl", "data.frame"))

下面是一个子集：

station_id	zone_a_bin1	zone_b_bin1	zone_a_prop	zone_b_prop
SB_P007	草甸，湿地，落叶林	混交林	0.898	0.102
SB_P015	草地/草坪，草地	草甸，混交林	0.533	0.467

湖泊被划分为不同长度的海岸线区域，因此每个区域对应于总海岸线的一部分（以“prop”结尾的列）。该数据集的一部分包括记录为逗号分隔值的分类栖息地变量，其中列名表示区域和距离海岸的距离。

对于每个距离条柱，我正在尝试计算包含每种栖息地类型的海岸线比例，并针对区域比例进行校正。使用上面的子集，我正在寻找如下输出：

station_id	meadow_bin1
SB_P007	0.898
SB_P015	1.0

对于每个栖息地类型-距离图格组合。我用、和制作了新列：mutateif_elsestr_detect

dat <- dat %>% 
  mutate(
    meadow_bin1 =
      (if_else(str_detect(zone_a_bin1, regex("Meadow", ignore_case = TRUE)), zone_a_prop, 0, missing = 0) + 
      if_else(str_detect(zone_b_bin1, regex("Meadow", ignore_case = TRUE)), zone_b_prop, 0, missing = 0) + 
      if_else(str_detect(zone_c_bin1, regex("Meadow", ignore_case = TRUE)), zone_c_prop, 0, missing = 0) +
      if_else(str_detect(zone_d_bin1, regex("Meadow", ignore_case = TRUE)), zone_d_prop, 0, missing = 0) 
           )
  )

但是最多有 8 个区域、至少 8 种栖息地类型和三个距离箱，因此上述计划很快就会变得笨拙。必须有更好的方法，也许在距离箱后缀匹配的列中匹配感兴趣的刺痛（上面示例中的“_bin1草地”），如果存在字符串，则将区域比例相加？

r 字符串 if-statement dplyr

# Load these libraries
library(tidyverse)
library(stringr)

# Replace 'df' with your actual dataframe

# Transform dataframe to long format
df_long <- df %>%
  pivot_longer(
    cols = -station_id,
    names_to = c(".value", "zone", "bin"),
    names_pattern = "(zone_[a-z])_(bin[0-9])_(.*)",
    names_sep = "_"
  )

# Split habitat column into multiple rows
df_long <- df_long %>%
  separate_rows(habitat, sep = ", ")

# Calculate proportion for each habitat within each station_id and bin combination
df_final <- df_long %>%
  group_by(station_id, bin, habitat) %>%
  summarise(proportion = sum(prop), .groups = 'drop')

为了帮助可视化，如果您打印“df_long”（见上文），它应该如下所示：

   station_id zone   bin   prop habitat          
   <chr>      <chr>  <chr> <dbl> <chr>            
 1 SB_L004    zone_a bin1  0.137 Shrub            
 2 SB_L004    zone_a bin1  0.137 Wetland          
 3 SB_L004    zone_a bin1  0.137 Mixed forest     
 ... more rows

修订我在 https://rdrr.io/snippets/ 尝试过这个，它运行没有错误，但我不确定最终分组是否正确：

# Load libraries
library(tidyverse)
library(stringr)

# Assign the structure to df
df <- structure(list(station_id = c("SB_L004", "SB_L005", "SB_P005", 
"SB_P007", "SB_P008", "SB_P009", "SB_P015"), zone_a_bin1 = c("Shrub, Wetland, Mixed forest", 
"Wetland, Deciduous forest", "Shrub, Wetland, Coniferous forest", 
"Meadow, Wetland, Deciduous forest", "Grass/Lawn", "Grass/Lawn, Wetland", 
"Grass/Lawn, Meadow"), zone_b_bin1 = c("Shrub, Wetland, Mixed forest", 
"Wetland, Deciduous forest", NA, "Mixed forest", NA, NA, "Meadow, Mixed forest"
), zone_c_bin1 = c("Meadow, Wetland, Mixed forest", "Shrub, Wetland, Deciduous forest", 
NA, NA, NA, NA, NA), zone_d_bin1 = c("Meadow, Shrub, Wetland, Coniferous forest", 
NA, NA, NA, NA, NA, NA), zone_a_prop = c(0.1369, 0.2096, 1, 0.898, 
1, 1, 0.5327), zone_b_prop = c(0.2433, 0.4, 0, 0.102, 0, 0, 0.4673
), zone_c_prop = c(0.1109, 0.3904, 0, 0, 0, 0, 0), zone_d_prop = c(0.5089, 
0, 0, 0, 0, 0, 0)), row.names = c(NA, -7L), class = c("tbl_df", 
"tbl", "data.frame"))

# Load libraries
library(tidyverse)
library(stringr)

# Assign the structure to df
# ... (same as before)

# Transform dataframe to long format - step 1 for "bin1" columns
df_long_bin1 <- df %>%
  pivot_longer(
    cols = ends_with("bin1"),
    names_to = "zone",
    names_prefix = "zone_",
    values_to = "habitat"
  ) %>%
  mutate(bin = "bin1")

# Transform dataframe to long format - step 2 for "prop" columns
df_long_prop <- df %>%
  pivot_longer(
    cols = ends_with("prop"),
    names_to = "zone",
    names_prefix = "zone_",
    values_to = "prop"
  ) %>%
  mutate(bin = "prop")

# Join the two dataframes together
df_long <- full_join(df_long_bin1, df_long_prop, by = c("station_id", "zone", "bin"))

# Split habitat column into multiple rows
df_long <- df_long %>%
  separate_rows(habitat, sep = ", ")

# Filter out rows where bin is "prop" and habitat is NA
df_long <- df_long %>%
  filter(!(bin == "prop" & is.na(habitat)))

# Calculate proportion for each habitat within each station_id and zone combination
df_final <- df_long %>%
  group_by(station_id, zone, habitat) %>%
  summarise(proportion = sum(prop, na.rm = TRUE), .groups = 'drop')

# Print the result
print(df_final)

结果看起来像这样，所以你可能需要调整上面的内容。

   station_id zone   habitat           proportion
   <chr>      <chr>  <chr>                  <dbl>
 1 SB_L004    a_bin1 Mixed forest               0
 2 SB_L004    a_bin1 Shrub                      0
 3 SB_L004    a_bin1 Wetland                    0
 4 SB_L004    b_bin1 Mixed forest               0
 5 SB_L004    b_bin1 Shrub                      0
 6 SB_L004    b_bin1 Wetland                    0
 7 SB_L004    c_bin1 Meadow                     0
 8 SB_L004    c_bin1 Mixed forest               0
 9 SB_L004    c_bin1 Wetland                    0
10 SB_L004    d_bin1 Coniferous forest          0
# ... with 38 more rows

底线和重要的信息是，为了避免难以忍受的复杂代码，您需要将 df 从宽/短结构转换为窄/长结构。一旦数据被“规范化”，对数字进行分组/聚合的代码将简单得多。

感谢您的回复，尽管我无法让您的“pivot_longer”代码正常工作：“如果您提供多个名称，您还必须提供其中一个或”。为了尝试完成这项工作，我删除了最后一行（'names_sep'）并编辑了示例数据以具有相似的列名（现在格式为'（zone_[a-z}_（bin[0-9]））_（habitat or prop），错误是：“无法将<字符>和<double>组合在一起。此外，以“_prop”结尾的列适用于同一区域中的每个箱，尽管我可以单独处理每个箱。names_tonames_sepnames_patternzone_a_bin1_habitatzone_a_bin1_prop

0赞 Adrian Maxwell 11/15/2023

还行。我有一个修改后的方法，但请注意，我不是在尝试编写最终代码，只是想指出，一旦你正确地布局了转换，就需要将你的数据从宽到窄（透视到非透视），聚合应该很简单。

1赞 CHW 11/16/2023

非常感谢 - 通过使用这个例子，我能够让一切正常。我通过重命名“区域”列，使它们在“_prop”和“_bin1”透视数据之间匹配，以及其他一些小的编辑，从而防止了全零的最终表。再次感谢！

0赞 Adrian Maxwell 11/16/2023

很高兴听到你得到了一个结果。现在，您将知道要避免聚合的宽布局。:-)

上一个：如何从 AAString 对象中子集 AA？

下一个：如何在 R 中子集 String 对象

在具有相同后缀的列中，如果存在字符串匹配，则求和

Across columns with same suffix, sum if a there is a string match

评论

评论