提问人:CHW 提问时间:11/11/2023 最后编辑:r2evansCHW 更新时间:11/15/2023 访问量:36
在具有相同后缀的列中,如果存在字符串匹配,则求和
Across columns with same suffix, sum if a there is a string match
问:
我有一个湖泊调查数据集:
structure(list(station_id = c("SB_L004", "SB_L005", "SB_P005",
"SB_P007", "SB_P008", "SB_P009", "SB_P015"), zone_a_bin1 = c("Shrub, Wetland, Mixed forest",
"Wetland, Deciduous forest", "Shrub, Wetland, Coniferous forest",
"Meadow, Wetland, Deciduous forest", "Grass/Lawn", "Grass/Lawn, Wetland",
"Grass/Lawn, Meadow"), zone_b_bin1 = c("Shrub, Wetland, Mixed forest",
"Wetland, Deciduous forest", NA, "Mixed forest", NA, NA, "Meadow, Mixed forest"
), zone_c_bin1 = c("Meadow, Wetland, Mixed forest", "Shrub, Wetland, Deciduous forest",
NA, NA, NA, NA, NA), zone_d_bin1 = c("Meadow, Shrub, Wetland, Coniferous forest",
NA, NA, NA, NA, NA, NA), zone_a_prop = c(0.1369, 0.2096, 1, 0.898,
1, 1, 0.5327), zone_b_prop = c(0.2433, 0.4, 0, 0.102, 0, 0, 0.4673
), zone_c_prop = c(0.1109, 0.3904, 0, 0, 0, 0, 0), zone_d_prop = c(0.5089,
0, 0, 0, 0, 0, 0)), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))
下面是一个子集:
station_id | zone_a_bin1 | zone_b_bin1 | zone_a_prop | zone_b_prop |
---|---|---|---|---|
SB_P007 | 草甸, 湿地, 落叶林 | 混交林 | 0.898 | 0.102 |
SB_P015 | 草地/草坪, 草地 | 草甸, 混交林 | 0.533 | 0.467 |
湖泊被划分为不同长度的海岸线区域,因此每个区域对应于总海岸线的一部分(以“prop”结尾的列)。该数据集的一部分包括记录为逗号分隔值的分类栖息地变量,其中列名表示区域和距离海岸的距离。
对于每个距离条柱,我正在尝试计算包含每种栖息地类型的海岸线比例,并针对区域比例进行校正。使用上面的子集,我正在寻找如下输出:
station_id | meadow_bin1 |
---|---|
SB_P007 | 0.898 |
SB_P015 | 1.0 |
对于每个栖息地类型-距离图格组合。
我用 、 和 制作了新列:mutate
if_else
str_detect
dat <- dat %>%
mutate(
meadow_bin1 =
(if_else(str_detect(zone_a_bin1, regex("Meadow", ignore_case = TRUE)), zone_a_prop, 0, missing = 0) +
if_else(str_detect(zone_b_bin1, regex("Meadow", ignore_case = TRUE)), zone_b_prop, 0, missing = 0) +
if_else(str_detect(zone_c_bin1, regex("Meadow", ignore_case = TRUE)), zone_c_prop, 0, missing = 0) +
if_else(str_detect(zone_d_bin1, regex("Meadow", ignore_case = TRUE)), zone_d_prop, 0, missing = 0)
)
)
但是最多有 8 个区域、至少 8 种栖息地类型和三个距离箱,因此上述计划很快就会变得笨拙。必须有更好的方法,也许在距离箱后缀匹配的列中匹配感兴趣的刺痛(上面示例中的“_bin1草地”),如果存在字符串,则将区域比例相加?
答:
0赞
Adrian Maxwell
11/11/2023
#1
如果你将数据集的结构从“短&宽”改为“长&窄”,问题就会变得简单得多。即,多做一行,少做列,通过这种方式转置数据,你的计算变得更容易。
# Load these libraries
library(tidyverse)
library(stringr)
# Replace 'df' with your actual dataframe
# Transform dataframe to long format
df_long <- df %>%
pivot_longer(
cols = -station_id,
names_to = c(".value", "zone", "bin"),
names_pattern = "(zone_[a-z])_(bin[0-9])_(.*)",
names_sep = "_"
)
# Split habitat column into multiple rows
df_long <- df_long %>%
separate_rows(habitat, sep = ", ")
# Calculate proportion for each habitat within each station_id and bin combination
df_final <- df_long %>%
group_by(station_id, bin, habitat) %>%
summarise(proportion = sum(prop), .groups = 'drop')
为了帮助可视化,如果您打印“df_long”(见上文),它应该如下所示:
station_id zone bin prop habitat
<chr> <chr> <chr> <dbl> <chr>
1 SB_L004 zone_a bin1 0.137 Shrub
2 SB_L004 zone_a bin1 0.137 Wetland
3 SB_L004 zone_a bin1 0.137 Mixed forest
... more rows
修订我在 https://rdrr.io/snippets/ 尝试过这个,它运行没有错误,但我不确定最终分组是否正确:
# Load libraries
library(tidyverse)
library(stringr)
# Assign the structure to df
df <- structure(list(station_id = c("SB_L004", "SB_L005", "SB_P005",
"SB_P007", "SB_P008", "SB_P009", "SB_P015"), zone_a_bin1 = c("Shrub, Wetland, Mixed forest",
"Wetland, Deciduous forest", "Shrub, Wetland, Coniferous forest",
"Meadow, Wetland, Deciduous forest", "Grass/Lawn", "Grass/Lawn, Wetland",
"Grass/Lawn, Meadow"), zone_b_bin1 = c("Shrub, Wetland, Mixed forest",
"Wetland, Deciduous forest", NA, "Mixed forest", NA, NA, "Meadow, Mixed forest"
), zone_c_bin1 = c("Meadow, Wetland, Mixed forest", "Shrub, Wetland, Deciduous forest",
NA, NA, NA, NA, NA), zone_d_bin1 = c("Meadow, Shrub, Wetland, Coniferous forest",
NA, NA, NA, NA, NA, NA), zone_a_prop = c(0.1369, 0.2096, 1, 0.898,
1, 1, 0.5327), zone_b_prop = c(0.2433, 0.4, 0, 0.102, 0, 0, 0.4673
), zone_c_prop = c(0.1109, 0.3904, 0, 0, 0, 0, 0), zone_d_prop = c(0.5089,
0, 0, 0, 0, 0, 0)), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))
# Load libraries
library(tidyverse)
library(stringr)
# Assign the structure to df
# ... (same as before)
# Transform dataframe to long format - step 1 for "bin1" columns
df_long_bin1 <- df %>%
pivot_longer(
cols = ends_with("bin1"),
names_to = "zone",
names_prefix = "zone_",
values_to = "habitat"
) %>%
mutate(bin = "bin1")
# Transform dataframe to long format - step 2 for "prop" columns
df_long_prop <- df %>%
pivot_longer(
cols = ends_with("prop"),
names_to = "zone",
names_prefix = "zone_",
values_to = "prop"
) %>%
mutate(bin = "prop")
# Join the two dataframes together
df_long <- full_join(df_long_bin1, df_long_prop, by = c("station_id", "zone", "bin"))
# Split habitat column into multiple rows
df_long <- df_long %>%
separate_rows(habitat, sep = ", ")
# Filter out rows where bin is "prop" and habitat is NA
df_long <- df_long %>%
filter(!(bin == "prop" & is.na(habitat)))
# Calculate proportion for each habitat within each station_id and zone combination
df_final <- df_long %>%
group_by(station_id, zone, habitat) %>%
summarise(proportion = sum(prop, na.rm = TRUE), .groups = 'drop')
# Print the result
print(df_final)
结果看起来像这样,所以你可能需要调整上面的内容。
station_id zone habitat proportion
<chr> <chr> <chr> <dbl>
1 SB_L004 a_bin1 Mixed forest 0
2 SB_L004 a_bin1 Shrub 0
3 SB_L004 a_bin1 Wetland 0
4 SB_L004 b_bin1 Mixed forest 0
5 SB_L004 b_bin1 Shrub 0
6 SB_L004 b_bin1 Wetland 0
7 SB_L004 c_bin1 Meadow 0
8 SB_L004 c_bin1 Mixed forest 0
9 SB_L004 c_bin1 Wetland 0
10 SB_L004 d_bin1 Coniferous forest 0
# ... with 38 more rows
底线和重要的信息是,为了避免难以忍受的复杂代码,您需要将 df 从宽/短结构转换为窄/长结构。一旦数据被“规范化”,对数字进行分组/聚合的代码将简单得多。
评论
0赞
CHW
11/15/2023
感谢您的回复,尽管我无法让您的“pivot_longer”代码正常工作:“如果您提供多个名称,您还必须提供其中一个或”。为了尝试完成这项工作,我删除了最后一行('names_sep')并编辑了示例数据以具有相似的列名(现在格式为'(zone_[a-z}_(bin[0-9]))_(habitat or prop),错误是:“无法将<字符>和<double>组合在一起。此外,以“_prop”结尾的列适用于同一区域中的每个箱,尽管我可以单独处理每个箱。names_to
names_sep
names_pattern
zone_a_bin1_habitat
zone_a_bin1_prop
0赞
Adrian Maxwell
11/15/2023
还行。我有一个修改后的方法,但请注意,我不是在尝试编写最终代码,只是想指出,一旦你正确地布局了转换,就需要将你的数据从宽到窄(透视到非透视),聚合应该很简单。
1赞
CHW
11/16/2023
非常感谢 - 通过使用这个例子,我能够让一切正常。我通过重命名“区域”列,使它们在“_prop”和“_bin1”透视数据之间匹配,以及其他一些小的编辑,从而防止了全零的最终表。再次感谢!
0赞
Adrian Maxwell
11/16/2023
很高兴听到你得到了一个结果。现在,您将知道要避免聚合的宽布局。:-)
评论