Asked by: ltong | Asked: 12/30/2022 | Last edited by: ltong | Updated: 12/30/2022 | Views: 31
R: Determine redundancies and unique values in a set of columns
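Q:
I want to determine when the values in a set of columns are redundant and record this in a new column, multi?, where 0 means only one value is seen and 1 means multiple values are seen. When the value "Unspecified" appears together with other values, I want the code to ignore it and evaluate the redundancy of the other values accordingly. When "Unspecified" is the only value in the set of columns, I want the multi? column to record "Unspecified".
Note that these four columns are only part of a larger database with many more columns.
To illustrate what I mean, here is a sample input and output: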
  headbleed_type_dx1 headbleed_type_dx2 headbleed_type_dx3 headbleed_type_dx4
1 Intracerebral      Intracerebral      Intracerebral      <NA>
2 Intracerebral      Subarachnoid       <NA>               Subdural
3 Unspecified        Intracerebral      Subdural           Intracerebral
4 Unspecified        <NA>               <NA>               <NA>
5 <NA>               <NA>               <NA>               <NA>
If Multi? is 1 for a row, I would also like to record the number of unique values in a new column, Number.
  Multi?      Number
1 0           1
2 1           3
3 1           2
4 Unspecified 1
5 NA          NA
A:
1 upvote
Martin Gal
12/30/2022
#1
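This is really messy, and I would strongly advise against mixing numbers and characters in one column. That said, if you are open to a dplyr-based solution: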
library(dplyr)

data %>%
  rowwise() %>%
  summarise(
    # distinct non-missing values in the row ("Unspecified" is still counted here)
    number = n_distinct(
      c_across(headbleed_type_dx1:headbleed_type_dx4),
      na.rm = TRUE),
    # TRUE if any column equals "Unspecified"; an NA result is turned into FALSE
    unspec = coalesce(
      any(c_across(headbleed_type_dx1:headbleed_type_dx4) == "Unspecified"),
      FALSE)) %>%
  mutate(
    # drop "Unspecified" from the count when other values exist; a count of 0 becomes NA
    number2 = if_else(number > 1L & unspec, number - 1L, na_if(number, 0)),
    multi = case_when(number == 1 & unspec ~ "Unspecific",
                      number2 == 1 ~ "0",
                      is.na(number2) ~ NA_character_,
                      TRUE ~ "1"),
    .keep = "none") %>%
  select(number = number2, multi)
This returns
# A tibble: 6 × 2
  number multi
   <int> <chr>
1      1 0
2      3 1
3      2 1
4      1 Unspecific
5     NA NA
6      1 0
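Since the question mentions that the four columns sit inside a larger table, here is a minimal base R sketch of the same rule that adds the results as new columns instead of replacing the table. It is not the answer's dplyr pipeline: count_bleeds(), df, dx_cols and patient_id are hypothetical names made up for illustration, and the label used is "Unspecified" as written in the question.

```r
# Hypothetical example table: the four diagnosis columns plus one unrelated
# column (patient_id) standing in for the rest of a larger table.
df <- data.frame(
  patient_id = 1:4,
  headbleed_type_dx1 = c("Intracerebral", "Unspecified", NA, "Intracerebral"),
  headbleed_type_dx2 = c("Unspecified", NA, NA, "Subarachnoid"),
  headbleed_type_dx3 = c("Intracerebral", NA, NA, NA),
  headbleed_type_dx4 = c(NA, NA, NA, "Subdural"),
  stringsAsFactors = FALSE
)
dx_cols <- paste0("headbleed_type_dx", 1:4)

# Hypothetical helper: number of distinct values in one row, ignoring
# "Unspecified" unless it is the only value present.
count_bleeds <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_integer_)   # nothing recorded in this row
  specific <- setdiff(x, "Unspecified")     # unique values other than "Unspecified"
  if (length(specific) == 0) 1L else length(specific)
}

# Hypothetical helper: TRUE when "Unspecified" is the only value in the row.
only_unspecified <- function(x) {
  x <- x[!is.na(x)]
  length(x) > 0 && all(x == "Unspecified")
}

df$number <- apply(df[dx_cols], 1, count_bleeds)
unspec_only <- apply(df[dx_cols], 1, only_unspecified)

# Mixed character column, as requested in the question.
df$multi <- ifelse(is.na(df$number), NA_character_,
            ifelse(unspec_only, "Unspecified",
            ifelse(df$number > 1, "1", "0")))

print(df[c("patient_id", "number", "multi")])
```

Because the new columns are assigned onto df, every other column (here patient_id) is left untouched.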
Data
data <- structure(list(headbleed_type_dx1 = c("Intracerebral", "Intracerebral",
"Unspecified", "Unspecified", NA, "Intracerebral"), headbleed_type_dx2 = c("Intracerebral",
"Subarachnoid", "Intracerebral", NA, NA, "Unspecified"), headbleed_type_dx3 = c("Intracerebral",
NA, "Subdural", NA, NA, "Intracerebral"), headbleed_type_dx4 = c(NA,
"Subdural", "Intracerebral", NA, NA, NA)), problems = structure(list(
row = 1:4, col = c(NA_character_, NA_character_, NA_character_,
NA_character_), expected = c("4 columns", "4 columns", "4 columns",
"4 columns"), actual = c("5 columns", "5 columns", "5 columns",
"5 columns"), file = c("literal data", "literal data", "literal data",
"literal data")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame")), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -6L), spec = structure(list(
cols = list(headbleed_type_dx1 = structure(list(), class = c("collector_character",
"collector")), headbleed_type_dx2 = structure(list(), class = c("collector_character",
"collector")), headbleed_type_dx3 = structure(list(), class = c("collector_character",
"collector")), headbleed_type_dx4 = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
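On the advice above about not mixing numbers and characters in one column: one possible alternative, sketched here in base R, is to split the mixed multi column into two typed columns. The data frame res below is hypothetical and only mimics the shape of the answer's output; multi_flag and only_unspecified are made-up column names.

```r
# Hypothetical result in the same shape as the answer's output.
res <- data.frame(
  number = c(1L, 3L, 2L, 1L, NA, 1L),
  multi  = c("0", "1", "1", "Unspecific", NA, "0"),
  stringsAsFactors = FALSE
)

# TRUE only for rows whose single value was the unspecified label.
res$only_unspecified <- !is.na(res$multi) & res$multi == "Unspecific"

# Logical flag: TRUE for "1", FALSE for "0", NA for everything else.
res$multi_flag <- ifelse(res$multi %in% c("0", "1"), res$multi == "1", NA)

print(res)

# With a logical column, counting rows with multiple values is direct.
sum(res$multi_flag, na.rm = TRUE)
```

Both new columns are plain logical vectors, so they can be summed or filtered without first parsing strings.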
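Comments
0 upvotes
ltong
12/30/2022
Thank you so much! Now, for a row like this: "Intracerebral" "Unspecific" "Intracerebral" NA, I get: Number: 1; Multi: "Unspecific". Is there a way to adjust the code so that it gives: Number: 1; Multi: 0?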
0 upvotes
Martin Gal
12/30/2022
That was indeed a bug. Please see the updated answer.