R: Determine redundancies and unique values in a set of columns

Asked by: ltong Asked: 12/30/2022 Last edited by: ltong Updated: 12/30/2022 Views: 31

Q:

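I want to determine when the values in a set of columns are redundant and record this in a new column, Multi?, where 0 means only one value was seen and 1 means multiple values were seen. When the value "Unspecified" appears together with other values, I want the code to ignore it and evaluate the redundancy of the other values accordingly. When "Unspecified" is the only value in the set of columns, I want Multi? to record "Unspecified".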

Note that these four columns are only part of a larger database with many more columns.

To illustrate what I mean, here is an example input and output:

  headbleed_type_dx1 headbleed_type_dx2 headbleed_type_dx3 headbleed_type_dx4
1      Intracerebral      Intracerebral      Intracerebral               <NA>      
2      Intracerebral      Subarachnoid                <NA>           Subdural      
3        Unspecified      Intracerebral           Subdural      Intracerebral      
4        Unspecified               <NA>                <NA>               <NA>               
5               <NA>               <NA>                <NA>               <NA>               

If a row has 1 in Multi?, I would also like to record the number of unique values in a new column, Number:

  Multi?       Number
1 0            1
2 1            3
3 1            2
4 Unspecified  1
5 NA           NA 
r database dataframe redundancy


A:

1 upvote Martin Gal 12/30/2022 #1

This is really messy, and I strongly recommend not mixing numbers and characters in one column. That said, if you are open to a dplyr-based solution:

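library(dplyr)

data %>%
  rowwise() %>%  # evaluate each row on its own
  summarise(
    # number of distinct non-missing values across the four columns
    number = n_distinct(
      c_across(headbleed_type_dx1:headbleed_type_dx4),
      na.rm = TRUE),
    # TRUE if any column holds "Unspecified"; an NA result is treated as FALSE
    unspec = coalesce(
      any(c_across(headbleed_type_dx1:headbleed_type_dx4) == "Unspecified"),
      FALSE)) %>%
  mutate(
    # do not count "Unspecified" when other values are present; a count of 0 becomes NA
    number2 = if_else(number > 1L & unspec, number - 1L, na_if(number, 0)),
    # "Unspecified" is the only value seen (this answer labels it "Unspecific")
    multi = case_when(number == 1 & unspec ~ "Unspecific",
                      number2 == 1 ~ "0",
                      is.na(number2) ~ NA_character_,
                      TRUE ~ "1"),
    .keep = "none") %>%
  select(number = number2, multi)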

This returns

# A tibble: 6 × 2
  number multi     
   <int> <chr>     
1      1 0         
2      3 1         
3      2 1         
4      1 Unspecific
5     NA NA        
6      1 0       
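
Since the question mentions that the four columns sit inside a larger table, a base R variant that appends the result to the original data frame may also be useful. This is only a minimal sketch under a few assumptions: the helper name count_distinct_dx and the id column are hypothetical, the new column is called Multi rather than Multi? to keep the name syntactic, and the label is spelled "Unspecified" as in the question.

```r
# Hypothetical helper (not part of the answer above): for each row, count the
# distinct non-missing values in `cols`, ignoring `placeholder` whenever other
# values are present, and append the result to the original data frame.
count_distinct_dx <- function(df, cols, placeholder = "Unspecified") {
  rows <- lapply(seq_len(nrow(df)), function(i) {
    values <- as.character(unlist(df[i, cols], use.names = FALSE))
    seen <- unique(values[!is.na(values)])    # distinct non-missing values
    specific <- setdiff(seen, placeholder)    # same, without the placeholder
    if (length(seen) == 0L) {
      # nothing recorded in this row
      data.frame(Number = NA_integer_, Multi = NA_character_,
                 stringsAsFactors = FALSE)
    } else if (length(specific) == 0L) {
      # the placeholder is the only value in this row
      data.frame(Number = 1L, Multi = placeholder, stringsAsFactors = FALSE)
    } else {
      data.frame(Number = length(specific),
                 Multi = if (length(specific) > 1L) "1" else "0",
                 stringsAsFactors = FALSE)
    }
  })
  cbind(df, do.call(rbind, rows))
}

# The six example rows from this answer, plus a made-up id column standing in
# for the other columns of the larger table.
dx <- data.frame(
  id = 1:6,
  headbleed_type_dx1 = c("Intracerebral", "Intracerebral", "Unspecified",
                         "Unspecified", NA, "Intracerebral"),
  headbleed_type_dx2 = c("Intracerebral", "Subarachnoid", "Intracerebral",
                         NA, NA, "Unspecified"),
  headbleed_type_dx3 = c("Intracerebral", NA, "Subdural", NA, NA,
                         "Intracerebral"),
  headbleed_type_dx4 = c(NA, "Subdural", "Intracerebral", NA, NA, NA),
  stringsAsFactors = FALSE
)

result <- count_distinct_dx(dx, paste0("headbleed_type_dx", 1:4))
print(result[, c("id", "Number", "Multi")])
```

Looping over rows keeps the logic easy to follow but will be slow on very large tables; the rowwise() pipeline above has the same limitation.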

Data

data <- structure(list(headbleed_type_dx1 = c("Intracerebral", "Intracerebral", 
"Unspecified", "Unspecified", NA, "Intracerebral"), headbleed_type_dx2 = c("Intracerebral", 
"Subarachnoid", "Intracerebral", NA, NA, "Unspecified"), headbleed_type_dx3 = c("Intracerebral", 
NA, "Subdural", NA, NA, "Intracerebral"), headbleed_type_dx4 = c(NA, 
"Subdural", "Intracerebral", NA, NA, NA)), problems = structure(list(
    row = 1:4, col = c(NA_character_, NA_character_, NA_character_, 
    NA_character_), expected = c("4 columns", "4 columns", "4 columns", 
    "4 columns"), actual = c("5 columns", "5 columns", "5 columns", 
    "5 columns"), file = c("literal data", "literal data", "literal data", 
    "literal data")), row.names = c(NA, -4L), class = c("tbl_df", 
"tbl", "data.frame")), class = c("spec_tbl_df", "tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -6L), spec = structure(list(
    cols = list(headbleed_type_dx1 = structure(list(), class = c("collector_character", 
    "collector")), headbleed_type_dx2 = structure(list(), class = c("collector_character", 
    "collector")), headbleed_type_dx3 = structure(list(), class = c("collector_character", 
    "collector")), headbleed_type_dx4 = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1L), class = "col_spec"))

Comments

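0 upvotes ltong 12/30/2022
Thank you so much! Now, for a row like this: "Intracerebral" "Unspecific" "Intracerebral" NA, I get: Number: 1; Multi: "Unspecific". Is there a way to adjust the code so that it gives: Number: 1; Multi: 0?
0 upvotes Martin Gal 12/30/2022
That was indeed a bug. Please see the updated answer.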
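
Following the answer's advice against mixing numbers and text in one column, one possible layout keeps the 0/1 flag as an integer and moves the "only Unspecified" case into its own logical column. The sketch below is an assumption-laden illustration in base R, not the updated answer: the column names number, multi and unspecified_only are made up, and the first row is the case raised in the comments (with the value spelled "Unspecified" as in the data).

```r
# Small example; row 1 is the case raised in the comments.
dx <- data.frame(
  headbleed_type_dx1 = c("Intracerebral", "Intracerebral", "Unspecified", NA),
  headbleed_type_dx2 = c("Unspecified", "Subarachnoid", NA, NA),
  headbleed_type_dx3 = c("Intracerebral", NA, NA, NA),
  headbleed_type_dx4 = c(NA, "Subdural", NA, NA),
  stringsAsFactors = FALSE
)

m <- as.matrix(dx)  # character matrix, one row per record

# Distinct non-missing values per row, with and without the placeholder.
n_any <- apply(m, 1, function(v) length(unique(v[!is.na(v)])))
n_specific <- apply(m, 1, function(v) {
  length(unique(v[!is.na(v) & v != "Unspecified"]))
})

# Logical flag: the row contains "Unspecified" and nothing else.
dx$unspecified_only <- n_any > 0L & n_specific == 0L
# Count of distinct values, ignoring the placeholder unless it stands alone;
# NA when the row has no values at all.
dx$number <- ifelse(n_any == 0L, NA_integer_, pmax(n_specific, 1L))
# Integer 0/1 flag, NA when the row has no values at all.
dx$multi <- as.integer(dx$number > 1L)

print(dx[, c("number", "multi", "unspecified_only")])
```

With this layout, multi stays numeric, so it can be summed or averaged directly, and rows that contain only "Unspecified" can be filtered on unspecified_only. Whether such rows should carry multi = 0 or NA is a design choice; this sketch uses 0.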