Turning multiple 0/1 variables into one in R [duplicate]

Asked by: Dubhe  Asked: 11/17/2023  Last edited by: jpsmith, Dubhe  Updated: 11/17/2023  Views: 61

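Q:

I have a data frame based on a questionnaire. For one question (highest level of education), it uses a dozen binary variables, one for each possible answer. How can I turn these binary variables into a new variable that contains only the name of the highest degree? (The binary variables appear to be mutually exclusive, but I don't know how to check that efficiently across nearly 38k instances.)

I googled for a solution but couldn't find anything that actually turns these multiple binary variables into one variable by naming the highest degree, rather than simply adding all the variables together.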

r

Comments

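1 upvote jpsmith 11/17/2023
The premise for solving this is confirming whether these columns are mutually exclusive. You can check this by using rowSums across the desired columns and seeing whether any row has a value of two or more. You may also want to edit your question to include reproducible sample data, to get better and faster help. Good luck!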
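The check suggested in this comment could look roughly like the sketch below. The data frame `survey` and its column names are invented for illustration; only the use of `rowSums` comes from the comment. Note that `rowSums` returns NA for any row containing an NA unless `na.rm = TRUE` is passed.

```r
# Toy questionnaire data (hypothetical; stands in for the real ~38k-row data frame).
survey <- data.frame(
  Id              = 1:4,
  HighDegree.Niv1 = c(1, 0, 0, 0),
  HighDegree.Niv2 = c(0, 1, 1, 0),
  HighDegree.Niv3 = c(0, 0, 1, 0)
)

# Pick the indicator columns by their shared prefix (an assumed naming scheme).
degree_cols <- grep("^HighDegree\\.", names(survey), value = TRUE)

# Number of answers ticked per respondent.
n_ticked <- rowSums(survey[degree_cols])

# Tabulate the counts: any count above 1 means the columns are not exclusive.
table(n_ticked)

# Ids with more than one answer, and Ids with no answer at all.
multi_ids <- survey$Id[n_ticked > 1]
none_ids  <- survey$Id[n_ticked == 0]
```

On the toy data, `multi_ids` holds the respondent who ticked two levels and `none_ids` the one who ticked none; both cases are worth inspecting before collapsing the columns.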
1 upvote Andrew Gustar 11/17/2023
If they are mutually exclusive, you can pivot the data to long format and then filter to keep only the 1s.

A:

0 upvotes Léa 11/17/2023 #1

Using the tidyverse, here are two possible solutions, depending on whether the data are mutually exclusive.

If they are mutually exclusive

library(tidyverse)
set.seed(123)

# Simulate 100 respondents with exactly one level ticked each
df_exclusive <- data.frame(Id = 1:100) %>%
  rowwise() %>%
  mutate(
    HighDegree.Niv1 = sample(c(0, 1), size = 1),
    HighDegree.Niv2 = ifelse(!HighDegree.Niv1, sample(c(0, 1), size = 1), 0),
    HighDegree.Niv3 = ifelse(!(HighDegree.Niv1 | HighDegree.Niv2), sample(c(0, 1), size = 1), 0),
    HighDegree.Niv4 = ifelse(!(HighDegree.Niv1 | HighDegree.Niv2 | HighDegree.Niv3), sample(c(0, 1), size = 1), 0),
    HighDegree.Niv5 = ifelse(!(HighDegree.Niv1 | HighDegree.Niv2 | HighDegree.Niv3 | HighDegree.Niv4), 1, 0)
  ) %>%
  ungroup() # drop the rowwise grouping once the simulation is done
rowSums(df_exclusive[, -1]) # Exclusive: exactly 1 high degree per Id


# Pivot to long format, then keep only the level that was ticked
df_exclusive %>%
  pivot_longer(cols = !Id,
               names_to = "HighDegree",
               names_prefix = "HighDegree.") %>%
  filter(value == 1)

This is also the solution suggested by @andrew-gustar.

If they are not mutually exclusive
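For the exclusive case, a base R alternative that avoids pivoting is sketched below. It is not part of the answer: it swaps in `max.col()`, and the data frame `df` and its column names are made up for illustration. The shortcut is only valid when every row has exactly one 1, which the guard checks first.

```r
# Toy data with exactly one 1 per row (hypothetical column names).
df <- data.frame(
  Id              = 1:3,
  HighDegree.Niv1 = c(0, 1, 0),
  HighDegree.Niv2 = c(1, 0, 0),
  HighDegree.Niv3 = c(0, 0, 1)
)

degree_cols <- grep("^HighDegree\\.", names(df), value = TRUE)
m <- as.matrix(df[degree_cols])

# Guard: stop unless every row has exactly one 1.
stopifnot(all(rowSums(m) == 1))

# max.col() gives the column index of each row's maximum, i.e. of the single 1.
idx <- max.col(m, ties.method = "first")

# Map the index back to a level name by stripping the shared prefix.
df$HighDegree <- sub("^HighDegree\\.", "", degree_cols[idx])
df[c("Id", "HighDegree")]
```

Unlike the long-format approach, this keeps one row per respondent and adds the new variable directly to the original data frame.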

library(tidyverse)
set.seed(123)

# Simulate 100 respondents who may have ticked several levels
df_nonexclusive <- matrix(
  sample(c(0, 1), size = 100 * 5, replace = TRUE),
  ncol = 5,
  nrow = 100,
  dimnames = list(1:100, paste0("HighDegree.Niv", 1:5))
) %>%
  as.data.frame() %>%
  rownames_to_column("Id") # Id is stored as character
rowSums(df_nonexclusive[, -1]) # Not exclusive: can be more than 1 high degree per Id


# Long format, ordered factor per row, then the highest level per Id
df_nonexclusive %>%
  pivot_longer(cols = !Id,
               names_to = "HighDegree",
               names_prefix = "HighDegree.") %>%
  mutate(HighDegreeOrdered = factor(
    ifelse(value == 1, HighDegree, "0"), # "0" marks a level that was not ticked
    levels = c("0", "Niv1", "Niv2", "Niv3", "Niv4", "Niv5"),
    ordered = TRUE
  )) %>%
  group_by(Id) %>%
  summarise(HigherDegree = max(HighDegreeOrdered))

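For this case, here are more details about the code:

I first simulate the data, since you did not include a reproducible data sample. The first two rows of df_nonexclusive look like this: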

     Id HighDegree.Niv1 HighDegree.Niv2 HighDegree.Niv3 HighDegree.Niv4 HighDegree.Niv5
1     1               0               0               1               0               1
2     2               0               1               0               1               0

I use pivot_longer to pivot the data. The data frame now looks like this:

# A tibble: 500 × 3
   Id    HighDegree value
   <chr> <chr>      <dbl>
 1 1     Niv1           0
 2 1     Niv2           0
 3 1     Niv3           1
 4 1     Niv4           0
 5 1     Niv5           1
 6 2     Niv1           0
 7 2     Niv2           1
 8 2     Niv3           0
 9 2     Niv4           1
10 2     Niv5           0
# ℹ 490 more rows

I create a new column of class "ordered factor". It takes the value of HighDegree if value is 1, and 0 otherwise. It is ordered by degree level. It looks like this:

# A tibble: 500 × 4
   Id    HighDegree value HighDegreeOrdered
   <chr> <chr>      <dbl> <ord>            
 1 1     Niv1           0 0                
 2 1     Niv2           0 0                
 3 1     Niv3           1 Niv3             
 4 1     Niv4           0 0                
 5 1     Niv5           1 Niv5             
 6 2     Niv1           0 0                
 7 2     Niv2           1 Niv2             
 8 2     Niv3           0 0                
 9 2     Niv4           1 Niv4             
10 2     Niv5           0 0                
# ℹ 490 more rows
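To see why the ordered factor matters, here is a small base R sketch using the five values shown for Id 1 in the table above. The variable names `lvls`, `id1` and `highest` are mine, not the answer's.

```r
# Levels listed from lowest to highest; "0" means "this level was not ticked".
lvls <- c("0", "Niv1", "Niv2", "Niv3", "Niv4", "Niv5")

# The five HighDegreeOrdered values for Id 1.
id1 <- factor(c("0", "0", "Niv3", "0", "Niv5"), levels = lvls, ordered = TRUE)

# For an ordered factor, comparisons and max() follow the level order.
niv3_below_niv5 <- id1[3] < id1[5]

# The highest level ticked by this respondent.
highest <- max(id1)
as.character(highest)
```

With a plain (unordered) factor, `max()` stops with an error, because R has no ranking to compare the levels by. A respondent who ticked nothing would get "0" as the maximum.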

Finally, I group by Id and summarise by keeping the maximum value in the HighDegreeOrdered column. I end up with this result:

# A tibble: 100 × 2
   Id    HigherDegree
   <chr> <ord>       
 1 1     Niv5        
 2 10    Niv3        
 3 100   Niv4        
 4 11    Niv4        
 5 12    Niv3        
 6 13    Niv5        
 7 14    Niv5        
 8 15    Niv4        
 9 16    Niv4        
10 17    Niv5        
# ℹ 90 more rows
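The rows above come out in the order 1, 10, 100, 11 because rownames_to_column() stores Id as character, so the Ids are sorted as text. If numeric order is wanted, one option is to convert Id back to integer and reorder. The sketch below uses a few rows of the result typed in by hand; the name `res` is hypothetical.

```r
# A few rows of the result above, entered by hand for illustration.
res <- data.frame(
  Id           = c("1", "10", "100", "11", "12"),
  HigherDegree = c("Niv5", "Niv3", "Niv4", "Niv4", "Niv3")
)

# Convert the character Id to integer, then reorder the rows by it.
res$Id <- as.integer(res$Id)
res <- res[order(res$Id), ]
res$Id
```

With real questionnaire data the Id column is usually numeric already, in which case this step is unnecessary.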