Asked by: Dubhe Asked: 11/17/2023 Last edited by: jpsmith, Dubhe Updated: 11/17/2023 Views: 61
Turning multiple 0/1 variables into one in R [duplicate]
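Q:
I have a data frame based on a questionnaire. For one question (highest level of education), it uses a dozen binary variables, one per possible answer. How can I turn these binary variables into a single new variable that contains only the name of the highest degree? (The binary variables appear to be mutually exclusive, but I don't know how to check that efficiently across nearly 38k rows.)
I searched Google for a solution, but could not find anything that actually collapses several binary variables into one by naming the highest degree, rather than simply adding all the variables together.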
A:
0 votes
Léa
11/17/2023
#1
Using the tidyverse, here are two possible solutions, depending on whether the variables are mutually exclusive.
If they are mutually exclusive
library(tidyverse)
set.seed(123)
df_exclusive = data.frame("Id" = 1:100) %>%
  rowwise() %>%
  mutate(
    "HighDegree.Niv1" = sample(c(0, 1), size = 1),
    "HighDegree.Niv2" = ifelse(!HighDegree.Niv1, sample(c(0, 1), size = 1), 0),
    "HighDegree.Niv3" = ifelse(!(HighDegree.Niv1 | HighDegree.Niv2), sample(c(0, 1), size = 1), 0),
    "HighDegree.Niv4" = ifelse(!(HighDegree.Niv1 | HighDegree.Niv2 | HighDegree.Niv3), sample(c(0, 1), size = 1), 0),
    "HighDegree.Niv5" = ifelse(!(HighDegree.Niv1 | HighDegree.Niv2 | HighDegree.Niv3 | HighDegree.Niv4), 1, 0)
  )
rowSums(df_exclusive[,-1]) # Exclusive: only 1 high degree per Id
df_exclusive %>%
  pivot_longer(cols = !Id,
               names_to = "HighDegree",
               names_prefix = "HighDegree.") %>%
  filter(value == 1)
This is also the solution proposed by @andrew-gustar.
If they are not mutually exclusive
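Since the question asks how to verify exclusivity across roughly 38k rows, here is a minimal base R sketch of one way to do it. The toy data frame and the names degree_cols, n_checked and bad_ids are made up for illustration; only the HighDegree. prefix is taken from the simulated data above.

```r
# Toy data (hypothetical): row 4 has two answers ticked, so it is not exclusive.
df <- data.frame(
  Id = 1:4,
  HighDegree.Niv1 = c(1, 0, 0, 0),
  HighDegree.Niv2 = c(0, 1, 0, 1),
  HighDegree.Niv3 = c(0, 0, 1, 1)
)

# Select the binary columns by their shared prefix.
degree_cols <- grep("^HighDegree\\.", names(df), value = TRUE)

# Number of answers ticked per row; exclusivity means every value is exactly 1.
n_checked <- rowSums(df[degree_cols])

# Distribution of counts: anything other than 1 points to a problem row.
table(n_checked)

# Ids of the rows that violate exclusivity (0 answers or more than 1).
bad_ids <- df$Id[n_checked != 1]
bad_ids
```

rowSums is vectorised, so this should stay fast on a data frame of 38k rows. If bad_ids is empty, the filter(value == 1) approach above keeps exactly one row per Id.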
library(tidyverse)
set.seed(123)
df_nonexclusive = matrix(
  sample(c(0, 1), size = 100 * 5, replace = TRUE),
  ncol = 5,
  nrow = 100,
  dimnames = list(1:100, paste0("HighDegree.Niv", 1:5))
) %>%
  as.data.frame() %>%
  rownames_to_column("Id")
rowSums(df_nonexclusive[,-1]) # Not exclusive: more than 1 high degree per Id
df_nonexclusive %>%
  pivot_longer(cols = !Id,
               names_to = "HighDegree",
               names_prefix = "HighDegree.") %>%
  mutate(HighDegreeOrdered = factor(
    ifelse(value, HighDegree, 0),
    levels = c(0, "Niv1", "Niv2", "Niv3", "Niv4", "Niv5"),
    ordered = TRUE
  )) %>%
  group_by(Id) %>%
  summarise("HigherDegree" = max(HighDegreeOrdered))
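For this case, here are more details about the code:
I first simulate the data, since you did not include a reproducible data sample. The first two rows of df_nonexclusive look like this: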
  Id HighDegree.Niv1 HighDegree.Niv2 HighDegree.Niv3 HighDegree.Niv4 HighDegree.Niv5
1  1               0               0               1               0               1
2  2               0               1               0               1               0
I use pivot_longer to pivot the data into long format. The data frame now looks like this:
# A tibble: 500 × 3
Id HighDegree value
<int> <chr> <dbl>
1 1 Niv1 0
2 1 Niv2 0
3 1 Niv3 1
4 1 Niv4 0
5 1 Niv5 0
6 2 Niv1 1
7 2 Niv2 0
8 2 Niv3 0
9 2 Niv4 0
10 2 Niv5 0
# ℹ 490 more rows
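I create a new column of class "ordered factor". It takes the value of HighDegree if value is 1, and 0 otherwise. Its levels are ordered by degree level. It looks like this: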
# A tibble: 500 × 4
Id HighDegree value HighDegreeOrdered
<chr> <chr> <dbl> <ord>
1 1 Niv1 0 0
2 1 Niv2 0 0
3 1 Niv3 1 Niv3
4 1 Niv4 0 0
5 1 Niv5 1 Niv5
6 2 Niv1 0 0
7 2 Niv2 1 Niv2
8 2 Niv3 0 0
9 2 Niv4 1 Niv4
10 2 Niv5 0 0
# ℹ 490 more rows
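The ordered factor matters because max() then compares by level position rather than alphabetically. A minimal base R sketch of that behaviour, using made-up values and the hypothetical names lvls, x and highest:

```r
# Levels listed from lowest to highest; "0" stands for "answer not ticked".
lvls <- c("0", "Niv1", "Niv2", "Niv3")

# One respondent's long-format values: Niv1 and Niv3 ticked, the rest not.
x <- factor(c("Niv1", "0", "Niv3", "0"), levels = lvls, ordered = TRUE)

# max() is defined for ordered factors and returns the highest level present.
highest <- max(x)
as.character(highest)

# On an unordered factor, max() raises an error instead of guessing an order.
unordered <- factor(c("Niv1", "Niv3"), levels = lvls)
res <- try(max(unordered), silent = TRUE)
inherits(res, "try-error")
```

This is why the levels vector in the answer lists 0 first: any ticked level then outranks an unticked one when the maximum is taken per Id.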
Finally, I group by Id and summarise by keeping the maximum value of the HighDegreeOrdered column. I end up with this result:
# A tibble: 100 × 2
Id HigherDegree
<chr> <ord>
1 1 Niv5
2 10 Niv3
3 100 Niv4
4 11 Niv4
5 12 Niv3
6 13 Niv5
7 14 Niv5
8 15 Niv4
9 16 Niv4
10 17 Niv5
# ℹ 90 more rows
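As an alternative sketch that skips the reshape, base R's max.col can pick the last column holding a 1 in each row. This is not part of the answer above; it assumes the binary columns are already arranged from lowest to highest degree, and the toy data and the names m, idx and lvls are made up for illustration.

```r
# Toy data (hypothetical): respondent "c" ticked nothing.
df <- data.frame(
  Id = c("a", "b", "c"),
  HighDegree.Niv1 = c(1, 0, 0),
  HighDegree.Niv2 = c(1, 1, 0),
  HighDegree.Niv3 = c(0, 1, 0)
)

# 0/1 matrix of the degree columns, assumed ordered from lowest to highest.
m <- as.matrix(df[, -1])

# Index of the last column containing the row maximum, i.e. the highest ticked level.
idx <- max.col(m, ties.method = "last")

# A row of all zeros has maximum 0, so its index is meaningless: mark it missing.
idx[rowSums(m) == 0] <- NA

# Strip the shared prefix to get the level names, then build an ordered factor.
lvls <- sub("^HighDegree\\.", "", colnames(m))
df$HigherDegree <- factor(lvls[idx], levels = lvls, ordered = TRUE)
df[, c("Id", "HigherDegree")]
```

Unlike the pivot_longer version, respondents with no answer ticked get NA here rather than the level 0, and Id keeps its original row order.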