Reshape data to long format based on the first letter of the variable names

Asked by: cliu Asked: 4/11/2021 Last edited by: cliu Updated: 4/12/2021 Views: 129

Q:

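I am trying to reshape my data to long format based on the first letter of the variable names. I have data from mothers and fathers, which are indicated by the first letter of each variable name ("m" or "f"), as in the following dataset: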

toydat <- data.frame(id=1:10,
           mincome=rep(sample(1:5), 2),
           medu=rep(sample(1:5), 2),
           methnicity=rep(sample(1:5), 2),
           fincome=rep(sample(1:5), 2),
           fedu=rep(sample(1:5), 2),
           fethnicity=rep(sample(1:5), 2)
)

Ultimately, the data should look like this:

 gender income   edu ethnicity 
 mother      3     4         3
 mother      2     2         4
 mother      5     3         2
 mother      3     4         2
 mother      4     3         3
 mother      2     2         1
 mother      3     3         4
 mother      4     4         4
 mother      3     3         5
 mother      2     2         1
 father      5     5         2
 father      3     3         3
 father      4     2         2
 father      2     2         4
 father      3     1         5
 father      4     4         1
 father      4     5         2
 father      3     2         3
 father      3     3         2
 father      1     2         1

Any help would be greatly appreciated!

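Edit: Thanks to @akrun, my original question has been solved. I am now wondering what to do if the gender indicator "m" or "f" is at the end of the name instead. How should the regular expression for names_sep be written in that case?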

With the following code, the gender variable is created, but the variables are not split.

toydat %>% 
     select(-id) %>% 
     pivot_longer(cols = everything(), 
                  names_to = c(".value", "gender"), 
                  names_sep = "(<=[a-z])(?=[mf]$)") %>%
     mutate(gender = case_when(gender == 'm' ~ 'mother', TRUE ~ 'father'))
# A tibble: 10 x 7
   gender mincome  medu methnicity fincome  fedu fethnicity
   <chr>    <int> <int>      <int>   <int> <int>      <int>
 1 father       1     3          4       5     5          5
 2 father       5     4          3       3     1          4
 3 father       3     2          2       1     4          2
 4 father       2     1          1       4     2          1
 5 father       4     5          5       2     3          3
 6 father       1     3          4       5     5          5
 7 father       5     4          3       3     1          4
 8 father       3     2          2       1     4          2
 9 father       2     1          1       4     2          1
10 father       4     5          5       2     3          3
Tags: r, reshape, data-manipulation


A:

3 votes akrun 4/11/2021 #1

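We remove the "id" column and then pivot all remaining columns to long format. names_sep is given a regex built from lookarounds, so the split falls between the "m" or "f" at the start (^) of the string and the next letter. Finally, we recode the "gender" column with case_when, changing "m" to "mother" and "f" to "father".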

library(dplyr)
library(tidyr)
toydat %>% 
  select(-id) %>% 
  pivot_longer(cols = everything(), 
               names_to = c("gender", ".value"), 
               names_sep = "(?<=^[mf])(?=[a-z])") %>%
  mutate(gender = case_when(gender == 'm' ~ 'mother', TRUE ~ 'father'))

-output

# A tibble: 20 x 4
#   gender income   edu ethnicity
#   <chr>   <int> <int>     <int>
# 1 mother      3     5         3
# 2 father      4     5         5
# 3 mother      4     3         5
# 4 father      3     1         1
# 5 mother      2     1         2
# 6 father      2     3         3
# 7 mother      1     2         1
# 8 father      5     2         4
# 9 mother      5     4         4
#10 father      1     4         2
#11 mother      3     5         3
#12 father      4     5         5
#13 mother      4     3         5
#14 father      3     1         1
#15 mother      2     1         2
#16 father      2     3         3
#17 mother      1     2         1
#18 father      5     2         4
#19 mother      5     4         4
#20 father      1     4         2
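
To make the position of the split more concrete, here is a minimal base R sketch that marks where the lookaround regex used in names_sep matches. It does not call pivot_longer; the "|" marker and the vector name nms are assumptions made only for illustration.

```r
# Column names from the toy data, with 'id' already dropped
nms <- c("mincome", "medu", "methnicity", "fincome", "fedu", "fethnicity")

# (?<=^[mf]) : the position must be preceded by an 'm' or 'f' at the start
# (?=[a-z])  : the position must be followed by a lowercase letter
# The match is zero-width, so no character is consumed.
# perl = TRUE is required for lookarounds in base R regex functions.
marked <- sub("(?<=^[mf])(?=[a-z])", "|", nms, perl = TRUE)

print(marked)
```

The part before the marker would go to "gender" and the part after it to ".value", which matches the order given in names_to.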

The output values differ from the expected ones because the OP did not call set.seed before using sample when constructing the input example.
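
A reproducible version of the input could look like the sketch below. The helper name make_toydat and the seed value 2021 are assumptions for illustration; any fixed seed would do, and the values will still not match the ones posted in the question.

```r
# Hypothetical helper: rebuilds the toy data from a fixed seed so that
# sample() returns the same permutations on every run.
make_toydat <- function(seed) {
  set.seed(seed)
  data.frame(id = 1:10,
             mincome = rep(sample(1:5), 2),
             medu = rep(sample(1:5), 2),
             methnicity = rep(sample(1:5), 2),
             fincome = rep(sample(1:5), 2),
             fedu = rep(sample(1:5), 2),
             fethnicity = rep(sample(1:5), 2))
}

# 2021 is an arbitrary seed chosen only for this example
toydat <- make_toydat(2021)
```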


For the edited part, we also switch the order of the elements in names_to and change the regex lookarounds in names_sep.

# // change the column names by rearranging the 'm|f'
# // at the end of the column name
names(toydat)[-1] <- sub("^(.)(.*)", "\\2\\1", names(toydat)[-1]) 
toydat %>% 
  select(-id) %>% 
  pivot_longer(cols = everything(), 
               names_to = c(".value", "gender"), 
               names_sep = "(?<=[a-z])(?=[mf]$)") %>%
  mutate(gender = case_when(gender == 'm' ~ 'mother', TRUE ~ 'father'))

-output

# A tibble: 20 x 4
#   gender income   edu ethnicity
#   <chr>   <int> <int>     <int>
# 1 mother      1     2         1
# 2 father      5     5         1
# 3 mother      5     4         3
# 4 father      4     4         2
# 5 mother      3     3         4
# 6 father      2     2         4
# 7 mother      4     5         2
# 8 father      3     1         3
# 9 mother      2     1         5
#10 father      1     3         5
#11 mother      1     2         1
#12 father      5     5         1
#13 mother      5     4         3
#14 father      4     4         2
#15 mother      3     3         4
#16 father      2     2         4
#17 mother      4     5         2
#18 father      3     1         3
#19 mother      2     1         5
#20 father      1     3         5
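
As a rough check of why the attempt in the question's edit did not split anything, the base R sketch below compares the two patterns on suffix-style names. Without the "?", the group (<=[a-z]) is an ordinary capturing group that looks for the literal characters "<=" followed by a letter, so it finds nothing in these names. The vector names and the "|" marker are assumptions for illustration only.

```r
# Suffix-style names, as produced by moving 'm'/'f' to the end
nms <- c("incomem", "edum", "ethnicitym", "incomef", "eduf", "ethnicityf")

# Pattern from the question's edit: "(<=[a-z])" is NOT a lookbehind
typo_hits <- grepl("(<=[a-z])(?=[mf]$)", nms, perl = TRUE)

# Corrected pattern: "(?<=[a-z])" is a zero-width lookbehind
fixed_hits <- grepl("(?<=[a-z])(?=[mf]$)", nms, perl = TRUE)

# Mark the split position found by the corrected pattern
marked <- sub("(?<=[a-z])(?=[mf]$)", "|", nms, perl = TRUE)

print(typo_hits)
print(fixed_hits)
print(marked)
```

With no split position found, the names stay whole, which is consistent with the unsplit columns in the output posted in the question.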

Comments

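1 vote cliu 4/11/2021
Thank you @akrun!
0 votes cliu 4/11/2021
Hi @akrun, just a follow-up question: what if the "m" or "f" is at the end of the name? How should the regular expression for names_sep be written?
0 votes akrun 4/12/2021
@cliu In that case, you do need names_sep = "(<=[a-z])(?=[mf]$)") and you also need to change names_to = c( ".value", "gender)
0 votes cliu 4/12/2021
Thanks for the code, @akrun. I tried it, but the variables are not split. Please see my edit to the question.
1 vote akrun 4/12/2021
@cliu If you check my update, it fixes that typo.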