Asked by: Mohammad Haddadi Asked: 8/13/2023 Last edited by: Mohammad Haddadi Updated: 8/15/2023 Views: 107
How to match data in R based on a specific variable
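Q:
I have a sample file containing 2% of the total population, drawn from Iran's 2016 census (1.5 million records out of a total population of 75 million). Below I give 22 individuals as an example: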
sample <- structure(list(household.ID = c(16523634, 16523634, 16523634, 16523634,16525912,
16525912, 16540127,16540127, 16598050, 16598050, 16611764,16611764, 16611764, 16643309,
16643309, 16652356, 16652356,16652356, 16672105, 16672105, 16672105,16672105
),Member.ID= c(16527193, 16529443, 16532250, 16534992,16527527, 16529230,
16542499,16545263, 16616975, 16620223, 16633984,16642611, 16650837, 16646986, 16650210,
16660335, 16665128,16668381, 16676674, 16681528, 16685073,16687491
),Relatshinship= c(1,2,3,3,1,2,1,2,1,3,1,2,3,1,2,1,2,3,1,2,3,3),birth.year=
c(1346,1348,1376,1377,1357,1367,1316,1319,1329,1374,1339,1342,1367,1343,1336 ,1321
,1326,1367,1338,1352,1372,1381),Gender = c(1,2 ,1,2,1,2,1 ,2,1 ,1,1 ,2,1 ,1 ,2 ,1,2 ,2,1
,2,1,1),age = c(49,47,19 ,18,38,28,78,75,66 ,21,56 ,52 ,28,51 ,58 ,74 ,68 ,27 ,56
,43,23 ,13),marriage.stuatus= c(1,1 ,4 ,4 ,1,1,1,1,2,4,1,1 ,1,1 ,1,1,1 ,4,1 ,1,4 ,4),
number.of.children.ever.born= c(NA,2,NA,NA,NA,NA,NA,6,NA,NA,NA,2,NA,NA,3,NA,3,NA,NA,2,NA,NA),
number.of.living.children = c(NA,2,NA,NA,NA,NA,NA,4,NA,NA,NA,2,NA,NA,3,NA,3,NA,NA,2,NA, NA)),
row.names = c(NA, -22L),class = "data.frame")
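I want to build a birth history for women. To do that, I need to match children with their mothers. My data has a column giving each individual's relationship to the head of household: code 1 is the head of household, 2 is the head's wife, 3 is a child, 4 is a son-in-law or daughter-in-law, 5 is a grandchild, 6 is a father or mother, and so on. For example, the first household in my data (ID: 16523634) has 4 members: the head of household, here a man (Gender code: 1 = male, 2 = female), his wife, and two children, a son (age 19) and a daughter (age 18). Long story short, I need to match the children (who also appear in the data) to their mother within the household, so that for each mother I can create columns holding the age of each matched child. I would like my data to end up looking like this: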
H.ID | M.ID | Birth year | Sex | Age | 1st child | 2nd child | 3rd child |
---|---|---|---|---|---|---|---|
16523634 | 16529443 | 1348 | 2 | 47 | 19 | 18 | NA |
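A:
Starting data
When I loaded `sample` from the OP, two of the vectors in the data frame had length 21 only, so I added a few values just to get a valid starting point. Here is what I used: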
library(tidyverse)
sample <-
tibble(
household.ID = c(
16523634,16523634,16523634,16523634,16525912,16525912,
16540127,16540127,16598050,16598050,16611764,16611764,
16611764,16643309,16643309,16652356,16652356,16652356,
16672105,16672105,16672105,16672105),
Member.ID= c(16527193, 16529443, 16532250, 16534992,16527527, 16529230,
16542499,16545263, 16616975, 16620223, 16633984,16642611,
16650837, 16646986, 16650210, 16660335, 16665128,16668381,
16676674, 16681528, 16685073,16687491),
Relatshinship = c(1,2,3,3,1,2,1,2,1,3,1,2,3,1,2,1,2,3,1,2,3,3),
birth.year = c(1346,1348,1376,1377,1357,1367,1316,1319,1329,1374,1339,
1342,1367,1343,1336 ,1321,1326,1367,1338,1352,1372,1381),
Gender = c(1,2,1,2,1,2,1,2,1,1,1,2,1,1,2,1,2,2,1,2,1,1),
age = c(49,47,19,18,38,28,78,75,66,21,56,52,28,51,58,74,68,27,56,43,23,13),
marriage.stuatus= c(1,1,4,4,1,1,1,1,2,4,1,1,1,1,1,1,1,4,1,1,4,4),
number.of.children.ever.born= c(NA,2,NA,NA,NA,NA,NA,6,NA,NA,NA,2,NA,NA,
3,NA,3,NA,NA,2,NA,NA),
number.of.living.children = c(NA,2,NA,NA,NA,NA,NA,4,NA,NA,NA,2,NA,NA,
3,NA,3,NA,NA,2,NA,NA))
It seems cruel, but let's start by separating the mothers and the children from their families:
df_mothers <- sample %>%
filter(Relatshinship == 2) %>%
print()
# A tibble: 7 × 9
household.ID Member.ID Relatshinship birth.year Gender age marriage.stuatus number.of.children.ever.born number.of.living.children
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 16523634 16529443 2 1348 2 47 1 2 2
2 16525912 16529230 2 1367 2 28 1 NA NA
3 16540127 16545263 2 1319 2 75 1 6 4
4 16611764 16642611 2 1342 2 52 1 2 2
5 16643309 16650210 2 1336 2 58 1 3 3
6 16652356 16665128 2 1326 2 68 1 3 3
7 16672105 16681528 2 1352 2 43 1 2 2
df_children <- sample %>%
filter(Relatshinship == 3) %>%
group_by(household.ID) %>%
arrange(household.ID,desc(age)) %>%
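mutate(birth.order = scales::label_ordinal()(row_number())) %>%  # ordinal() is not attached by library(tidyverse); assuming the scales package was intended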
select(-c(marriage.stuatus, number.of.children.ever.born, number.of.living.children)) %>%
print()
# A tibble: 7 × 7
# Groups: household.ID [5]
household.ID Member.ID Relatshinship birth.year Gender age birth.order
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 16523634 16532250 3 1376 1 19 1st
2 16523634 16534992 3 1377 2 18 2nd
3 16598050 16620223 3 1374 1 21 1st
4 16611764 16650837 3 1367 1 28 1st
5 16652356 16668381 3 1367 2 27 1st
6 16672105 16685073 3 1372 1 23 1st
7 16672105 16687491 3 1381 1 13 2nd
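OK, that gives you the children grouped by household with their birth order, but you want to squeeze the children into one row per household (like my sisters squeezed into one bedroom), so `pivot_wider()` to the rescue: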
df_children_pivot <- df_children %>%
pivot_wider(id_cols = household.ID,
names_from = birth.order,
names_glue = "{birth.order}_born",
values_from = age) %>%
print()
# A tibble: 5 × 3
# Groups: household.ID [5]
household.ID `1st_born` `2nd_born`
<dbl> <dbl> <dbl>
1 16523634 19 18
2 16598050 21 NA
3 16611764 28 NA
4 16652356 27 NA
5 16672105 23 13
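And now, because our cruelty only goes so far, let's take the children we tore from their families and squeezed into a single row, and reunite them with their mothers: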
df_reunited <- df_mothers %>%
left_join(df_children_pivot, by = "household.ID") %>%
select(household.ID,Member.ID,birth.year,Gender,age,`1st_born`:last_col()) %>%
print()
# A tibble: 7 × 7
household.ID Member.ID birth.year Gender age `1st_born` `2nd_born`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 16523634 16529443 1348 2 47 19 18
2 16525912 16529230 1367 2 28 NA NA
3 16540127 16545263 1319 2 75 NA NA
4 16611764 16642611 1342 2 52 28 NA
5 16643309 16650210 1336 2 58 NA NA
6 16652356 16665128 1326 2 68 27 NA
7 16672105 16681528 1352 2 43 23 13
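Now, I assumed above that `left_join()` is appropriate and that every child has a mother present, but I imagine you may have orphans with no mother in the household, so depending on the actual data you may need to adjust which kind of join you use. But I think this is what you were after.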
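One way to check that assumption before settling on a join is to list the rows that would fall through it. The sketch below is not part of the pipeline above: it rebuilds three households from the question's data as a small data frame (the names `mini`, `mothers`, `children`, `unmatched_children` and `mothers_without_children` are made up for this example) and uses `anti_join()` to find children whose household has no row with `Relatshinship == 2`, as well as mothers with no child recorded in the household.

```r
library(dplyr)

# Three households copied from the question's data, rebuilt here so the
# example runs on its own (hypothetical object name: mini).
mini <- data.frame(
  household.ID  = c(16523634, 16523634, 16523634, 16523634,
                    16525912, 16525912, 16598050, 16598050),
  Member.ID     = c(16527193, 16529443, 16532250, 16534992,
                    16527527, 16529230, 16616975, 16620223),
  Relatshinship = c(1, 2, 3, 3, 1, 2, 1, 3),
  age           = c(49, 47, 19, 18, 38, 28, 66, 21)
)

mothers  <- filter(mini, Relatshinship == 2)
children <- filter(mini, Relatshinship == 3)

# Children whose household contains no row coded as the head's wife;
# a left_join() starting from the mothers would silently drop these.
unmatched_children <- anti_join(children, mothers, by = "household.ID")
print(unmatched_children)

# Mothers with no child recorded in the same household;
# after the left_join() they end up with NA in every child column.
mothers_without_children <- anti_join(mothers, children, by = "household.ID")
print(mothers_without_children)
```

If `unmatched_children` is not empty on the real data, a `full_join()` starting from the pivoted children table would keep those households, at the cost of rows with no mother information.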