How to match data in R based on a specific variable

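Asked by: Mohammad Haddadi  Asked: 8/13/2023  Last edited by: Mohammad Haddadi  Updated: 8/15/2023  Views: 107

Q:

I have a sample file containing 2% of the total population from the 2016 Iranian census (1.5 million records out of a total population of 75 million). Below are 22 individuals as an example: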

sample <- structure(list(household.ID = c(16523634, 16523634, 16523634, 16523634,16525912, 
    16525912, 16540127,16540127, 16598050, 16598050, 16611764,16611764, 16611764, 16643309, 
    16643309, 16652356, 16652356,16652356, 16672105, 16672105, 16672105,16672105
    ),Member.ID= c(16527193, 16529443, 16532250, 16534992,16527527, 16529230, 
    16542499,16545263, 16616975, 16620223, 16633984,16642611, 16650837, 16646986, 16650210, 
    16660335, 16665128,16668381, 16676674, 16681528, 16685073,16687491
    ),Relatshinship= c(1,2,3,3,1,2,1,2,1,3,1,2,3,1,2,1,2,3,1,2,3,3),birth.year= 
    c(1346,1348,1376,1377,1357,1367,1316,1319,1329,1374,1339,1342,1367,1343,1336 ,1321 
     ,1326,1367,1338,1352,1372,1381),Gender  = c(1,2 ,1,2,1,2,1 ,2,1 ,1,1 ,2,1 ,1 ,2 ,1,2 ,2,1 
     ,2,1,1),age    = c(49,47,19 ,18,38,28,78,75,66 ,21,56 ,52 ,28,51 ,58 ,74 ,68 ,27  ,56 
    ,43,23 ,13),marriage.stuatus= c(1,1 ,4 ,4 ,1,1,1,1,2,4,1,1 ,1,1 ,1,1,1 ,4,1 ,1,4 ,4),   
    number.of.children.ever.born= c(NA,2,NA,NA,NA,NA,NA,6,NA,NA,NA,2,NA,NA,3,NA,3,NA,NA,2,NA,NA),    
    number.of.living.children  = c(NA,2,NA,NA,NA,NA,NA,4,NA,NA,NA,2,NA,NA,3,NA,3,NA,NA,2,NA, NA)), 
    row.names = c(NA, -22L),class = "data.frame")

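I want to build a birth history for women. To do that, I need to match children to their mothers. My data has a column giving each person's relationship to the head of household: code 1 is the head of household, 2 is the head's wife, 3 is a child, 4 is a son-in-law or daughter-in-law, 5 is a grandchild, 6 is the father or mother, and so on. For example, in my data the first household (ID: 16523634) has 4 members: the head of household, here a man (gender code: 1 = male, 2 = female), his wife, and two children, a son (age 19) and a daughter (age 18). Long story short, I need to match the children (who also appear in the data) to their mothers within the household, so that I can create columns for each mother giving the age of each matched child. I would like my data to end up looking like this: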

H.ID      M.ID      birth year  gender  age  1st child  2nd child  3rd child
16523634  16529443  1348        2       47   19         18

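A:

2 votes ScottyJ 8/14/2023 #1

Starting data

When I loaded `sample` from the OP, two of the vectors in the data frame were only length 21, so I added some `NA`s just to get a valid starting point. This is what I used: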

library(tidyverse)

sample <-
  tibble(
    household.ID = c(
      16523634,16523634,16523634,16523634,16525912,16525912,
      16540127,16540127,16598050,16598050,16611764,16611764,
      16611764,16643309,16643309,16652356,16652356,16652356,
      16672105,16672105,16672105,16672105),
    Member.ID = c(
      16527193,16529443,16532250,16534992,16527527,16529230,
      16542499,16545263,16616975,16620223,16633984,16642611,
      16650837,16646986,16650210,16660335,16665128,16668381,
      16676674,16681528,16685073,16687491),
    Relatshinship = c(1,2,3,3,1,2,1,2,1,3,1,2,3,1,2,1,2,3,1,2,3,3),
    birth.year = c(1346,1348,1376,1377,1357,1367,1316,1319,1329,1374,1339,
                   1342,1367,1343,1336,1321,1326,1367,1338,1352,1372,1381),
    Gender = c(1,2,1,2,1,2,1,2,1,1,1,2,1,1,2,1,2,2,1,2,1,1),
    age = c(49,47,19,18,38,28,78,75,66,21,56,52,28,51,58,74,68,27,56,43,23,13),
    marriage.stuatus = c(1,1,4,4,1,1,1,1,2,4,1,1,1,1,1,1,1,4,1,1,4,4),
    number.of.children.ever.born = c(NA,2,NA,NA,NA,NA,NA,6,NA,NA,NA,2,NA,NA,
                                     3,NA,3,NA,NA,2,NA,NA),
    number.of.living.children = c(NA,2,NA,NA,NA,NA,NA,4,NA,NA,NA,2,NA,NA,
                                  3,NA,3,NA,NA,2,NA,NA))
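
If a column list comes with vectors of unequal length, one way to find the short ones before building the data frame is base R's `lengths()`. This is only a sketch on a made-up two-column list (not the OP's actual vectors), and padding at the end assumes the missing values belong to the last rows, which may not hold for real data:

```r
# Hypothetical column list in which one vector is a value short
cols <- list(
  age = c(49, 47, 19, 18),
  number.of.living.children = c(NA, 2, NA)
)

# Length of every column, and the names of those shorter than the longest
col_lengths <- lengths(cols)
short <- names(col_lengths)[col_lengths < max(col_lengths)]
print(short)

# Assumption: the missing entries are at the end, so pad with NA there
padded <- lapply(cols, function(x) {
  length(x) <- max(col_lengths)  # extending a vector fills with NA
  x
})

df <- as.data.frame(padded)
print(df)
```

Checking the lengths first makes it explicit which columns were padded, instead of discovering the problem later in the pipeline.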

It may seem cruel, but let's start by separating the mothers and the children from their families:

df_mothers <- sample %>% 
  filter(Relatshinship == 2) %>% 
  print()

# A tibble: 7 × 9
  household.ID Member.ID Relatshinship birth.year Gender   age marriage.stuatus number.of.children.ever.born number.of.living.children
         <dbl>     <dbl>         <dbl>      <dbl>  <dbl> <dbl>            <dbl>                        <dbl>                     <dbl>
1     16523634  16529443             2       1348      2    47                1                            2                         2
2     16525912  16529230             2       1367      2    28                1                           NA                        NA
3     16540127  16545263             2       1319      2    75                1                            6                         4
4     16611764  16642611             2       1342      2    52                1                            2                         2
5     16643309  16650210             2       1336      2    58                1                            3                         3
6     16652356  16665128             2       1326      2    68                1                            3                         3
7     16672105  16681528             2       1352      2    43                1                            2                         2

df_children <- sample %>% 
  filter(Relatshinship == 3) %>% 
  group_by(household.ID) %>% 
  arrange(household.ID,desc(age)) %>% 
  # ordinal() comes from the scales package, which library(tidyverse) does not attach
  mutate(birth.order = scales::ordinal(row_number())) %>% 
  select(-c(marriage.stuatus, number.of.children.ever.born, number.of.living.children)) %>%
  print()

# A tibble: 7 × 7
# Groups:   household.ID [5]
  household.ID Member.ID Relatshinship birth.year Gender   age birth.order
         <dbl>     <dbl>         <dbl>      <dbl>  <dbl> <dbl> <chr>      
1     16523634  16532250             3       1376      1    19 1st        
2     16523634  16534992             3       1377      2    18 2nd        
3     16598050  16620223             3       1374      1    21 1st        
4     16611764  16650837             3       1367      1    28 1st        
5     16652356  16668381             3       1367      2    27 1st        
6     16672105  16685073             3       1372      1    23 1st        
7     16672105  16687491             3       1381      1    13 2nd        

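OK, that gets you the children grouped by household with their birth order, but you want the children squeezed into one row per household (like my sisters crammed into one bedroom), so `pivot_wider()` to the rescue: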

df_children_pivot <- df_children %>% 
  pivot_wider(id_cols = household.ID,
              names_from = birth.order,
              names_glue = "{birth.order}_born",
              values_from = age) %>% 
  print()

# A tibble: 5 × 3
# Groups:   household.ID [5]
  household.ID `1st_born` `2nd_born`
         <dbl>      <dbl>      <dbl>
1     16523634         19         18
2     16598050         21         NA
3     16611764         28         NA
4     16652356         27         NA
5     16672105         23         13
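
Since a birth history usually needs birth years as well as ages, one possible extension is to pivot both columns at once. The following is only a sketch on a hypothetical four-row subset of the children table; the `{.value}` placeholder in `names_glue` inserts the name of the source column into each new column name, and the resulting column names are my own choice, not something the OP asked for:

```r
library(dplyr)
library(tidyr)

# Hypothetical subset of the children table, with birth order already assigned
children <- tibble(
  household.ID = c(16523634, 16523634, 16672105, 16672105),
  birth.year   = c(1376, 1377, 1372, 1381),
  age          = c(19, 18, 23, 13),
  birth.order  = c("1st", "2nd", "1st", "2nd")
)

# One row per household, with one age column and one birth.year column per child
children_wide <- children %>%
  pivot_wider(
    id_cols     = household.ID,
    names_from  = birth.order,
    values_from = c(age, birth.year),
    names_glue  = "{birth.order}_born_{.value}"
  )

print(children_wide)
```

The wide table can then be joined to the mothers by `household.ID` in the same way as the age-only version.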

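And now, because our cruelty only goes so far, having ripped the children from their homes and stuffed them into a single row, let's reunite them with their mothers: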

df_reunited <- df_mothers %>% 
  left_join(df_children_pivot, by = "household.ID") %>% 
  select(household.ID,Member.ID,birth.year,Gender,age,`1st_born`:last_col()) %>% 
  print()

# A tibble: 7 × 7
  household.ID Member.ID birth.year Gender   age `1st_born` `2nd_born`
         <dbl>     <dbl>      <dbl>  <dbl> <dbl>      <dbl>      <dbl>
1     16523634  16529443       1348      2    47         19         18
2     16525912  16529230       1367      2    28         NA         NA
3     16540127  16545263       1319      2    75         NA         NA
4     16611764  16642611       1342      2    52         28         NA
5     16643309  16650210       1336      2    58         NA         NA
6     16652356  16665128       1326      2    68         27         NA
7     16672105  16681528       1352      2    43         23         13
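
Only children who still live in the household can be matched this way, so it may be worth checking the result against `number.of.living.children`. A minimal sketch, using three mothers taken from the sample data; the column names `children.matched` and `children.elsewhere` are made up for illustration:

```r
library(dplyr)

# Three mothers from the sample, with their reported number of living children
reunited <- tibble(
  Member.ID                 = c(16529443, 16545263, 16681528),
  number.of.living.children = c(2, 4, 2),
  `1st_born`                = c(19, NA, 23),
  `2nd_born`                = c(18, NA, 13)
)

checked <- reunited %>%
  mutate(
    # Count the non-missing *_born columns in each row
    children.matched = rowSums(!is.na(across(ends_with("_born")))),
    # Living children reported by the mother but not found in the household
    children.elsewhere = number.of.living.children - children.matched
  )

print(checked)
```

A positive `children.elsewhere` suggests that the household roster alone gives an incomplete birth history for that mother.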

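Now, I assumed above that `left_join()` is appropriate and that every child has a mother present, but I imagine you may have orphans with no mother in the household, so depending on the actual data you may need to adjust which kind of join you use. But I think this is what you're after.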
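
To see which children a `left_join()` from the mothers would drop, `anti_join()` can list the households that have children but no mother row, and `full_join()` keeps them in the result. In the sample, household 16598050 is such a case: it has a child but no member with `Relatshinship == 2`. The sketch below uses small hand-built tables rather than the full pipeline, so the object names are illustrative only:

```r
library(dplyr)

# Two mothers and three households with children (one has no mother row)
mothers <- tibble(
  household.ID = c(16523634, 16611764),
  Member.ID    = c(16529443, 16642611),
  age          = c(47, 52)
)
children_wide <- tibble(
  household.ID = c(16523634, 16598050, 16611764),
  `1st_born`   = c(19, 21, 28),
  `2nd_born`   = c(18, NA, NA)
)

# Households whose children have no matching mother
unmatched <- anti_join(children_wide, mothers, by = "household.ID")
print(unmatched)

# full_join() keeps those households, with NA in the mother's columns
all_households <- full_join(mothers, children_wide, by = "household.ID")
print(all_households)
```

Whether those unmatched children should be kept depends on the analysis; for a birth history of women, listing them separately may be enough.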