Fast way to create dataframe in a rbind() style

Q:

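I am working with two survey datasets. The first (survey1) is a population/household survey (like the US CPS or the European EU-SILC) with about 150,000 observations; the second (survey2) is a household budget survey (like the European HBS) with about 50,000 observations. I have to impute household expenditures from survey2 into survey1. To do so, I created a distance-based match between the two surveys using household characteristics, in which each household in survey1 is matched to one household in survey2.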

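I have a fairly large dataset based on survey2 (over 2 million rows) with three columns: household ID, product ID, and expenditure value (the data frame is called survey2_expenditures). I am trying to create a similar dataset for survey1, in which each household ID in survey1 gets the imputed expenditure values for every product ID of its matched household in survey2. For instance, if household 12 in survey1 is matched to household 23 in survey2, the new dataset will have household 12's ID in the first column and household 23's product IDs and expenditure values in the other columns. That is, if the original survey2 dataset has: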

> survey2_expenditures %>% filter(id == 23)
   id product     value 
1  23    6001 81.700000
2  23    7001 50.286667
3  23    2400 88.356667
4  23   24022 33.973333
5  23   30001 160.00000
6  23   30002 24.380000
7  23   30014 57.910000

the new dataset I am trying to create would have:

> survey1_expenditures %>% filter(id == 12)
   id product     value 
1  12    6001 81.700000
2  12    7001 50.286667
3  12    2400 88.356667
4  12   24022 33.973333
5  12   30001 160.00000
6  12   30002 24.380000
7  12   30014 57.910000

Since the first survey is about three times the size of the second, I expect the new dataset to have about 6 million rows.

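I tried using a for loop to pull data from survey2's dataset and build the new survey1 expenditures dataset. First, I created an empty data frame with just the three column names: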

survey1_expenditures <- data.frame(matrix(ncol=3,nrow=0, 
                                       dimnames=list(NULL, c("id", "product", "value"))))

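Then I ran the following for loop, where matching is a data frame with one column containing the original household IDs from survey1 and another containing the matched household IDs from survey2: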

library(dplyr)

for (id_survey1 in matching$id_survey1) { # loop over each household ID in survey1
  # look up the matched survey2 household by ID value, not by row position
  id_survey2 <- matching$id_survey2[match(id_survey1, matching$id_survey1)]

  matched_expenditures <- survey2_expenditures %>%
    filter(id == id_survey2) %>% # keep the rows of the matched household
    mutate(id = id_survey1) %>% # replace survey2's ID with survey1's ID
    select(id, product, value) # keep only ID, product ID and expenditure value

  survey1_expenditures <- rbind(survey1_expenditures, matched_expenditures)
}

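Although it seems to work, it is very, very slow. I also tried building a list of data frames and then binding them with data.table's rbindlist(), but that was slow too. Is there a faster way to build the dataset I want?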

Tags: r, performance, rbind

A:

survey2_expenditures <- read.table(text = 
"id product     value 
1  23    6001 81.700000
2  23    7001 50.286667
3  23    2400 88.356667
4  23   24022 33.973333
5  23   30001 160.00000
6  23   30002 24.380000
7  23   30014 57.910000"
, header = TRUE)

survey1_expenditures <- read.table(text = 
"id product     value 
1  12    6001 81.700000
2  12    7001 50.286667
3  12    2400 88.356667
4  12   24022 33.973333
5  12   30001 160.00000
6  12   30002 24.380000
7  12   30014 57.910000"
, header = TRUE)

library(dplyr)

left_join(
  survey1_expenditures %>% select(id, value),
  survey2_expenditures %>% select(product, value),
  by = "value"
)[c(1L, 3L, 2L)]
#>   id product     value
#> 1 12    6001  81.70000
#> 2 12    7001  50.28667
#> 3 12    2400  88.35667
#> 4 12   24022  33.97333
#> 5 12   30001 160.00000
#> 6 12   30002  24.38000
#> 7 12   30014  57.91000

Created on 2023-10-06 with reprex v2.0.2
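
The reprex above joins the two toy tables on value. For the full problem, the same join idea can probably be applied through the matching table instead, so that no loop or repeated rbind() is needed. What follows is only a minimal sketch, assuming matching has the columns id_survey1 and id_survey2 described in the question. The toy rows are made up for illustration, and base R's merge() is used here just to keep the example free of dependencies:

```r
# Hypothetical toy data; only the column names come from the question.
matching <- data.frame(
  id_survey1 = c(12, 13, 14),
  id_survey2 = c(23, 23, 40)
)

survey2_expenditures <- data.frame(
  id      = c(23, 23, 40),
  product = c(6001, 7001, 6001),
  value   = c(81.7, 50.29, 10)
)

# A single join: each survey1 household picks up every expenditure row
# of its matched survey2 household.
matched <- merge(matching, survey2_expenditures,
                 by.x = "id_survey2", by.y = "id")

# Keep survey1's ID as the household ID and drop survey2's ID.
survey1_expenditures <- data.frame(
  id      = matched$id_survey1,
  product = matched$product,
  value   = matched$value
)

# Sort by household ID for readability.
survey1_expenditures <-
  survey1_expenditures[order(survey1_expenditures$id), ]
```

With dplyr loaded, inner_join(matching, survey2_expenditures, by = c("id_survey2" = "id")) expresses the same join. Either way, the work happens in one vectorised operation instead of one rbind() call per household.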
