Efficient comparison of two values several rows apart of one variable in R

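Asked by: ilka  Asked: 9/9/2022  Updated: 9/9/2022  Views: 66

Q:

I am using R version 4.2.1 and I have a working solution for what I want to achieve (see below). However, it is extremely inefficient and would run for ~4 days to generate just one variable. Hence, I am looking for a more efficient way to achieve my desired outcome.

Data and problem description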

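I have about 500,000 observations of 700 firms over several time periods in a `data.table`. My data are uniquely identified by `firm_id`, `period`, and `destination`. I am interested in whether and when these firms start operating at a specific destination. I know in which period a firm is operating at which destination. This information is provided by combining `destination` with another already existing variable called `destination_presence`. `destination_presence` is stored as `numeric` and indicates whether a firm is operating at the destination given by `destination`. `destination_presence` can be `NA`, `1` (= firm operates at the respective destination), or `0` (= firm does not operate at the respective destination). `destination` is a `factor` with 66 levels (e.g., "USA", "Canada", ...), which is why there are 66 observations in the data set for every `firm_id`-`period` combination.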

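My new variable `internationalization` can be `NA`, `1` (= firm started operating at the respective destination in the current period), or `0` (= firm did not start operating at the respective destination in the current period). Hence, `internationalization == 1` only occurs in the period in which a firm starts operating at a specific destination. Note that this can happen more than once: a firm may start operating at destination D in period 2, leave destination D in period 4, and re-enter destination D in period 9.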
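The rule above can be sketched for a single firm-destination series in base R. The vector `presence` below is made-up illustration data that follows the period 2 / 4 / 9 example, and `previous` is a hypothetical helper holding the value one period earlier; treating "no earlier observation" like "not present before" is an assumption taken from the loop further down.

```r
# Made-up presence series for one firm at destination D, periods 1 to 9:
# enters in period 2, leaves in period 4, re-enters in period 9.
presence <- c(0, 1, 1, 0, 0, 0, 0, 0, 1)

# Value of the same series one period earlier (NA for the first period).
previous <- c(NA, head(presence, -1))

# 1 only when the firm is present now and was not present one period earlier;
# NA when the current presence is unknown.
internationalization <- ifelse(
  is.na(presence),
  NA,
  as.numeric(presence == 1 & (is.na(previous) | previous != 1))
)

print(internationalization)
# 0 1 0 0 0 0 0 0 1
```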

Here is a short example of the data:

Data example

#load packages
library(data.table)

dt <- as.data.table(
  structure(list(
  firm_id = structure(as.factor(c(rep("f1", 18), rep("f2", 18), rep("f3", 18), rep("f4", 18)))),
  period = structure(as.factor(c(rep("3", 6), rep("5", 6), rep("6", 6), rep("1", 6), rep("2", 6), rep("3", 6), rep("0", 6), rep("1", 6), rep("2", 6), rep("7", 6), rep("8", 6), rep("9", 6)))), 
  min_period = structure(c(rep(3, 18), rep(1, 18), rep(0, 18), rep(7, 18))),
  destination = structure(as.factor(c("usa", "chile", "austria", "kenya", "china", "new zealand", "usa", "chile", "austria", "kenya", "china", "new zealand","usa", "chile", "austria", "kenya", "china", "new zealand","usa", "chile", "austria", "kenya", "china", "new zealand","usa", "chile", "austria", "kenya", "china", "new zealand","usa", "chile", "austria", "kenya", "china", "new zealand","usa", "chile", "austria", "kenya", "china", "new zealand","usa", "chile", "austria", "kenya", "china", "new zealand","usa", "chile", "austria", "kenya", "china", "new zealand","usa", "chile", "austria", "kenya", "china", "new zealand","usa", "chile", "austria", "kenya", "china", "new zealand","usa", "chile", "austria", "kenya", "china", "new zealand"))),
  destination_presence = structure(c(rep(NA, 6), 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, rep(NA, 6), 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1,0, 0, 1, 1, 1, 1, rep(NA, 6)), class = "numeric")),
  .Names = c("firm_id", "period", "min_period", "destination", "destination_presence" ), row.names = c(NA, 5), class = "data.table"))

Current approach

# load packages
library(data.table) 

# order data by firm_id, period, and destination to make sure that all data are similarly ordered
dt <-
  dt[with(dt, order(firm_id, period, destination)), ]

# Step 1: fill first variable for minimum periods as in these cases there is no prior period with which to compare
dt[, internationalization := ifelse(
  period == min_period & # min_period is the minimum period for a specific firm
    destination_presence == 1,
  1,
  NA
)]

# show internationalization variable output
summary(as.factor(dt$internationalization))

# Step 2:
# there are 6 rows for every firm_id-period combination because there are 6 different levels in the factor variable destination (i.e., 6 different countries) in the example data set 
# hence, for the first 6 rows there are no prior ones to compare with. therefore, start in row 7
for (i in 7:nrow(dt)) {
print(i) # print i to know about progress of loop
dt$internationalization[i] <-
# a) if there is already a value in internationalization, keep this value (output from Step 1)
ifelse(
!is.na(dt$internationalization[i]),
dt$internationalization[i],
# b) if there is no information on the international operation destinations of a firm in the current period, insert NA in internationalization
ifelse(
is.na(dt$destination_presence[i]),
NA,
# c) if in prior period (i-6 because of 6 country levels per firm_id-period entry) there are no information on destination presence, treat observations as first internationalization
ifelse(
is.na(dt$destination_presence[i - 6]) & dt$firm_id[i] == dt$firm_id[i - 6],
dt$destination_presence[i],
# d) if in last period (i - 6) a specific firm was not operating at a specific destination (dt$destination_presence[i - 6] != 1) and is operating at this specific destination in the current period (dt$destination_presence[i] == 1), set internationalization == 1
ifelse(
(dt$destination_presence[i] == 1) & (dt$destination_presence[i - 6] != 1) & (dt$firm_id[i] == dt$firm_id[i - 6]),
1,
0
)
)
)
)
}
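A vectorized sketch of the same row-offset comparison, assuming the panel is balanced and sorted by `firm_id`, `period`, `destination`, so that the previous period always sits exactly `k` rows above. The toy data, `k`, and the helper columns `prev` and `same_firm` are made up for illustration; `shift()` and `fcase()` are data.table functions (`fcase()` needs a reasonably recent data.table). One known difference from the loop: for the very first firm, the loop leaves first-period rows with `destination_presence == 0` as `NA`, whereas this sketch returns `0` there.

```r
library(data.table)

# Toy data (hypothetical): 2 firms x 3 periods x 2 destinations, already
# sorted by firm_id, period, destination.
toy <- data.table(
  firm_id     = rep(c("f1", "f2"), each = 6),
  period      = rep(rep(1:3, each = 2), times = 2),
  destination = rep(c("chile", "usa"), times = 6),
  destination_presence = c(NA, NA, 0, 1, 1, 1,   # firm f1
                           1, 0, NA, NA, 1, 1)   # firm f2
)
k <- 2L  # number of destinations per firm_id-period combination

# Presence k rows earlier, and whether that row belongs to the same firm
# (both are NA for the first k rows).
toy[, prev := shift(destination_presence, k)]
toy[, same_firm := firm_id == shift(firm_id, k)]

toy[, internationalization := fcase(
  is.na(destination_presence), NA_real_,
  destination_presence == 1 &
    (is.na(same_firm) | !same_firm | is.na(prev) | prev != 1), 1,
  default = 0
)]

print(toy$internationalization)
# NA NA 0 1 1 0 1 0 NA NA 1 1
```

The whole column is computed in one pass, so there is no per-row assignment and no `print(i)` inside a loop.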

Desired outcome

This should match the result of the approach above.

# desired outcome
desired_dt <- as.data.table(
  structure(list(
    firm_id = structure(as.factor(c(rep("f1", 18), rep("f2", 18), rep("f3", 18), rep("f4", 18)))),
    period = structure(as.factor(c(rep("3", 6), rep("5", 6), rep("6", 6), rep("1", 6), rep("2", 6), rep("3", 6), rep("0", 6), rep("1", 6), rep("2", 6), rep("7", 6), rep("8", 6), rep("9", 6)))), 
    min_period = structure(c(rep(3, 18), rep(1, 18), rep(0, 18), rep(7, 18))),
    destination = structure(as.factor(c("usa", "chile", "austria", "kenya", "china", "new zealand", "usa", "chile", "austria", "kenya", "china", "new zealand","usa", "chile", "austria", "kenya", "china", "new zealand","usa", "chile", "austria", "kenya", "china", "new zealand","usa", "chile", "austria", "kenya", "china", "new zealand","usa", "chile", "austria", "kenya", "china", "new zealand","usa", "chile", "austria", "kenya", "china", "new zealand","usa", "chile", "austria", "kenya", "china", "new zealand","usa", "chile", "austria", "kenya", "china", "new zealand","usa", "chile", "austria", "kenya", "china", "new zealand","usa", "chile", "austria", "kenya", "china", "new zealand","usa", "chile", "austria", "kenya", "china", "new zealand"))),
    destination_presence = structure(c(rep(NA, 6), 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, rep(NA, 6), 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1,0, 0, 1, 1, 1, 1, rep(NA, 6)), class = "numeric"),
    internationalization = structure(c(rep(NA, 6), 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, rep(NA, 6), rep(0, 5), 1, rep(0,6), 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, rep(NA, 6)))),            
    .Names = c("firm_id", "period", "min_period", "destination", "destination_presence", "internationalization"), row.names = c(NA, 6), class = "data.table"))

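I am looking forward to your suggestions on how to make the code more efficient!

r performance for-loop comparison

Comments

0 upvotes IRTFM 9/9/2022
I didn't think data.table functions had a concept of row ordering. Do you have a link to documentation that would correct my aging notions on this point?
0 upvotes ilka 9/14/2022
Are you referring to this code chunk: `dt <- dt[with(dt, order(firm_id, period, destination)), ]`? It is not specific to data.table; I found it on Stack Overflow a few months ago while working on another problem. I also just tried using `setkey()` to set the identifiers of the data set, but then the for loop actually produced wrong output. I am not sure whether this answers your comment. I will look for the Stack Overflow post.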

Answers:

1 upvote jblood94 9/9/2022 #1

This can be done with a single chained command using `data.table`'s `setorder` and `shift` functions. It will be very fast.

setorder(dt, firm_id, destination, period)[
  , internationalization := destination_presence * (
    firm_id != shift(firm_id, 1, "") |
      destination != shift(destination, 1, "") |
      !pmax(0, shift(destination_presence), na.rm = TRUE)
  )
]

Note that `min_period` is not used.
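For readers who find the chained expression dense, here is a more verbose sketch of the same idea: sort by firm, destination, and period, take the previous value within each firm-destination group, and compare. It is not the answer's exact code; the toy data and the helper column `prev` are made up for illustration, and grouping with `by` is swapped in for the explicit firm/destination inequality checks. Because the lag is taken per group, this sketch does not rely on every firm having the same number of destinations.

```r
library(data.table)

# Toy data (hypothetical), deliberately unbalanced: f2 only has "usa".
toy <- data.table(
  firm_id     = rep(c("f1", "f2"), times = c(6, 3)),
  destination = c(rep(c("chile", "usa"), 3), rep("usa", 3)),
  period      = c(1, 1, 2, 2, 3, 3, 1, 2, 3),
  destination_presence = c(NA, NA, 0, 1, 1, 1, 1, NA, 1)
)

# Sort by reference so that consecutive rows of one firm-destination pair
# are consecutive periods.
setorder(toy, firm_id, destination, period)

# Previous period's presence within each firm-destination pair
# (NA for the first period of each pair).
toy[, prev := shift(destination_presence), by = .(firm_id, destination)]

toy[, internationalization := fcase(
  is.na(destination_presence), NA_real_,
  destination_presence == 1 & (is.na(prev) | prev != 1), 1,
  default = 0
)]

print(toy$internationalization)
# NA 0 1 NA 1 0 1 NA 1
```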

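Comments

1 upvote ilka 9/14/2022
Great! Thank you so much for your solution. It is really fast.
1 upvote Bushidov 9/9/2022 #2

Edited: the code from @jblood94 is included in the performance comparison below.

The `for` loop is the culprit slowing down the code here. An alternative using the `tidyverse` will help speed this up.

Code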


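library(dplyr) # provides %>%, filter(), group_by(), mutate(), lag(), select()

dt= as.data.frame(dt) #transform your data into a data frame
dt$id = 1:nrow(dt) # Add a unique row id to select them later
dt$period = as.numeric(dt$period) # Change the factor into numeric (note: this yields the factor's integer level codes, not the period labels)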

#Create an intermediate dataframe only with the data of interest 
temp = dt %>% filter(destination_presence == 1) %>% 
  group_by(firm_id, destination) %>% 
  mutate(b = ifelse(lag(period)==period-1, 0, 1), #if period are consecutive transform to 0
         int = ifelse(is.na(b)|b==1, 1, 0))%>% #the final internationalization variable to be added in the original data frame
  select(-b) #remove the useless column

dt$inter = dt$destination_presence # Create the internationalization column based on the destination
dt[temp$id, "inter"] = temp$int # Transfer the column for the identified rows above

dt
    firm_id period min_period destination destination_presence internationalization
 1:      f1      3          3     austria                   NA                   NA
 2:      f1      5          3     austria                    0                    0
 3:      f1      6          3     austria                    0                    0
 4:      f1      3          3       chile                   NA                   NA
 5:      f1      5          3       chile                    0                    0
 6:      f1      6          3       chile                    0                    0
 7:      f1      3          3       china                   NA                   NA
 8:      f1      5          3       china                    0                    0
 9:      f1      6          3       china                    0                    0
10:      f1      3          3       kenya                   NA                   NA
11:      f1      5          3       kenya                    1                    1
12:      f1      6          3       kenya                    1                    0
13:      f1      3          3 new zealand                   NA                   NA
14:      f1      5          3 new zealand                    1                    1
15:      f1      6          3 new zealand                    1                    0
16:      f1      3          3         usa                   NA                   NA
17:      f1      5          3         usa                    0                    0
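A side note on the `as.numeric(dt$period)` step in the code above: `period` is a factor in the example data, and converting a factor directly returns its integer level codes rather than the period labels. The short sketch below uses a made-up factor to show the difference; whether codes or labels are wanted for the "consecutive period" check is a judgment call that the original answer does not spell out.

```r
# Made-up factor with the period labels of firm f1 from the example data.
period <- factor(c("3", "5", "6"))

codes  <- as.numeric(period)                # integer level codes
labels <- as.numeric(as.character(period))  # the labels converted to numbers

print(codes)
# 1 2 3
print(labels)
# 3 5 6
```

With the codes, periods 3 and 5 look consecutive (1 and 2); with the labels, they do not.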

Performance


Edited: Code from @jblood94 added as fast_function

I wrapped up your code as `old_function` and the code above as `new_function`. Your code is actually faster to run on the example data frame you provided. However, as the number of rows increases, `new_function` is far more efficient.
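A comparison of this kind can be sketched with `system.time()` on simulated data. Everything below is an assumption for illustration: the panel dimensions, the random data, and the helper names `loop_version` and `shift_version` (which are not the `old_function`, `new_function`, or `fast_function` from this answer). No timings are quoted here because they depend on the machine and the data size.

```r
library(data.table)

set.seed(1)
n_firms <- 20L; n_periods <- 5L; n_dest <- 6L

# Simulated balanced panel; CJ() returns all combinations, sorted by
# firm_id, period, destination.
sim <- CJ(firm_id = seq_len(n_firms),
          period = seq_len(n_periods),
          destination = seq_len(n_dest))
sim[, destination_presence := sample(c(0, 1, NA), .N, replace = TRUE)]

# Hypothetical helper: row-by-row loop comparing with the row n_dest above.
loop_version <- function(d) {
  x <- d$destination_presence
  out <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    if (is.na(x[i])) next
    has_prev <- i > n_dest && d$firm_id[i] == d$firm_id[i - n_dest]
    prev_one <- has_prev && !is.na(x[i - n_dest]) && x[i - n_dest] == 1
    out[i] <- as.numeric(x[i] == 1 && !prev_one)
  }
  out
}

# Hypothetical helper: grouped lag with shift(), evaluated for all rows at once.
shift_version <- function(d) {
  d <- copy(d)  # leave the input untouched
  d[, prev := shift(destination_presence), by = .(firm_id, destination)]
  d[, fcase(is.na(destination_presence), NA_real_,
            destination_presence == 1 & (is.na(prev) | prev != 1), 1,
            default = 0)]
}

print(system.time(res_loop <- loop_version(sim)))
print(system.time(res_shift <- shift_version(sim)))

# Both helpers should agree before their timings are compared.
stopifnot(isTRUE(all.equal(res_loop, res_shift)))
```

Increasing `n_firms` and `n_dest` toward the sizes in the question (700 firms, 66 destinations) is where a gap between the two approaches would be expected to show.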

Comments

0 upvotes ilka 9/14/2022
Thanks so much for this alternative and shedding some light into the different performance outcomes of the approaches!