Identifying matched unexposed observations to exposed observations from two different dataframes in R

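Question:

I have two dataframes in R that I need to use to find the two best matches for each exposed individual (one if only one is possible, zero if none is possible).

One dataframe (called df) has 1,000 observations, of which 34 are exposed and 966 are unexposed. It contains information on ID, exposure, sex, birth year, colour, and start date. The start date is only available for the 34 exposed observations (i.e. it is the date that defines the exposure).

Here is a reproducible example of such a dataframe: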

# Set the seed for reproducibility
set.seed(123)

# Create the ID column from 1 to 1000
id <- 1:1000

# Create the exposure column with exactly 966 zeros and 34 ones, in random order
exposure <- sample(rep(c(0, 1), times = c(966, 34)))

# Create the sex column with 55% Male and 45% Female
sex <- sample(c("Male", "Female"), size = 1000, replace = TRUE, prob = c(0.55, 0.45))

# Create the birthyear column with random numbers from 1961 to 1981
birthyear <- sample(1961:1981, size = 1000, replace = TRUE)

# Create the colour column with specified distribution
colour <- sample(c("green", "red", "blue", "yellow"), size = 1000, replace = TRUE, prob = c(0.35, 0.2, 0.3, 0.15))

# Create the date_start column: a random date for exposed rows, NA otherwise
# (filling a Date vector directly avoids ifelse(), which drops the Date class)
date_start <- rep(as.Date(NA), 1000)
date_start[exposure == 1] <- sample(seq(as.Date("2009-11-23"), as.Date("2020-05-11"), by = "days"), size = sum(exposure), replace = TRUE)

# Create the dataframe
df <- data.frame(ID = id, Exposure = exposure, Sex = sex, Birthyear = birthyear, Colour = colour, Date_Start = date_start)

# Print the first few rows of the dataframe
head(df)

The other dataframe (called id_date_combinations) covers all 1,000 IDs, with one row per ID, date (from 2018-01-01 to 2021-12-31), and measurement ("a", "b", or "c").

Here is a reproducible example of such a dataframe:

# Create a vector of all IDs
all_ids <- 1:1000

# Create a vector of all dates ranging from 2018-01-01 to 2021-12-31
all_dates <- seq(as.Date("2018-01-01"), as.Date("2021-12-31"), by = "days")

# Generate all combinations of IDs and dates
id_date_combinations <- expand.grid(ID = all_ids, Date = all_dates)

# Repeat each combination three times for measurements "a", "b", and "c"
id_date_combinations <- id_date_combinations[rep(1:nrow(id_date_combinations), each = 3), ]
id_date_combinations$Measurement <- rep(c("a", "b", "c"), times = nrow(id_date_combinations) / 3)

id_date_combinations$Value <- sample(10000, size = nrow(id_date_combinations), replace = TRUE)

# Print the first few rows of the dataframe
head(id_date_combinations)

# Count the rows per measurement type (the pipe and group_by() need dplyr)
library(dplyr)
id_date_combinations %>% group_by(Measurement) %>% tally()

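What needs to be done now is to match each exposed observation in df with all unexposed individuals that have the same sex, birth year, and colour. These candidates should then be checked in id_date_combinations: each must have a measurement of "a", "b", and "c" within 730 days after the Date_Start of the exposed observation in df.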
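One way to read these two steps, sketched in base R on a tiny made-up dataset. The names toy_df and toy_meas are hypothetical stand-ins for df and id_date_combinations. The sketch assumes that the 730-day window includes both end points and that a candidate qualifies only if all three measurement types occur inside the window.

```r
# Toy data (hypothetical, far smaller than the question's dataframes)
toy_df <- data.frame(
  ID         = c(1, 2, 3, 4, 5, 6),
  Exposure   = c(1, 0, 0, 0, 1, 0),
  Sex        = c("Male", "Male", "Male", "Female", "Female", "Female"),
  Birthyear  = c(1970, 1970, 1970, 1970, 1975, 1975),
  Colour     = c("red", "red", "red", "red", "blue", "blue"),
  Date_Start = as.Date(c("2019-01-01", NA, NA, NA, "2019-06-01", NA))
)
toy_meas <- data.frame(
  ID          = rep(c(2, 3, 6), each = 3),
  Date        = as.Date(rep(c("2019-04-01", "2022-01-01", "2019-08-01"), each = 3)),
  Measurement = rep(c("a", "b", "c"), times = 3),
  Value       = c(12, 18, 33, 50, 60, 70, 15, 25, 35)
)

# Step 1: pair every exposed row with the unexposed rows that share
# Sex, Birthyear and Colour
exposed   <- toy_df[toy_df$Exposure == 1, c("ID", "Sex", "Birthyear", "Colour", "Date_Start")]
unexposed <- toy_df[toy_df$Exposure == 0, c("ID", "Sex", "Birthyear", "Colour")]
names(exposed)[1]   <- "Exposed_ID"
names(unexposed)[1] <- "Unexposed_ID"
pairs <- merge(exposed, unexposed, by = c("Sex", "Birthyear", "Colour"))

# Step 2: attach the candidates' measurements and keep those dated within
# [Date_Start, Date_Start + 730] (inclusive bounds are an assumption)
cand <- merge(pairs, toy_meas, by.x = "Unexposed_ID", by.y = "ID")
cand <- cand[cand$Date >= cand$Date_Start & cand$Date <= cand$Date_Start + 730, ]

# Keep only pairs where "a", "b" and "c" all occur inside the window
n_types <- aggregate(Measurement ~ Exposed_ID + Unexposed_ID, data = cand,
                     FUN = function(m) length(unique(m)))
eligible <- n_types[n_types$Measurement == 3, c("Exposed_ID", "Unexposed_ID")]
eligible
```

In this toy data, unexposed ID 3 shares the covariates of exposed ID 1 but is dropped because its measurements fall outside the window, and unexposed ID 4 has no exposed counterpart at all. On the full id_date_combinations (several million rows), restricting it to the candidate IDs before merging would likely help.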

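Once all observations fulfilling this condition have been identified for each exposed observation, the two unexposed observations (one if only one is available, zero if none is available) whose values for "a", "b", and "c" are closest to those of the exposed observation should be selected. The result should be a new dataframe with one column for the ID of the exposed observation and one column for the ID of the matched unexposed observation, if any. Each exposed ID can therefore have 0 rows (if no match was identified), 1 row (if one match was identified), or 2 rows (if two matches were identified).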
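The question does not define "closest", so the following is only a sketch under two assumptions: each ID is summarised by one number per measurement type (for example the mean Value inside the window), and closeness is the Euclidean distance over the three numbers. The profiles dataframe and its values are made up for illustration.

```r
# Hypothetical per-ID summaries: one value per measurement type
profiles <- data.frame(
  ID = c(1, 2, 3, 4),
  a  = c(100, 110, 300, 90),
  b  = c(200, 190, 100, 210),
  c  = c(300, 305, 900, 280)
)
exposed_id    <- 1
candidate_ids <- c(2, 3, 4)

# Profile of the exposed observation and of its candidates
target   <- unlist(profiles[profiles$ID == exposed_id, c("a", "b", "c")])
is_cand  <- profiles$ID %in% candidate_ids
cand_mat <- as.matrix(profiles[is_cand, c("a", "b", "c")])

# Euclidean distance of every candidate to the exposed observation
dists <- sqrt(rowSums(sweep(cand_mat, 2, target)^2))
names(dists) <- profiles$ID[is_cand]

# Keep the two nearest; head() returns fewer when fewer candidates exist
best <- head(sort(dists), 2)
matches <- data.frame(
  Exposed_ID   = rep(exposed_id, length(best)),
  Unexposed_ID = as.numeric(names(best))
)
matches
```

Because the three measurements may sit on different scales, standardising each column before computing distances could be worth considering.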

Matching should be done without replacement, i.e. each unexposed observation can be selected only once.

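The matching process should proceed chronologically: unexposed observations should first be matched to the exposed observation with the earliest Date_Start, then to the next earliest, and so on.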
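The two rules above (no replacement, earliest Date_Start first) can be sketched as a greedy loop. The pairs and start_dates dataframes below are hypothetical: they assume the eligible pairs and a distance for each pair have already been computed.

```r
# Hypothetical eligible pairs with a precomputed distance
pairs <- data.frame(
  Exposed_ID   = c(10, 10, 10, 20, 20, 30),
  Unexposed_ID = c(1, 2, 3, 1, 4, 2),
  Distance     = c(5, 1, 3, 2, 9, 4)
)
start_dates <- data.frame(
  Exposed_ID = c(10, 20, 30),
  Date_Start = as.Date(c("2019-05-01", "2018-02-01", "2020-01-01"))
)

used   <- numeric(0)  # unexposed IDs that are already taken
result <- list()

# Visit exposed IDs from the earliest to the latest Date_Start
for (eid in start_dates$Exposed_ID[order(start_dates$Date_Start)]) {
  # Candidates for this exposed ID that have not been used yet
  avail  <- pairs[pairs$Exposed_ID == eid & !(pairs$Unexposed_ID %in% used), ]
  # Take up to two with the smallest distance
  chosen <- head(avail[order(avail$Distance), ], 2)
  used   <- c(used, chosen$Unexposed_ID)
  result[[length(result) + 1]] <- chosen[, c("Exposed_ID", "Unexposed_ID")]
}

matched <- do.call(rbind, result)
rownames(matched) <- NULL
matched
```

Here exposed ID 20 is served first and takes unexposed IDs 1 and 4, ID 10 then takes 2 and 3, and ID 30 ends up with no rows because its only candidate is already used. A greedy pass like this follows the chronological rule but is not guaranteed to minimise the total distance across all pairs.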

r propensity-score-matching

Answer: No answers yet
