Asked by: David · Asked: 6/21/2023 · Updated: 7/9/2023 · Views: 68
R join two data.table with exact on one column and fuzzy on second
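Q:
I am working with two data.tables:
- predicted yield over age, based on various stand conditions
- field measurements of yield at specific field locations, with the measured age
I want to find the yield curve that best predicts the yield measured in the field data.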
library(fuzzyjoin)
library(data.table)
library(ggplot2)
# set up some dummy yield curves
species <- c("a", "b")
age <- seq(0:120) # note: seq() on a vector gives 1:121 here; use 0:120 if ages should start at 0
s <- 10:12 # site difference
yield_db <- data.table(expand.grid(species=species, s=s, age=age))[
  order(species, age, s)]
yield_db[species=="a", yield := 1.5*age+age*s*3]
yield_db[species=="b", yield := 0.75*age+age*s*2]
yield_db[, yc_id := .GRP, by = .(species, s)] # add a unique identifier
# generate some measurements - just add some noise to some sample yields
set.seed(1)
num_rows <- 3 # Set the desired number of rows
measurement_db <- yield_db[age>20][sample(.N,num_rows)]
measurement_db[,yield:=yield+runif(num_rows, min=-40, max=40)]
measurement_db[,age:=age+round(runif(num_rows, min=-5, max=5),0)]
# Plot the "measurements" against "yields"
ggplot(data = yield_db, aes(x=age, y=yield, colour=as.factor(yc_id))) +
  geom_line() +
  geom_point(data=measurement_db, aes(x=age, y=yield), colour="orange")
# Join to nearest yield
res <- difference_left_join(
  measurement_db,
  yield_db,
  by=c("yield")
)
> res
  species.x s.x age.x  yield.x yc_id.x species.y s.y age.y yield.y yc_id.y
1         a  12    60 2375.364       3         b  12    96 2376.00       6
2         b  11    86 2035.079       5         a  11    59 2035.50       2
3         b  12    78 1845.943       6         b  10    89 1846.75       4
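What I would like to do is force the join to keep the age the same (i.e. age.x == age.y) and the species the same (i.e. species.x == species.y), and then find the yield curve with the closest matching yield.
Thanks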
A:
1 vote
Jon Spring
6/21/2023
#1
If the yield curves are all available at the ages you want to compare, you can join them and pick the closest match.
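library(dplyr) # join_by() and the by argument of slice_min() need dplyr >= 1.1.0
measurement_db %>%
  # exact match on species and age
  left_join(yield_db, join_by(species, age)) %>%
  # keep the curve whose yield is closest to each measurement
  slice_min(abs(yield.y - yield.x), by = c(species, yc_id.x))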
   species s.x age  yield.x yc_id.x s.y yield.y yc_id.y
1:       a  12  60 2375.364       3  12  2250.0       3
2:       b  11  86 2035.079       5  11  1956.5       5
3:       b  12  78 1845.943       6  11  1774.5       5
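The same join-then-pick-closest idea can also be sketched in data.table alone. This is only an illustration on made-up toy data: the tables `curves` and `obs`, the `obs_id` column, and the helper column `diff` are assumptions, not part of the original post.

```r
library(data.table)

# Hypothetical toy curves: two candidates for species "a", one for "b"
curves <- data.table(
  species = c("a", "a", "b"),
  age     = c(10L, 10L, 10L),
  yc_id   = c(1L, 2L, 3L),
  yield   = c(100, 130, 90)
)

# Hypothetical measurements, each with its own id
obs <- data.table(
  obs_id  = 1:2,
  species = c("a", "b"),
  age     = c(10L, 10L),
  yield   = c(121, 95)
)

# Exact join on species and age: one row per (measurement, candidate curve).
# The measurement's yield arrives as i.yield because the name clashes.
cand <- curves[obs, on = .(species, age)]

# Absolute gap between curve yield and measured yield
cand[, diff := abs(yield - i.yield)]

# Keep the closest candidate curve for each measurement
best <- cand[, .SD[which.min(diff)], by = obs_id]
print(best)
```

Grouping by a per-measurement id rather than by species keeps two measurements of the same species from competing with each other.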
2 votes
David F
7/9/2023
#2
You can also use data.table's rolling join feature:
library(data.table)
setDT(yield_db)
setDT(measurement_db)
yield_db[measurement_db, on=c('species', 'age', 'yield'), roll='nearest']
   species     s   age    yield yc_id   i.s i.yc_id
    <fctr> <int> <int>    <num> <int> <int>   <int>
1:       a    12    60 2375.364     3    12       3
2:       b    11    86 2035.079     5    11       5
3:       b    11    78 1845.943     5    12       6
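In a rolling join, the columns in `on` are matched exactly except the last one, which is the one `roll` applies to. The join columns in the result carry the values from the inner table (here the measurements), so the `yield` column above shows the measured yield rather than the curve's yield. A minimal sketch on made-up toy data, assuming you want to keep the curve's own yield as well; `curves`, `obs`, and `curve_yield` are hypothetical names:

```r
library(data.table)

# Hypothetical toy curves: two candidates for species "a", one for "b"
curves <- data.table(
  species = c("a", "a", "b"),
  age     = c(10L, 10L, 10L),
  yc_id   = c(1L, 2L, 3L),
  yield   = c(100, 130, 90)
)

# Copy the curve yield first, because the join column `yield`
# will show the measurement's value in the result
curves[, curve_yield := yield]

# Hypothetical measurements
obs <- data.table(
  species = c("a", "b"),
  age     = c(10L, 10L),
  yield   = c(121, 95)
)

# species and age match exactly; yield (last in `on`) rolls to the nearest value
res <- curves[obs, on = .(species, age, yield), roll = "nearest"]
print(res)
```

Here the first measurement (yield 121) lands on curve 2 (yield 130) rather than curve 1 (yield 100), since 130 is the nearer value within the same species and age.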
Comments
0 votes
David
7/18/2023
Amazing. This gives the expected result and works instantly at very large scale. Thanks.
1 vote
David F
7/20/2023
data.table operations are amazingly fast