R: join two data.tables with an exact match on one column and a fuzzy match on a second

Asked by: David  Asked: 6/21/2023  Updated: 7/9/2023  Views: 68

Q:

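I am working with two data.tables:

  1. Predicted yield as a function of age, for a range of stand conditions
  2. Field measurements of yield at specific field locations, together with the measured age

I would like to find the yield curve that best predicts the yield measured in the field data.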

library(fuzzyjoin)
library(data.table)
library(ggplot2)

# set up some dummy yield curves
species <- c("a", "b") 
age <- seq(0:120)  # note: evaluates to 1:121, because seq() on a vector returns its indices
s <- 10:12 # site difference
yield_db <- data.table(expand.grid(species=species, s=s, age=age))[
  order(species, age, s)]
yield_db[species=="a", yield := 1.5*age+age*s*3]
yield_db[species=="b", yield := 0.75*age+age*s*2]
yield_db[, yc_id := .GRP, by = .(species, s)] # add a unique identifier

# generate some measurements - just add some noise to some sample yields
set.seed(1)
num_rows <- 3  # Set the desired number of rows
measurement_db <- yield_db[age>20][sample(.N,num_rows)]
measurement_db[,yield:=yield+runif(num_rows, min=-40, max=40)]
measurement_db[,age:=age+round(runif(num_rows, min=-5, max=5),0)]

# Plot the "measurements" against "yields"
ggplot(data = yield_db, aes(x=age, y=yield, colour=as.factor(yc_id))) +
  geom_line() +
  geom_point(data=measurement_db, aes(x=age, y=yield), colour="orange")

# Join to nearest yield
res <- difference_left_join(
  measurement_db,
  yield_db,
  by=c("yield")
)

> res
  species.x s.x age.x  yield.x yc_id.x species.y s.y age.y yield.y yc_id.y
1         a  12    60 2375.364       3         b  12    96 2376.00       6
2         b  11    86 2035.079       5         a  11    59 2035.50       2
3         b  12    78 1845.943       6         b  10    89 1846.75       4
> 

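What I would like to do is force the join to keep the same age (i.e. age.x == age.y) and the same species (i.e. species.x == species.y), and then find the yield curve whose yield is the closest match.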

Thanks

r data.table fuzzyjoin

A:

1 vote  Jon Spring  6/21/2023  #1

If the yield curves are all available at the ages you want to compare, you can join them to the measurements and pick the closest match.

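library(dplyr)  # join_by() and the `by` argument of slice_min() need dplyr >= 1.1.0

# exact join on species and age, then keep the row with the smallest yield gap
measurement_db %>%
  left_join(yield_db, join_by(species, age)) %>%
  slice_min(abs(yield.y - yield.x), by = c(species, yc_id.x))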


   species s.x age  yield.x yc_id.x s.y yield.y yc_id.y
1:       a  12  60 2375.364       3  12  2250.0       3
2:       b  11  86 2035.079       5  11  1956.5       5
3:       b  12  78 1845.943       6  11  1774.5       5
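
The same join-then-filter idea can be sketched in data.table alone. This is only an illustration under assumptions: the toy tables `curves` and `obs` and the helper column `obs_id` are made up here and are not part of the question's data. Grouping by a per-measurement id, rather than by species and curve id, is a choice made in this sketch so that two measurements that happen to come from the same curve each keep their own closest match.

```r
library(data.table)

# Hypothetical toy data: two candidate curves (yc_id 1 and 2) at two ages.
curves <- data.table(
  species = c("a", "a", "a", "a"),
  age     = c(10L, 10L, 20L, 20L),
  yc_id   = c(1L, 2L, 1L, 2L),
  yield   = c(100, 200, 210, 400)
)
obs <- data.table(
  species = c("a", "a"),
  age     = c(10L, 20L),
  yield   = c(160, 250)
)
obs[, obs_id := .I]  # hypothetical row id, one per measurement

# Exact join on species and age: each measurement meets every candidate curve.
cand <- merge(obs, curves, by = c("species", "age"),
              suffixes = c(".obs", ".curve"), allow.cartesian = TRUE)

# Per measurement, keep the candidate with the smallest absolute yield gap.
closest <- cand[, .SD[which.min(abs(yield.curve - yield.obs))], by = obs_id]
print(closest)
```

Because the intermediate table holds every measurement paired with every candidate curve, this approach may use noticeably more memory than a rolling join on large inputs.
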

2 votes  David F  7/9/2023  #2

You can also use data.table's join/rolling functionality:

library(data.table)
setDT(yield_db)
setDT(measurement_db)

yield_db[measurement_db, on=c('species', 'age', 'yield'), roll='nearest']

   species     s   age    yield yc_id   i.s i.yc_id
    <fctr> <int> <int>    <num> <int> <int>   <int>
1:       a    12    60 2375.364     3    12       3
2:       b    11    86 2035.079     5    11       5
3:       b    11    78 1845.943     5    12       6
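
As a minimal sketch of how the rolling join treats the `on` columns: every column except the last is matched exactly, and only the last one is rolled to the nearest value. The toy tables `curves` and `obs` and the helper column `curve_yield` below are assumptions made for illustration, not part of the original data.

```r
library(data.table)

# Hypothetical toy data: two curves for species "a" and one for species "b".
curves <- data.table(
  species = c("a", "a", "b"),
  age     = c(10L, 10L, 10L),
  yield   = c(100, 200, 150),
  yc_id   = c(1L, 2L, 3L)
)
obs <- data.table(species = "a", age = 10L, yield = 160)

# Keep a copy of the curve yield: in the result, the join column `yield`
# carries the value from `obs`, not the matched curve value.
curves[, curve_yield := yield]

# Exact match on species and age; nearest match on yield (the last `on` column).
matched <- curves[obs, on = c("species", "age", "yield"), roll = "nearest"]
print(matched)
```

In this toy case the species "b" curve at 150 is numerically closest to 160, but it is excluded by the exact match on species, so the measurement is paired with the species "a" curve at 200.
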

Comments

0 votes  David  7/18/2023
Amazing - this gives the expected result and runs instantly at very large scale. Thanks.
1 vote  David F  7/20/2023
data.table operations are amazingly fast