提问人:Hack-R 提问时间:11/2/2015 最后编辑:Hack-R 更新时间:11/2/2015 访问量:71
如何在 R 中的 ID 级别应用 Marshall Palmer 函数?
How to apply Marshall Palmer function at ID level in R?
问:
我正在分析双偏振雷达数据,我想将 Marshall Palmer 关系的结果作为 ID 级变量添加到我的数据中。
我找不到 CRAN 函数,但另一个 R 用户有脚本,他在其中将该关系应用为数据中预期值的估计值:
# From Troy W. (thanks!)
# A few small changes by hack-r
## Someone better in R than me could probably clean up/refactor the code a bit.
library(dplyr)
library(data.table)
test <- fread('../input/test.csv')
mpalmer <- function(ref, minutes_past) {
# order reflectivity values and minutes_past
sort_min_index = order(minutes_past)
minutes_past <- minutes_past[sort_min_index]
ref <- ref[sort_min_index]
# calculate the length of time for which each reflectivity value is valid
valid_time <- rep(0, length(minutes_past))
valid_time[1] <- minutes_past[1]
if (length(valid_time) > 1) {
for (i in seq(2, length(minutes_past))) {
valid_time[i] <- minutes_past[i] - minutes_past[i-1]
}
valid_time[length(valid_time)] = valid_time[length(valid_time)] + 60 - sum(valid_time)
} else {
# if only 1 observation, make it valid for the entire hour
valid_time <- 60
}
valid_time = valid_time / 60
# calculate hourly rain rates using marshall-palmer weighted by valid times
sum <- 0
for (i in seq(length(ref))) {
if (!is.na(ref[i])) {
mmperhr <- ((10^(ref[i]/10))/200) ^ 0.625
sum <- sum + mmperhr * valid_time[i]
}
}
return(sum)
}
results <- test %>% group_by(Id) %>% summarize(Expected=sum)
write.csv(results, file='sample_solution.csv', row.names=FALSE)
除了大数据的速度非常慢之外,上述代码的问题在于它不会在原始数据中创建一列结果,这将允许它在 ETL 管道中生产化,该管道在 ID 级别创建此关系作为数据集中的 1 个预测变量。
我尝试像这样重写函数:
mpalmer <- function(ref, minutes_past) {
# Credit to Troy for this
# edits by Jason Miller, hack-r.com
# order reflectivity values and minutes_past
sort_min_index = order(minutes_past)
minutes_past <- minutes_past[sort_min_index]
ref <- ref[sort_min_index]
# calculate the length of time for which each reflectivity value is valid
valid_time <- rep(0, length(minutes_past))
valid_time[1] <- minutes_past[1]
if (length(valid_time) > 1) {
for (i in seq(2, length(minutes_past))) {
valid_time[i] <- minutes_past[i] - minutes_past[i-1]
}
valid_time[length(valid_time)] = valid_time[length(valid_time)] + 60 - sum(valid_time)
} else {
# if only 1 observation, make it valid for the entire hour
valid_time <- 60
}
valid_time = valid_time / 60
# calculate hourly rain rates using marshall-palmer weighted by valid times
sum <- 0
for (i in seq(length(ref))) {
if (!is.na(ref[i])) {
mmperhr <- ((10^(ref[i]/10))/200) ^ 0.625
sum <- sum + mmperhr * valid_time[i]
}
}
return(sum)
}
然后像这样应用它:
train.samp$mp <- aggregate(train.samp$Ref, by=list(train.samp$Id), FUN = mpalmer, minutes_past = train.samp$minutes_past)
我认为这大部分都有效,但是在运行很长时间后,它返回了如下错误:
Error in `$<-.data.frame`(`*tmp*`, "mp", value = list(Group.1 = c(10L, :
replacement has 9765 rows, data has 10000
我已经在不同的数据样本上尝试过,错误消息始终采用该格式,尽管具体数字可能会发生变化。数据集中没有缺失的数据。
知道如何修复此功能(和/或使其更快)吗?
更新:我已经让它与 for 循环一起工作,但它太慢了......
答:
0赞
Hack-R
11/2/2015
#1
这就是我现在要做的事情,但我仍然对其他解决方案持开放态度。
基本上,我回到了原来的函数,然后将过大的数据集分解成可管理的块,并在每个块上运行循环:
train.samp <- train.samp[order(train.samp$Id),]
train.samp1 <- train.samp1[order(train.samp1$Id),]
train.samp.id <- unique(train.samp$Id)
train.samp.id.1 <- train.samp.id[1:25000]
train.samp.id.2 <- train.samp.id[25001:50000]
train.samp.id.3 <- train.samp.id[50001:75000]
train.samp.id.4 <- train.samp.id[75001:100000]
train.samp.id.6 <- train.samp.id[100001:125000]
train.samp.id.5 <- train.samp.id[125001:150000]
train.samp.id.7 <- train.samp.id[150001:175000]
train.samp.id.8 <- train.samp.id[175001:200000]
train.samp.id.9 <- train.samp.id[200001:length(train.samp.id)]
train.samp.1 <- train.samp[train.samp$Id %in% train.samp.id.1,]
train.samp.2 <- train.samp[train.samp$Id %in% train.samp.id.2,]
train.samp.3 <- train.samp[train.samp$Id %in% train.samp.id.3,]
train.samp.4 <- train.samp[train.samp$Id %in% train.samp.id.4,]
train.samp.5 <- train.samp[train.samp$Id %in% train.samp.id.5,]
train.samp.6 <- train.samp[train.samp$Id %in% train.samp.id.6,]
train.samp.7 <- train.samp[train.samp$Id %in% train.samp.id.7,]
train.samp.8 <- train.samp[train.samp$Id %in% train.samp.id.8,]
train.samp.9 <- train.samp[train.samp$Id %in% train.samp.id.9,]
system.time(
for(i in unique(train.samp.1$Id)){
train.samp.1$mp[train.samp.1$Id == i] <- mpalmer(train.samp.1$Ref[train.samp.1$Id == i], minutes_past = train.samp.1$minutes_past[train.samp.1$Id == i])
} )
for(i in unique(train.samp$Id[train.samp$Id %in% train.samp.id.2])){
train.samp$mp[train.samp$Id == i] <- mpalmer(train.samp$Ref[train.samp$Id == i], minutes_past = train.samp$minutes_past[train.samp$Id == i])
}
for(i in unique(train.samp$Id[train.samp$Id %in% train.samp.id.3])){
train.samp$mp[train.samp$Id == i] <- mpalmer(train.samp$Ref[train.samp$Id == i], minutes_past = train.samp$minutes_past[train.samp$Id == i])
}
for(i in unique(train.samp$Id[train.samp$Id %in% train.samp.id.4])){
train.samp$mp[train.samp$Id == i] <- mpalmer(train.samp$Ref[train.samp$Id == i], minutes_past = train.samp$minutes_past[train.samp$Id == i])
}
for(i in unique(train.samp$Id[train.samp$Id %in% train.samp.id.5])){
train.samp$mp[train.samp$Id == i] <- mpalmer(train.samp$Ref[train.samp$Id == i], minutes_past = train.samp$minutes_past[train.samp$Id == i])
}
for(i in unique(train.samp$Id[train.samp$Id %in% train.samp.id.6])){
train.samp$mp[train.samp$Id == i] <- mpalmer(train.samp$Ref[train.samp$Id == i], minutes_past = train.samp$minutes_past[train.samp$Id == i])
}
for(i in unique(train.samp$Id[train.samp$Id %in% train.samp.id.7])){
train.samp$mp[train.samp$Id == i] <- mpalmer(train.samp$Ref[train.samp$Id == i], minutes_past = train.samp$minutes_past[train.samp$Id == i])
}
for(i in unique(train.samp$Id[train.samp$Id %in% train.samp.id.8])){
train.samp$mp[train.samp$Id == i] <- mpalmer(train.samp$Ref[train.samp$Id == i], minutes_past = train.samp$minutes_past[train.samp$Id == i])
}
for(i in unique(train.samp$Id[train.samp$Id %in% train.samp.id.9])){
train.samp$mp[train.samp$Id == i] <- mpalmer(train.samp$Ref[train.samp$Id == i], minutes_past = train.samp$minutes_past[train.samp$Id == i])
}
system.time(
for(i in unique(train.samp1$Id)){
train.samp1$mp[train.samp1$Id == i] <- mpalmer(train.samp1$Ref[train.samp1$Id == i], minutes_past = train.samp1$minutes_past[train.samp1$Id == i])
}
此处未显示该功能,但我即将利用@Gregor在评论中的建议。
评论
for
valid_time[i] <- minutes_past[i] - minutes_past[i-1]
sum
sum = sum(((10^(ref/10))/200) ^ 0.625 * valid_time, na.rm = T)
for
for