如何在 R 中的 ID 级别应用 Marshall Palmer 函数?

How to apply Marshall Palmer function at ID level in R?

提问人:Hack-R 提问时间:11/2/2015 最后编辑:Hack-R 更新时间:11/2/2015 访问量:71

问:

我正在分析双偏振雷达数据,我想将 Marshall Palmer 关系的结果作为 ID 级变量添加到我的数据中。

我找不到 CRAN 函数,但另一个 R 用户有脚本,他在其中将该关系应用为数据中预期值的估计值:

# From Troy W. (thanks!)
# A few small changes by hack-r

## Someone better in R than me could probably clean up/refactor the code a bit.


library(dplyr)
library(data.table)

test <- fread('../input/test.csv')


mpalmer <- function(ref, minutes_past) {

  # order reflectivity values and minutes_past
  sort_min_index = order(minutes_past)
  minutes_past <- minutes_past[sort_min_index]
  ref <- ref[sort_min_index]

  # calculate the length of time for which each reflectivity value is valid
  valid_time <- rep(0, length(minutes_past))
  valid_time[1] <- minutes_past[1]
  if (length(valid_time) > 1) {
    for (i in seq(2, length(minutes_past))) {
      valid_time[i] <- minutes_past[i] - minutes_past[i-1]
    }
    valid_time[length(valid_time)] = valid_time[length(valid_time)] + 60 - sum(valid_time)
  } else {
    # if only 1 observation, make it valid for the entire hour
    valid_time <- 60
  }

  valid_time = valid_time / 60

  # calculate hourly rain rates using marshall-palmer weighted by valid times
  sum <- 0
  for (i in seq(length(ref))) {
    if (!is.na(ref[i])) {
      mmperhr <- ((10^(ref[i]/10))/200) ^ 0.625
      sum <- sum + mmperhr * valid_time[i]
    }
  }

  return(sum)

}
results <- test %>% group_by(Id) %>% summarize(Expected=sum)
write.csv(results, file='sample_solution.csv', row.names=FALSE)

除了大数据的速度非常慢之外,上述代码的问题在于它不会在原始数据中创建一列结果,这将允许它在 ETL 管道中生产化,该管道在 ID 级别创建此关系作为数据集中的 1 个预测变量。

我尝试像这样重写函数:

mpalmer <- function(ref, minutes_past) {
  # Credit to Troy for this
  # edits by Jason Miller, hack-r.com

  # order reflectivity values and minutes_past
  sort_min_index = order(minutes_past)
  minutes_past <- minutes_past[sort_min_index]
  ref <- ref[sort_min_index]

  # calculate the length of time for which each reflectivity value is valid
  valid_time <- rep(0, length(minutes_past))
  valid_time[1] <- minutes_past[1]
  if (length(valid_time) > 1) {
    for (i in seq(2, length(minutes_past))) {
      valid_time[i] <- minutes_past[i] - minutes_past[i-1]
    }
    valid_time[length(valid_time)] = valid_time[length(valid_time)] + 60 - sum(valid_time)
  } else {
    # if only 1 observation, make it valid for the entire hour
    valid_time <- 60
  }

  valid_time = valid_time / 60

  # calculate hourly rain rates using marshall-palmer weighted by valid times
  sum <- 0
  for (i in seq(length(ref))) {
    if (!is.na(ref[i])) {
      mmperhr <- ((10^(ref[i]/10))/200) ^ 0.625
      sum <- sum + mmperhr * valid_time[i]
    }
  }

  return(sum)

}

然后像这样应用它:

train.samp$mp <- aggregate(train.samp$Ref, by=list(train.samp$Id), FUN = mpalmer, minutes_past = train.samp$minutes_past)

我认为这大部分都有效,但是在运行很长时间后,它返回了如下错误:

Error in `$<-.data.frame`(`*tmp*`, "mp", value = list(Group.1 = c(10L,  : 
  replacement has 9765 rows, data has 10000

我已经在不同的数据样本上尝试过,错误消息始终采用该格式,尽管具体数字可能会发生变化。数据集中没有缺失的数据。

知道如何修复此功能(和/或使其更快)吗?

更新:我已经让它与 for 循环一起工作,但它慢了......

R 物理

评论

0赞 Gregor Thomas 11/2/2015
一个快速解决方法:将循环替换为 valid_time[2:length(minutes_past)] = diff(minutes_past)'。forvalid_time[i] <- minutes_past[i] - minutes_past[i-1]
0赞 Hack-R 11/2/2015
@Gregor我验证了这有效,并且在大型测试数据集上将分析速度提高了约 3% - 4%。谢谢你!
0赞 Gregor Thomas 11/2/2015
看起来最后的计算可以矢量化,也许而不是循环?sumsum = sum(((10^(ref/10))/200) ^ 0.625 * valid_time, na.rm = T)for
0赞 Hack-R 11/2/2015
@Gregor 好主意。我今天下班后会探讨这个问题,看看我是否能摆脱这个循环。如果您想发布该功能的另一个版本以及您的改进,那么我很乐意将其标记为解决方案,因为您是唯一帮助我的人。for

答:

0赞 Hack-R 11/2/2015 #1

这就是我现在要做的事情,但我仍然对其他解决方案持开放态度。

基本上,我回到了原来的函数,然后将过大的数据集分解成可管理的块,并在每个块上运行循环:

  train.samp      <- train.samp[order(train.samp$Id),]
  train.samp1     <- train.samp1[order(train.samp1$Id),]

  train.samp.id    <- unique(train.samp$Id)
  train.samp.id.1  <- train.samp.id[1:25000]
  train.samp.id.2  <- train.samp.id[25001:50000]
  train.samp.id.3  <- train.samp.id[50001:75000]
  train.samp.id.4  <- train.samp.id[75001:100000]
  train.samp.id.6  <- train.samp.id[100001:125000]
  train.samp.id.5  <- train.samp.id[125001:150000]
  train.samp.id.7  <- train.samp.id[150001:175000]
  train.samp.id.8  <- train.samp.id[175001:200000]
  train.samp.id.9  <- train.samp.id[200001:length(train.samp.id)]

  train.samp.1 <- train.samp[train.samp$Id %in% train.samp.id.1,]
  train.samp.2 <- train.samp[train.samp$Id %in% train.samp.id.2,]
  train.samp.3 <- train.samp[train.samp$Id %in% train.samp.id.3,]
  train.samp.4 <- train.samp[train.samp$Id %in% train.samp.id.4,]
  train.samp.5 <- train.samp[train.samp$Id %in% train.samp.id.5,]
  train.samp.6 <- train.samp[train.samp$Id %in% train.samp.id.6,]
  train.samp.7 <- train.samp[train.samp$Id %in% train.samp.id.7,]
  train.samp.8 <- train.samp[train.samp$Id %in% train.samp.id.8,]
  train.samp.9 <- train.samp[train.samp$Id %in% train.samp.id.9,]

  system.time(
  for(i in unique(train.samp.1$Id)){
    train.samp.1$mp[train.samp.1$Id == i] <- mpalmer(train.samp.1$Ref[train.samp.1$Id == i], minutes_past = train.samp.1$minutes_past[train.samp.1$Id == i])
  }    )
  for(i in unique(train.samp$Id[train.samp$Id %in% train.samp.id.2])){
    train.samp$mp[train.samp$Id == i] <- mpalmer(train.samp$Ref[train.samp$Id == i], minutes_past = train.samp$minutes_past[train.samp$Id == i])
  }    
  for(i in unique(train.samp$Id[train.samp$Id %in% train.samp.id.3])){
    train.samp$mp[train.samp$Id == i] <- mpalmer(train.samp$Ref[train.samp$Id == i], minutes_past = train.samp$minutes_past[train.samp$Id == i])
  }    
  for(i in unique(train.samp$Id[train.samp$Id %in% train.samp.id.4])){
    train.samp$mp[train.samp$Id == i] <- mpalmer(train.samp$Ref[train.samp$Id == i], minutes_past = train.samp$minutes_past[train.samp$Id == i])
  }    
  for(i in unique(train.samp$Id[train.samp$Id %in% train.samp.id.5])){
    train.samp$mp[train.samp$Id == i] <- mpalmer(train.samp$Ref[train.samp$Id == i], minutes_past = train.samp$minutes_past[train.samp$Id == i])
  }    
  for(i in unique(train.samp$Id[train.samp$Id %in% train.samp.id.6])){
    train.samp$mp[train.samp$Id == i] <- mpalmer(train.samp$Ref[train.samp$Id == i], minutes_past = train.samp$minutes_past[train.samp$Id == i])
  }    
  for(i in unique(train.samp$Id[train.samp$Id %in% train.samp.id.7])){
    train.samp$mp[train.samp$Id == i] <- mpalmer(train.samp$Ref[train.samp$Id == i], minutes_past = train.samp$minutes_past[train.samp$Id == i])
  }    
  for(i in unique(train.samp$Id[train.samp$Id %in% train.samp.id.8])){
    train.samp$mp[train.samp$Id == i] <- mpalmer(train.samp$Ref[train.samp$Id == i], minutes_past = train.samp$minutes_past[train.samp$Id == i])
  }    
  for(i in unique(train.samp$Id[train.samp$Id %in% train.samp.id.9])){
    train.samp$mp[train.samp$Id == i] <- mpalmer(train.samp$Ref[train.samp$Id == i], minutes_past = train.samp$minutes_past[train.samp$Id == i])
  }    

system.time(  
  for(i in unique(train.samp1$Id)){
    train.samp1$mp[train.samp1$Id == i] <- mpalmer(train.samp1$Ref[train.samp1$Id == i], minutes_past = train.samp1$minutes_past[train.samp1$Id == i])
  }   

此处未显示该功能,但我即将利用@Gregor在评论中的建议。