Solutions for handling large-scale climate dataset in R

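Question:

I am analyzing ERA5 climate data from 1950 to 2023, covering 73 years. The dataset has a daily temporal resolution and a spatial resolution of 0.25° (1440 x 720).

Each year of data is stored in one NetCDF file containing approximately 365 layers.

Trying to combine these into a single stack results in approximately 27,000 layers, which overwhelms my computer, and even the cluster at my workplace struggles with this amount.

As I am new to handling large-scale datasets, I am looking for advice on best practices and efficient solutions for managing and processing such data.

Note: I prefer working with data frames rather than NetCDF files, because I will be performing time series analysis.

Note 2: Different types of analysis will be performed in the future. The range is very broad: from simple statistics to index calculations and so on, all at the grid-cell level.

Note 3: I have heard about processing data in smaller chunks. Is there a good package in R for doing so?

r raster netcdf chunks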

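An obvious option for breaking up the data set is to sample the time series by location, pixel by pixel. That is a bad choice, however, because your data is spread over 73 files and the storage organisation within each file makes this very slow (the values for a single pixel are scattered throughout each file). It is an I/O nightmare; don't even consider it.

Temporal slicing

Your best bet is to process each file individually and then combine the yearly outputs into one data object covering the whole time series. If you go from daily data to monthly statistics, you will have about 900 million data points (1440 x 720 x 12 x 73) per statistic (e.g. monthly maximum T, monthly minimum T). That is still substantial, but doable on a good computer. It would look something like this: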

# Attach necessary packages
library(ncdf4)
library(CFtime)
library(abind)

# Get a list of your 73 ERA5 files
lf <- list.files("~/your/path/to/ERA5", "\\.nc$", full.names = TRUE)

# Open the files sequentially and process
mon <- lapply(lf, function(fn) {
  # Open the file and create a CFtime object
  nc <- nc_open(fn)
  cf <- CFtime(nc$dim$time$units, nc$dim$time$calendar, nc$dim$time$vals)

  # Create a factor to make monthly statistics
  fac <- CFfactor(cf, "month")

  # Read the daily data, then reduce it to the monthly maximum T.
  # apply() returns the months first, so aperm() restores (lon, lat, month).
  data <- ncvar_get(nc, "t2m")
  mon_max <- aperm(apply(data, 1:2, tapply, fac, max), c(2, 3, 1))
  dimnames(mon_max) <- list(as.vector(nc$dim$longitude$vals),
                            as.vector(nc$dim$latitude$vals),
                            levels(fac))
  nc_close(nc)
  mon_max
})

# `mon` is a list with 73 elements, each with 12 layers of monthly max T
# Now put them all into 1 object
all_mon <- abind(mon, along = 3)

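This still gives you a very large object, but at least it is manageable. If you want a different statistic, you can write your own function and pass it in place of max: aperm(apply(data, 1:2, tapply, fac, <<<here>>>), c(2, 3, 1))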
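As a rough illustration of swapping in a custom statistic, the sketch below runs the same apply/tapply/aperm pattern on a tiny synthetic array, so it needs no ERA5 files. The array values, the hand-built month factor (standing in for CFfactor(cf, "month")), the function name hot_days and the 295 K threshold are all assumptions made up for this example.

```r
# Synthetic stand-in for one year of daily data: 4 lon x 3 lat x 365 days.
# Real data would come from ncvar_get(); these values are invented.
set.seed(42)
data <- array(rnorm(4 * 3 * 365, mean = 285, sd = 10), dim = c(4, 3, 365))

# Month label for each day of a non-leap year (stand-in for the CFtime factor)
days_in_month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
fac <- factor(rep(sprintf("%02d", 1:12), times = days_in_month))

# Hypothetical custom statistic: number of days in the month above a threshold
hot_days <- function(x, threshold = 295) sum(x > threshold)

# Same pattern as in the answer, with hot_days in place of max
res <- aperm(apply(data, 1:2, tapply, fac, hot_days), c(2, 3, 1))

dim(res)  # 4 3 12: lon, lat, month
```

Any function that takes a numeric vector and returns a single value should fit in the same slot.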

Yes, this is the basic approach to processing data in smaller chunks. If you stay at the daily resolution then your data set never shrinks, so every parameter you derive will have a total size of 50GB - 100GB. You are then probably better off storing the derived parameters per year, to keep the files manageable. Weeks have additional complications of definition (start on Monday or Sunday? which is week 1 in the year?) and they straddle year boundaries, so I'd steer clear if you can.
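The sizes quoted in this thread can be checked with some quick arithmetic. This is only a back-of-the-envelope sketch: the grid and the 73 years come from the question, leap days are ignored, and the bytes-per-value figures (2 for packed integers, 4 for single-precision floats, 8 for R doubles) are assumptions about how the values might be stored.

```r
# Grid and period as stated in the question
n_lon   <- 1440
n_lat   <- 720
n_years <- 73

cells          <- n_lon * n_lat
monthly_values <- cells * 12 * n_years    # one value per cell per month
daily_values   <- cells * 365 * n_years   # one value per cell per day

# Size in GB (10^9 bytes) for a given storage width per value
gb <- function(n_values, bytes_per_value) n_values * bytes_per_value / 1e9

gb(monthly_values, 8)  # monthly statistic held in R as doubles
gb(daily_values, 2)    # daily parameter stored as 2-byte packed integers
gb(daily_values, 4)    # daily parameter stored as 4-byte floats
gb(daily_values, 8)    # daily parameter held in R as doubles
```

Under these assumptions a monthly statistic is roughly 7 GB in memory, while a daily parameter ranges from about 55 GB (2-byte storage) to about 110 GB (4-byte storage), and roughly twice that again as R doubles.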

0 upvotes Shunrei 11/16/2023

What about creating data cubes with defined dividing grid coordinates ?

0 upvotes Patrick 11/17/2023

Yes, you could easily implement your own data cubes by dividing up the lon-lat plane into, say, 10 x 10 degree tiles of all data per year. That would then become a mix of temporal slicing and sampling by location: I/O will be less optimal but on the CPU things may be better due to less RAM load.
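A minimal sketch of the tiling idea, assuming the 1440 x 720 grid from the question, a dimension order of (longitude, latitude, time) and a variable called "t2m". The helper read_tile is hypothetical; it only wraps the start and count arguments of ncdf4's ncvar_get, where a count of -1 reads everything along that dimension.

```r
# Grid as described in the question: 0.25 degree cells, 1440 x 720
n_lon <- 1440
n_lat <- 720
cells_per_tile <- 10 / 0.25   # 40 cells along each side of a 10 degree tile

# Index of the first cell of every tile along each axis
lon_start <- seq(1, n_lon, by = cells_per_tile)
lat_start <- seq(1, n_lat, by = cells_per_tile)

# One row per tile, holding the start indices of that tile
tiles <- expand.grid(lon_start = lon_start, lat_start = lat_start)
nrow(tiles)  # 36 * 18 = 648 tiles

# Hypothetical helper: read all time steps of one tile from an open file.
# Assumes the variable is "t2m" with dimensions (longitude, latitude, time).
read_tile <- function(nc, tile) {
  ncdf4::ncvar_get(nc, "t2m",
                   start = c(tile$lon_start, tile$lat_start, 1),
                   count = c(cells_per_tile, cells_per_tile, -1))
}
```

With a file opened through nc_open(), a call such as read_tile(nc, tiles[1, ]) would return a 40 x 40 x (days in year) array for the first tile, which could then be processed in the same way as the full-year array.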

0 upvotes Shunrei 11/17/2023

Do you recommend a way to be able to do it on nc files and transform them into raster stack cubes ?
