Passing Large data.frames to Apply-like Functions Efficiently in R

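Q:

This is a question about resources and efficiency when passing large datasets to apply-like functions.

Example

[Edit: changed the example and description to illustrate the use of multiple tables and the calculation steps, per @Uwe's comment]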

library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.0.5 ...[snip]...

set.seed(10)

# Objective: Add period-cost per portion of person's selected fruit 

df.A <- data.frame(
  period = rep(1:3, times = 1, each = 4),
  Name = rep(c("John", "Paul","Ringo", "George"), times = 3),
  Fruit = sample(c("Apple", "Pear", "Banana", "Apple"), size = 12, replace = TRUE)
) # extend to many people, many periods, many fruit

df.B <- data.frame(
  Fruit = c("Pear", "Apple", "Banana"), 
  id = c(1, 2, 3),
  pound.per.portion = c(0.396832,0.440925,0.299829)
) # one entry per fruit

df.C <- data.frame(
  id = rep(1:3, times=3),
  period = rep(1:3, times = 1, each = 3),
  price.pound = c(2.33, 0.99, 2.15, 2.38, 1.01, 2.20, 2.42, 1.04, 2.25)
) # one entry per fruit per period

df.A
#>    period   Name  Fruit
#> 1       1   John Banana
#> 2       1   Paul  Apple
#> 3       1  Ringo   Pear
#> 4       1 George  Apple
#> 5       2   John  Apple
#> 6       2   Paul Banana
#> 7       2  Ringo  Apple
#> 8       2 George   Pear
#> 9       3   John Banana
#> 10      3   Paul Banana
#> 11      3  Ringo Banana
#> 12      3 George  Apple
df.B
#>    Fruit id pound.per.portion
#> 1   Pear  1          0.396832
#> 2  Apple  2          0.440925
#> 3 Banana  3          0.299829
df.C
#>   id period price.pound
#> 1  1      1        2.33
#> 2  2      1        0.99
#> 3  3      1        2.15
#> 4  1      2        2.38
#> 5  2      2        1.01
#> 6  3      2        2.20
#> 7  1      3        2.42
#> 8  2      3        1.04
#> 9  3      3        2.25

df.A$portion.price <- apply(df.A, MARGIN = 1, 
                           FUN = function(x, legend, prices){
                             # please ignore efficiency of this function
                             # the internal function is not the focus of the question
                             fruit.info <- df.B[df.B$Fruit == x[["Fruit"]],]

                             cost <- df.C %>% 
                               filter(period == x[["period"]],
                                      id == fruit.info[["id"]]) %>%
                               select(price.pound) %>%
                               `*`(fruit.info$pound.per.portion)
                             cost[[1]]
                           }, 
                           legend = df.B, prices = df.C) 
                            # Question relates to passing of legend and prices
                            # if `apply` passes df.B and df.C many times
                            # and df.B, df.C are large - is this inefficient, is there a better way

head(df.A, 5)  
#>   period   Name  Fruit portion.price
#> 1      1   John Banana     0.6446323
#> 2      1   Paul  Apple     0.4365157
#> 3      1  Ringo   Pear     0.9246186
#> 4      1 George  Apple     0.4365157
#> 5      2   John  Apple     0.4453343

Created on 2022-05-20 by the reprex package (v2.0.1)

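The objective in this example is to add a column to df.A showing the cost per portion of the fruit selected by a particular person in a particular period.

There are three datasets, and none of them alone provides all the information needed for the calculation.

df.A contains the person, the period and the name of the fruit they selected. There is one entry per person per period.

df.C provides the price of each fruit by period, but the price is expressed per pound rather than per portion, and the dataset identifies the fruit only by ID number, not by name. There is one entry per fruit per period.

df.B provides the missing information. First, it defines how df.C$id relates to the fruit names in df.A$Fruit, and it provides a factor for converting the cost per pound into the cost per portion. Only one entry per fruit is required.

For each row of df.A, apply passes the person's fruit name and period, together with the two reference datasets (df.B and df.C), to the function. The function looks up the necessary information in df.B, pulls the referenced data from df.C, and then uses both to calculate the cost per portion, which it returns.

The function itself is not important to this question; it merely illustrates using multiple datasets to look up a value for each row.

This example is simple (four people, three periods, three fruits) and easily managed by apply; in theory, however, each of these datasets could contain thousands of rows.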

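Discussion topic begins here

If I understand correctly, R passes by value rather than by reference. I believe this means that the apply function in the example above creates a new copy of df.B and df.C for each row of df.A. Assuming that is correct, this feels inefficient, especially if the datasets are large.

Is there a better solution than apply for this kind of lookup/processing when working with large datasets?

I know that Rcpp functions can use references rather than values. Would one build a custom Rcpp apply function that only uses references, or is there a standard off-the-shelf approach?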

Tags: r, loops, memory, parameter-passing

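Interesting question. However, could you please provide a more complex function where the answer cannot be derived by a simple join? Have you looked at data.table? It might be a way to go, as it is designed for large datasets. It allows arbitrary code to be executed on the joined columns. It also has the by = .EACHI parameter, which allows joining and aggregating at the same time. But I need to see a less simple use case to understand your question. Thank you.

0 upvotes AWaddington 5/20/2022

@Uwe - I have added more detail to the example.

1 upvote Uwe 5/21/2022

Thank you for updating the question. It seems that joining is the better approach, in particular for large data.frames, rather than using apply(), which is better suited to matrices and arrays. Please find more details in my answer below.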

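A:

2 upvotes Uwe 5/21/2022 #1

Using the apply() function with data.frames has a major drawback: apply() coerces the data.frame to a matrix before proceeding (see section 8.2.38 of Patrick Burns' The R Inferno).

Since all elements of a matrix must have the same type, all columns of the data.frame are coerced to one common data type.

This can be verified by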

apply(df.A, MARGIN = 2, str)

 chr [1:12] "1" "1" "1" "1" "2" "2" "2" "2" "3" "3" "3" "3"
 chr [1:12] "John" "Paul" "Ringo" "George" "John" "Paul" "Ringo" "George" "John" "Paul" "Ringo" "George"
 chr [1:12] "Banana" "Apple" "Pear" "Apple" "Apple" "Banana" "Apple" "Pear" "Banana" "Banana" "Banana" ...

Here, the integer column period has been coerced to type character as well. This is expensive and may create a copy of all the data.
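On the copying concern raised in the question, a minimal sketch may help. R uses copy-on-modify semantics, so a data.frame handed to a function (including through the extra arguments of apply()) is not duplicated unless the function modifies it. The snippet uses base R's tracemem(), which only works when R was built with memory profiling, hence the capabilities() guard. The names big, read_only and modifies are hypothetical and exist only for this illustration.

```r
# Illustrative data; `big`, `read_only` and `modifies` are hypothetical names
big <- data.frame(x = rep(1, 1e6))

read_only <- function(d) nrow(d)   # only reads its argument
modifies <- function(d) {          # changes its local version of the argument
  d$x[1] <- 0
  nrow(d)
}

# Ask R to report whenever `big` is duplicated (if memory profiling is available)
if (capabilities("profmem")) invisible(tracemem(big))

# Five calls that only read `big`: no tracemem message is expected here
n_read <- vapply(1:5, function(i) read_only(big), integer(1))

# Modifying the argument inside the function triggers a copy,
# which tracemem reports with one or more messages
n_mod <- modifies(big)

if (capabilities("profmem")) untracemem(big)

# The caller's object is untouched because the modification hit a copy
unchanged <- big$x[1] == 1
```

Read this way, the per-row cost of the original apply() call comes mainly from the matrix coercion and the repeated lookups inside the function, not from handing df.B and df.C to it.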

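So, what can we do to achieve the OP's goal?

The objective in this example is to add a column to df.A showing the cost per portion of the fruit selected by a particular person in a particular period.

IMHO, the best way to achieve the goal is to join twice.

First, create a lookup table lut which contains the portion.price for each Fruit and period. Then, use an update join to append the column to df.A: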

library(data.table)
lut <- setDT(df.B)[df.C, on = .(id)][, portion.price := pound.per.portion * price.pound][]
setDT(df.A)[lut, on = .(Fruit, period), portion.price := i.portion.price][]

    period   Name  Fruit portion.price
 1:      1   John Banana     0.6446323
 2:      1   Paul  Apple     0.4365157
 3:      1  Ringo   Pear     0.9246186
 4:      1 George  Apple     0.4365157
 5:      2   John  Apple     0.4453343
 6:      2   Paul Banana     0.6596238
 7:      2  Ringo  Apple     0.4453343
 8:      2 George   Pear     0.9444602
 9:      3   John Banana     0.6746152
10:      3   Paul Banana     0.6746152
11:      3  Ringo Banana     0.6746152
12:      3 George  Apple     0.4585620

data.table is designed to handle large datasets efficiently.
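Since the question already loads dplyr, the same two-step join could also be sketched with dplyr verbs. This is an illustrative alternative and not part of the original answer. The data are copied from the question, with the Fruit column hard-coded from the printed df.A so the result does not depend on the random number generator; res is a hypothetical name for the result.

```r
library(dplyr)

# Data copied from the question; Fruit is hard-coded from the printed df.A
df.A <- data.frame(
  period = rep(1:3, each = 4),
  Name   = rep(c("John", "Paul", "Ringo", "George"), times = 3),
  Fruit  = c("Banana", "Apple", "Pear", "Apple", "Apple", "Banana",
             "Apple", "Pear", "Banana", "Banana", "Banana", "Apple")
)
df.B <- data.frame(
  Fruit = c("Pear", "Apple", "Banana"),
  id = c(1, 2, 3),
  pound.per.portion = c(0.396832, 0.440925, 0.299829)
)
df.C <- data.frame(
  id = rep(1:3, times = 3),
  period = rep(1:3, each = 3),
  price.pound = c(2.33, 0.99, 2.15, 2.38, 1.01, 2.20, 2.42, 1.04, 2.25)
)

# Step 1: lookup table with one portion.price per Fruit and period
lut <- df.B %>%
  inner_join(df.C, by = "id") %>%
  mutate(portion.price = pound.per.portion * price.pound) %>%
  select(Fruit, period, portion.price)

# Step 2: append the column to df.A; left_join() keeps every row of df.A
res <- left_join(df.A, lut, by = c("Fruit", "period"))
```

Unlike the data.table update join, left_join() returns a new data.frame instead of modifying df.A in place, so the result has to be assigned.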

Alternatively, SQL can be used:

sqldf::sqldf("
select period, Name, Fruit, `portion.price` from `df.A` 
  left join (
    select Fruit, period, 
      `pound.per.portion` * `price.pound` as `portion.price` from `df.B`  
      join `df.C` using(id) 
       ) using(period, Fruit)
")

   period   Name  Fruit portion.price
1       1   John Banana     0.6446323
2       1   Paul  Apple     0.4365157
3       1  Ringo   Pear     0.9246186
4       1 George  Apple     0.4365157
5       2   John  Apple     0.4453343
6       2   Paul Banana     0.6596238
7       2  Ringo  Apple     0.4453343
8       2 George   Pear     0.9444602
9       3   John Banana     0.6746152
10      3   Paul Banana     0.6746152
11      3  Ringo Banana     0.6746152
12      3 George  Apple     0.4585620

Note that some table and column names are enclosed in backticks because the period has a special meaning in SQL.

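Thanks @Uwe - looking at the documentation for setDT, I see that this solution directly addresses my concern about copying datasets: "In data.table parlance, all set* functions change their input by reference. That is, no copy is made at all, other than temporary working memory, which is as large as one column."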
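The by-reference behaviour described in that quote can be checked with a small sketch. It uses data.table's address() and copy() helpers on a hypothetical table dt (not one of the tables from the question) and compares the object's memory address before and after adding a column with :=.

```r
library(data.table)

# `dt` is a hypothetical table for illustration
dt <- data.table(id = 1:3, price = c(2.33, 0.99, 2.15))

addr_before <- address(dt)        # memory address of the data.table
dt[, price.double := price * 2]   # `:=` adds the column in place
addr_after <- address(dt)

# TRUE when the column was added without copying the table
same_object <- identical(addr_before, addr_after)

# For comparison, copy() makes an explicit deep copy at a new address
dt_copy <- copy(dt)
is_copy <- !identical(address(dt_copy), address(dt))
```

This is also why the update join in the answer needs no assignment back to df.A.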
