Evaluating `i` in data.table's `set()`

Q:

Imagine I am creating a new column in a `data.table` object:

require(data.table)
data(iris)
dt.iris <- data.table(iris)

dt.iris[,shortSpecies:=substr(Species,1,5)]

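Now, instead of using `:=` directly, I want a function that takes the code used to create the column as an argument and then evaluates it. I ended up with this: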

make_new_col <- function(inputDT, newColName, construction){
  set(
    x = inputDT, 
    j = newColName, 
    value=eval(expr = parse(text = construction), envir=inputDT)
    )
}

dt.iris <- make_new_col(
  inputDT = dt.iris, 
  newColName = 'shortSpecies', 
  construction = 'substr(Species,1,5)'
)

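This works, but now I want to add a condition, i.e. the equivalent of `dt.iris[Sepal.Length>5,shortSpecies:=substr(Species,1,5)]`. I need to somehow pass the condition to the `i=` part of `set()` so that it gets evaluated, but I can't find a workable solution.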

r data.table

make_new_col_cond <- function(inputDT, newColName, condition=NULL, construction){
  
  ret_i <- NULL
  ret_val <- eval(expr = parse(text = construction), envir=inputDT)
  
  if (!is.null(condition)) {
    ret_i <- which(eval(expr = parse(text=condition), envir=inputDT))
    ret_val <- ret_val[ret_i]
  }
  
  set(
    i = ret_i,
    x = inputDT, 
    j = newColName, 
    value = ret_val
    )
}

# usage:
dt.iris <- make_new_col_cond(
  inputDT = dt.iris, 
  condition = 'Sepal.Length>5',
  newColName = 'shortSpecies', 
  construction = 'substr(Species,1,5)'
)

It looks a bit ugly, but it gets the job done. Any suggestions for improvement would be appreciated.
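One way to check that the wrapper behaves like the plain `[` form is to run both on fresh copies of the data and compare the new column. This is a minimal sketch, assuming the data.table package is installed; `make_new_col_cond` is repeated from above so the block runs on its own, and `expected` and `actual` are illustrative names.

```r
library(data.table)

# Repeated from the answer above so the block runs on its own.
make_new_col_cond <- function(inputDT, newColName, condition = NULL, construction) {
  ret_i <- NULL
  ret_val <- eval(parse(text = construction), envir = inputDT)
  if (!is.null(condition)) {
    ret_i <- which(eval(parse(text = condition), envir = inputDT))
    ret_val <- ret_val[ret_i]
  }
  set(x = inputDT, i = ret_i, j = newColName, value = ret_val)
}

# Reference result using the plain `[` syntax.
expected <- as.data.table(iris)
expected[Sepal.Length > 5, shortSpecies := substr(Species, 1, 5)]

# Result from the wrapper, which modifies `actual` by reference.
actual <- as.data.table(iris)
make_new_col_cond(actual, "shortSpecies", "Sepal.Length>5", "substr(Species,1,5)")

# Compare the two new columns.
same_column <- identical(expected$shortSpecies, actual$shortSpecies)
print(same_column)
```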

0 votes jay.sf 11/11/2023 #2

You can supply a default condition `if` `cond` is `missing`.

> library(data.table)
> make_new_col3 <- function(inputDT, cond, newColName, construction) {
+   stopifnot(ncol(inputDT) > 0)
+   if (missing(cond)) cond <- "`mode<-`(inputDT[[1]], 'logical')"
+   set(
+     x=inputDT, 
+     i=(w <- which(eval(parse(text=cond), envir=inputDT))),
+     j=newColName, 
+     value=eval(parse(text=construction), envir=inputDT[w, ])
+   )
+ }

> dt.iris <- make_new_col3(
+   inputDT=dt.iris, 
+   cond='Sepal.Length > 5',
+   newColName='shortSpecies', 
+   construction='substr(Species, 1, 5)'
+ )
> head(dt.iris)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species shortSpecies
1:          5.1         3.5          1.4         0.2  setosa        setos
2:          4.9         3.0          1.4         0.2  setosa         <NA>
3:          4.7         3.2          1.3         0.2  setosa         <NA>
4:          4.6         3.1          1.5         0.2  setosa         <NA>
5:          5.0         3.6          1.4         0.2  setosa         <NA>
6:          5.4         3.9          1.7         0.4  setosa        setos

> dt.iris <- make_new_col3(
+   inputDT=dt.iris, 
+   # cond='Sepal.Length > 5',
+   newColName='shortSpecies', 
+   construction='substr(Species, 1, 5)'
+ )
> head(dt.iris)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species shortSpecies
1:          5.1         3.5          1.4         0.2  setosa        setos
2:          4.9         3.0          1.4         0.2  setosa        setos
3:          4.7         3.2          1.3         0.2  setosa        setos
4:          4.6         3.1          1.5         0.2  setosa        setos
5:          5.0         3.6          1.4         0.2  setosa        setos
6:          5.4         3.9          1.7         0.4  setosa        setos

Data:

dt.iris <- as.data.table(iris)

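@VasilyA You need to provide an appropriate substitute for the `text=` argument. `mode<-` converts the first column `dt.iris[[1]]` to logical and yields a vector `c(TRUE, TRUE, TRUE, ...)` (or `c(TRUE, NA, TRUE, ...)` if there are `NA`s). Therefore, every row gets a short name. `TRUE` should do the same thing, more concisely and probably faster.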
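The coercion used for the default condition can be checked on a small vector. This is a minimal sketch in base R; the vector `x` is made up for illustration. It suggests that the default selects every row only when the first column contains no zeros or missing values, since zeros coerce to `FALSE` and `which()` drops `NA`s. A condition that is `TRUE` for every element, such as `rep(TRUE, ...)`, would be one alternative.

```r
# Hypothetical first column containing a zero and a missing value.
x <- c(5.1, 0, NA, 4.7)

# `mode<-` coerces the vector to logical, as the default condition does.
as_lgl <- `mode<-`(x, "logical")
print(as_lgl)

# which() keeps only the positions that are TRUE.
selected <- which(as_lgl)
print(selected)

# A condition that is TRUE for every element selects all positions.
all_rows <- which(rep(TRUE, length(x)))
print(all_rows)
```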

0 votes Hieu Nguyen 11/13/2023 #3

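There is a question similar to yours here. You can read my answer there together with the comments by @jangorecki. My approach is to use metaprogramming/computing on the language to construct the desired `[` call: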

Let's start with the basics:

ll <- substitute(
    dt[cond, newColName := construction], # You make a template here with your standard data.table `[` syntax
    list(
        dt = quote(dt.iris), # Replace `dt` in the template with the code `dt.iris`
                             # Use `quote(dt.iris)` to tell R to treat `dt.iris`
                             # as a piece of code instead of the actual dt.iris variable/data.table
        cond = quote(Sepal.Length > 5), # Replace `cond` in the template with the code `Sepal.Length > 5`
        newColName = quote(shortSpecies), # Similar to above
        construction = quote(substr(Species, 1, 5)) # Similar to above
    )
)
cat(deparse(ll), "\n") # dt.iris[Sepal.Length > 5, `:=`(shortSpecies, substr(Species, 1, 5))]
eval(ll) # when you're satisfied with your constructed `[` call, run it with `eval` to get results

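As you can see, you are working at the level of R code/expressions (i.e. metaprogramming/computing on the language in R) rather than at the level of string manipulation (i.e. using string functions to build a string, then `parse` and `eval` it). These string-manipulation or `eval(parse())` approaches are usually ugly, inflexible (hard to adapt to changing requirements), and unsafe (vulnerable to code injection), so you should avoid them.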
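The two levels can be compared on a tiny example. This is a sketch in base R only; the names `template_call` and `string_call` are illustrative and not part of the answer. Both routes produce the same call here, but the first never turns code into text and back.

```r
# Build the call from code objects: each piece is already parsed R code.
template_call <- substitute(
  lhs > threshold,
  list(lhs = quote(Sepal.Length), threshold = 5)
)
print(class(template_call))    # a language object of class "call"
print(deparse(template_call))

# Build the call from a pasted string: the text is parsed at run time.
string_call <- parse(text = paste0("Sepal.Length", " > ", 5))[[1]]
print(deparse(string_call))

# Evaluate both against the built-in iris data frame and compare.
same_result <- identical(eval(template_call, iris), eval(string_call, iris))
print(same_result)
```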

Let's make a function:

make_new_col_expr <- function(inputDT, cond, newColName, construction) {
    ll <- substitute(
        dt[cond, newColName := construction], # Our template
        list(
            dt = substitute(inputDT), # See details below to understand `substitute` usage here
            cond = if (missing(cond)) substitute() else substitute(cond), # `substitute()` is missing, similar to `quote(expr =)`
            newColName = substitute(newColName),
            construction = substitute(construction)
        )
    )
    cat(deparse(ll), "\n")
    eval(ll)
}
make_new_col_expr(dt.iris, Sepal.Length > 5, shortSpecies, substr(Species, 1, 5))
# dt.iris[Sepal.Length > 5, `:=`(shortSpecies, substr(Species, 1, 5))] 

make_new_col_expr(dt.iris,, shortSpecies, substr(Species, 1, 5))
# dt.iris[, `:=`(shortSpecies, substr(Species, 1, 5))]

To understand the usage of `substitute(inputDT)`, here are the relevant quotes:

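Substitution takes place by examining each component of the parse tree as follows: If it is not a bound symbol in env, it is unchanged. If it is a promise object, i.e., a formal argument to a function or explicitly created using delayedAssign(), the expression slot of the promise replaces the symbol. If it is an ordinary variable, its value is substituted...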

-- `?substitute` or Web

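Promise objects are part of R's lazy evaluation mechanism. They contain three slots: a value, an expression, and an environment. When a function is called the arguments are matched and then each of the formal arguments is bound to a promise. The expression that was given for that formal argument and a pointer to the environment the function was called from are stored in the promise.
Until that argument is accessed there is no value associated with the promise. When the argument is accessed, the stored expression is evaluated in the stored environment, and the result is returned. The result is also saved by the promise. The substitute function will extract the content of the expression slot. This allows the programmer to access either the value or the expression associated with the promise.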

-- Promise objects in the R Language Definition

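First, you run `make_new_col_expr(dt.iris, Sepal.Length > 5, shortSpecies, substr(Species, 1, 5))`. Then you enter the body of the function `make_new_col_expr`. At this stage, the variable `inputDT` is a promise object whose value slot is empty, whose expression slot is `dt.iris`, and whose environment slot is the Global Environment (because that is the environment the function was called from). So when you call `substitute(inputDT)`, the result is `dt.iris` (the expression slot).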
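The promise behaviour can be observed with a one-line function. This is a minimal sketch in base R; `show_arg_expr` and `not_defined_yet` are hypothetical names used only for illustration.

```r
# Hypothetical helper: returns the expression the caller typed for `arg`.
show_arg_expr <- function(arg) {
  substitute(arg)  # reads the promise's expression slot without evaluating it
}

# The symbol is returned even though no variable `not_defined_yet` exists,
# because the promise is never forced.
captured <- show_arg_expr(not_defined_yet)
print(class(captured))
print(deparse(captured))

# A call is captured as a call, not as its value.
captured_call <- show_arg_expr(Sepal.Length > 5)
print(deparse(captured_call))
```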

If you want the function's inputs to be strings rather than expressions as in point 2, then:

make_new_col_str <- function(inputDT, cond, newColName, construction) {
    ll <- substitute(
        dt[cond, newColName := construction],
        list(
            dt = substitute(inputDT),
            cond = if (missing(cond) || !nzchar(cond)) substitute() else str2lang(cond),
            newColName = str2lang(newColName),
            construction = str2lang(construction)
        )
    )
    cat(deparse(ll), "\n")
    eval(ll)
}
make_new_col_str(dt.iris, "Sepal.Length > 5", "shortSpecies", "substr(Species, 1, 5)")
make_new_col_str(dt.iris, , "shortSpecies", "substr(Species, 1, 5)")

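Basically, you use `str2lang` to convert strings to expressions. This is similar to `parse(text = "...")`, but using `str2lang` is more concise and clearer to the reader. That said, this approach uses strings, so I urge you to avoid it and choose point 2 above.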
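The difference between the two functions shows up when the string holds more than one expression. This is a small sketch in base R; the input strings are made up for illustration. Note that a single expression can still contain arbitrary code, so `str2lang` does not by itself make string input safe.

```r
# parse() returns an expression vector, which may hold several expressions.
parsed <- parse(text = "Sepal.Length > 5; message('extra code')")
print(length(parsed))

# str2lang() returns a single language object.
single <- str2lang("Sepal.Length > 5")
print(class(single))

# str2lang() signals an error when the string holds more than one expression.
result <- tryCatch(
  str2lang("Sepal.Length > 5; message('extra code')"),
  error = function(e) "error"
)
print(result)
```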

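PS: Before I started to understand `substitute` and things related to computing on the language, I struggled to find solutions to problems similar to yours here (i.e. programming on data.table).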
