Q:

I have the following data:

x <- c(F, T, T, T, F, T, T, T, T, T)
names(x) <- letters[1:10]
y <- c(T, F, T, T, T, F, T, T, T, T)
names(y) <- letters[1:10]
z <- c(T, T, F, T, T, T, T, T, F, F)
names(z) <- letters[1:10]
a <- c(T, T, T, T, T, F, T, F, T, T, T, T, T)
names(a) <- letters[1:13]

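I want to create a function that subsets the first 5 consecutive T values, but starting from the back. For example, if I pass the object x through the function, I should get the following output: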

#    f    g    h    i    j 
# TRUE TRUE TRUE TRUE TRUE

Or, if I pass y through it, I should get NA, because there is no run of 5 consecutive T values when looking from the back.

z has 5 consecutive T values in the middle, so those values should be returned.

#    d    e    f    g    h 
# TRUE TRUE TRUE TRUE TRUE

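In a, there are two sets of 5 consecutive T values, one at the beginning and one at the end. Since the first set from the back is the one at the end, those values should be returned.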

#    i    j    k    l    m 
# TRUE TRUE TRUE TRUE TRUE

How can I write this function?

Tags: r, vector, sequence

Output

> f(x)
   f    g    h    i    j
TRUE TRUE TRUE TRUE TRUE

> f(y)
[1] NA

> f(z)
   d    e    f    g    h
TRUE TRUE TRUE TRUE TRUE

> f(a)
   i    j    k    l    m
TRUE TRUE TRUE TRUE TRUE

Benchmark

set.seed(42)
x <- sample(c(TRUE, FALSE), size = 1e6, replace = TRUE)
bench::mark(foo(x), last5(x), f_zoo(x), f_gregexpr(x), f_rle(x), f_embed(x))[1:4]
# # A tibble: 6 × 4
#   expression         min   median   `itr/sec`
#   <bch:expr>    <bch:tm> <bch:tm>       <dbl>
# 1 foo(x)           1.9µs    6.2µs 152792.    
# 2 last5(x)         107ms 149.53ms      5.35  
# 3 f_zoo(x)        14.39s   14.39s      0.0695
# 4 f_gregexpr(x) 259.58ms 283.42ms      3.53  
# 5 f_rle(x)         1.94s    1.94s      0.514 
# 6 f_embed(x)    187.22ms 201.41ms      5.04  

# With sparser TRUEs:
x <- sample(c(TRUE, FALSE), size = 1e6, replace = TRUE, prob = c(0.05, 0.95))
bench::mark(foo(x), last5(x), f_zoo(x), f_gregexpr(x), f_rle(x), f_embed(x))[1:4]
# 1 foo(x)         33.12ms  33.36ms    29.0  
# 2 last5(x)       13.11ms   25.5ms    37.9  
# 3 f_zoo(x)         5.14s    5.14s     0.194
# 4 f_gregexpr(x)  75.98ms  76.72ms    12.6  
# 5 f_rle(x)      208.37ms 221.82ms     4.58 
# 6 f_embed(x)     69.01ms  80.64ms    11.9

9 upvotes MrFlick 8/9/2023 #3

Here is a solution that uses rle to compute the runs of values:

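last5 <- function(x) {
  with(rle(x), {
    # index of the last run of TRUEs that is at least 5 long
    group <- tail(which(lengths >= 5 & values), 1)
    if (length(group) < 1) return(NA)
    # elements before that run, plus the offset to its last 5 values
    start <- sum(lengths[seq_len(group - 1)]) + (lengths[group] - 5) + 1
    x[start:(start + 4)]
  })
}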

This gives the following output:

last5(x)
#    f    g    h    i    j 
# TRUE TRUE TRUE TRUE TRUE 
last5(y)
# [1] NA
last5(z)
#    d    e    f    g    h 
# TRUE TRUE TRUE TRUE TRUE 
last5(a)
#    i    j    k    l    m 
# TRUE TRUE TRUE TRUE TRUE

The idea is that it finds all runs of 5 or more TRUE values, then takes the last such run (if any) and returns the last 5 values from that run.
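To make the run-length idea concrete, here is a small walk-through on the z vector from the question. It only uses base R; the intermediate names r, group, and end are illustrative and do not appear in the answer's code.

```r
# Same z as in the question
z <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)
names(z) <- letters[1:10]

r <- rle(unname(z))   # unname() only keeps the printed runs easy to read
r$lengths             # 2 1 5 2
r$values              # TRUE FALSE TRUE FALSE

# Last run of TRUEs that is at least 5 long
group <- tail(which(r$lengths >= 5 & r$values), 1)   # 3

# End position of that run in z, then step back 4 to get its last 5 values
end <- sum(r$lengths[seq_len(group)])                # 8
z[(end - 4):end]                                     # elements d, e, f, g, h
```

Computing the end of the run and stepping back is equivalent to the start calculation in last5; it is just a slightly different way to arrive at the same five positions.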

library(zoo)

f <- function(zz) {
  wx <- which(rollapplyr(zz, 5, all, fill = FALSE))
  if (length(wx) > 0) zz[seq(to = max(wx), length.out = 5)] else NA
}

# tests

f(x)
##    f    g    h    i    j 
## TRUE TRUE TRUE TRUE TRUE 

f(y)
## [1] NA

f(z)
##    d    e    f    g    h 
## TRUE TRUE TRUE TRUE TRUE 

f(a)
##    i    j    k    l    m 
## TRUE TRUE TRUE TRUE TRUE

4 upvotes Roland 8/10/2023 #5

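This is a chance to show off stats::filter, which is one of my favorite R functions. The following solution is unlikely to be more efficient than a simple for loop (which could easily be implemented with Rcpp). However, the approach allows extending the problem to finding all sequences in an efficient way.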

f_filter <- function(x) {
  x <- rev(x)
  y <- stats::filter(x, rep(1, 5), sides = 1)
  i <- which(y == 5)[1]
  if (is.finite(i)) x[i:(i-4)] else NA
}

f_filter(x)
#   f    g    h    i    j 
#TRUE TRUE TRUE TRUE TRUE 
f_filter(y)
#[1] NA
f_filter(z)
#   d    e    f    g    h 
#TRUE TRUE TRUE TRUE TRUE 
f_filter(a)
#   i    j    k    l    m 
#TRUE TRUE TRUE TRUE TRUE 
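The answer mentions that the stats::filter approach extends to finding all sequences. A minimal sketch of that extension could look like the following; the helper name all_runs_start and its n argument are hypothetical and not part of the answer.

```r
# Hypothetical helper (not from the answer): start indices of every
# window of n consecutive TRUE values, overlapping windows included.
all_runs_start <- function(x, n = 5) {
  # sides = 1 sums each element with the n - 1 elements before it;
  # the first n - 1 results are NA because the window is incomplete
  s <- stats::filter(as.numeric(x), rep(1, n), sides = 1)
  ends <- which(s == n)   # which() drops the NA positions
  ends - n + 1
}

# Same a as in the question
a <- c(rep(TRUE, 5), FALSE, TRUE, FALSE, rep(TRUE, 5))
names(a) <- letters[1:13]

starts <- all_runs_start(a)
starts            # 1 9
names(a)[starts]  # "a" "i"
```

Note that a run longer than n produces several overlapping windows, so a caller that wants one result per run would need to collapse adjacent start indices.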

set.seed(42)
x <- sample(c(TRUE, FALSE), size = 1e6, replace = TRUE)
bench::mark(foo(x), f_filter(x))[1:4]
#  expression       min   median `itr/sec`
#  <bch:expr>  <bch:tm> <bch:tm>     <dbl>
#1 foo(x)         1.9µs    2.1µs  444340. 
#2 f_filter(x)   19.1ms   19.8ms      49.9

x <- sample(c(TRUE, FALSE), size = 1e6, replace = TRUE, prob = c(0.05, 0.95))
bench::mark(foo(x), f_filter(x))[1:4]
#  expression       min   median `itr/sec`
#  <bch:expr>  <bch:tm> <bch:tm>     <dbl>
#1 foo(x)        42.6ms   43.1ms      23.1
#2 f_filter(x)   18.1ms   18.9ms      52.9
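The benchmarks call a function foo whose definition does not appear in this section. The sketch below is not that function; it is only one hypothetical way to write the kind of simple backward-scanning loop the answer alludes to, with the illustrative name last5_loop.

```r
# Hypothetical sketch, not the foo() used in the benchmarks
last5_loop <- function(x, n = 5) {
  run <- 0L
  for (i in rev(seq_along(x))) {
    # extend the current run of TRUEs, or reset it on FALSE/NA
    run <- if (isTRUE(x[[i]])) run + 1L else 0L
    # once n TRUEs have been seen, i is the start of the window
    if (run == n) return(x[i:(i + n - 1L)])
  }
  NA
}

# Same x and y as in the question
x <- c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)
names(x) <- letters[1:10]
y <- c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE)
names(y) <- letters[1:10]

last5_loop(x)   # elements f, g, h, i, j
last5_loop(y)   # NA
```

A loop like this can stop as soon as it finds a qualifying run, so its cost depends on how far from the end that run sits, whereas the vectorized approaches always process the whole vector.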

Check for 5 consecutive TRUE values from the back of a vector
