Extract largest n values and their indices from a matrix

Q:

I have a matrix of roughly 150,000 x 150,000 from which I need to extract the largest n values and their indices. n will also be large, between 10 and 50 million.

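I cannot do this in "normal" R (e.g., with data.table), because the matrix would have more than 2^31 rows if I converted it to long format. Could someone help implement this in Rcpp?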

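I found a version here, but unfortunately it does not return the indices, which are important to me. The function should return something that I can convert to the format [(row index), (column index), (value)].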

I would appreciate any help.

r matrix max rcpp

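In this case you are welcome to learn enough C++ to do it :) (I just hopped onto a machine with 64 GB of RAM, used #define ARMA_64BIT_WORD before #include <RcppArmadillo.h>, and allocated 30 GB for a 150k x 50k integer (half-size) matrix, using the form 'n, k, arma::fill::none'.) But since your matrix is incompatible with R, you will have to write your own small custom accessors that fit the R world. In short: "possible", but far from a slam dunk.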

1 upvote Nils R 11/15/2023

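@zx8754 I checked the two packages you mentioned. The bigmemory package throws the error "long vectors not supported yet" when I try to convert the matrix. As far as I can tell, the iCAMP package only provides the single largest value, whereas I need ~10 million. So sadly, neither seems to be a solution.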

1 upvote Maël 11/15/2023

If efficiency is a concern, you can try the kit::topn or collapse::nth functions.

Answers:

4 upvotes user2554330 11/15/2023 #1

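Your matrix has only a bit more than 10 times 2^31 elements, so you can split it into 11 or more parts, find the top n values in each part, then combine those subsets and find the top n values overall.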
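The figure of 11 parts follows from simple arithmetic. A quick check, assuming the square 150,000 x 150,000 matrix from the question (the variable names are only illustrative):

```r
# Element count of a 150,000 x 150,000 matrix, computed as a double
# so that it does not overflow R's 32-bit integers
n_elem <- 150000 * 150000

# Largest length of an ordinary (non-long) vector in R
limit <- 2^31 - 1

# Ratio of the two: a bit more than 10
n_elem / limit

# Smallest number of parts such that each part stays within the limit
min_parts <- ceiling(n_elem / limit)
min_parts
```

Using more parts than this minimum lowers the peak memory needed for each call to order(), at the cost of more iterations.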

For example, using a much smaller matrix:

set.seed(123)
dat <- matrix(rnorm(10000), 100,100)

n <- 10    
parts <- 5

# Calculate the size of each part.  I'll assume it's 
# an integer; it's just a little more complicated if not
len <- length(dat)/parts  

# Record the original rows and columns.  We'll be
# treating everything as vectors, but in standard R
# matrices that doesn't require any operations.  If
# you're using dataframes or tibbles or data tables
# you will need to do a conversion.

row <- row(dat)
col <- col(dat)
result <- NULL

for (i in 1:parts) {
  subset <- seq_len(len) + (i-1)*len
  o <- order(dat[subset], decreasing = TRUE)
  keep <- subset[o[1:n]]
  result <- rbind(result, 
                  data.frame(value = dat[keep],
                             row = row[keep],
                             col = col[keep]))
}

# Now get the final value
o <- order(result$value, decreasing = TRUE)
result <- result[o[1:n], ]
result
#>       value row col
#> 41 3.847768  56  82
#> 21 3.715721  95  60
#> 22 3.445992   5  45
#> 11 3.421095  82  30
#> 42 3.397894  61  82
#> 1  3.390371  24  14
#> 2  3.290517  22  17
#> 31 3.275908  36  68
#> 23 3.271783  76  44
#> 3  3.241040  64   2

# Check it:
max(dat)
#> [1] 3.847768
dat[result$row[1], result$col[1]]
#> [1] 3.847768

Created on 2023-11-15 with reprex v2.0.2
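The example above assumes that the number of parts divides the matrix length evenly, and it builds full row() and col() matrices, which would themselves be very large at the scale in the question. The following is a minimal sketch of a variation, not the answer's own code: the helper top_n_chunked() and its chunk_size argument are hypothetical names. It carries the running top n from chunk to chunk and derives rows and columns from the linear indices by arithmetic. It assumes that chunk_size + n stays below 2^31 so that order() never sees a long vector.

```r
# Hypothetical helper: top-n values of a matrix, scanned in chunks of
# at most `chunk_size` elements (the last chunk may be shorter).
top_n_chunked <- function(mat, n, chunk_size = 1e6) {
  total <- length(mat)                    # a double if it exceeds 2^31 - 1
  starts <- seq(1, total, by = chunk_size)
  best_idx <- numeric(0)                  # linear indices kept so far
  best_val <- numeric(0)                  # corresponding values
  for (s in starts) {
    idx <- seq(s, min(s + chunk_size - 1, total))
    # Pool the running winners with this chunk, then keep the n largest
    vals <- c(best_val, mat[idx])
    idx <- c(best_idx, idx)
    keep <- head(order(vals, decreasing = TRUE), n)
    best_val <- vals[keep]
    best_idx <- idx[keep]
  }
  # Column-major layout: recover row and column from the linear index
  nr <- nrow(mat)
  data.frame(value = best_val,
             row = (best_idx - 1) %% nr + 1,
             col = (best_idx - 1) %/% nr + 1)
}

set.seed(123)
dat <- matrix(rnorm(10000), 100, 100)
# 3000 does not divide 10000, so the last chunk is shorter
res <- top_n_chunked(dat, n = 10, chunk_size = 3000)
res
```

Pooling the winners as the loop goes keeps at most n + chunk_size values in memory at any time, instead of collecting n rows per part and merging them at the end.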

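# Recursively find the largest remaining value, record it, and mask it with -Inf
find_and_replace_max <- function(m, runs = 10, indices_max = NULL, values_max = NULL){
  if(runs > 0){
    index_max <- which.max(m)                  # linear index of the current maximum
    indices_max <- c(indices_max, index_max)
    values_max <- c(values_max, m[index_max])  # reuse the index instead of rescanning with max(m)
    m[index_max] <- -Inf                       # exclude this element from later runs
    find_and_replace_max(m, runs - 1, indices_max, values_max)
  } else {
    data.frame(index = indices_max, value = values_max)
  }
}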

Example:

## create a 1000 x 1000 matrix:
set.seed(123)
n <- 1e3
m <- 1e3

## do ten runs for the top 10:
find_and_replace_max(matrix(rnorm(n*m), n, m), runs = 10)

Output:

>     index    value
1  661312 4.850767
2  651832 4.790390
3  384839 4.759086
4  989980 4.718727
5  309215 4.599884
6  310358 4.560770
7  234371 4.521510
8  599704 4.492426
9  206891 4.438207
10 290630 4.335663
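The index column above is a linear (column-major) index, whereas the question asks for row and column positions. Base R's arrayInd() can do that conversion. A small sketch on a toy matrix, with illustrative variable names that are not part of the answer:

```r
set.seed(123)
m <- matrix(rnorm(20), nrow = 4, ncol = 5)

# Linear indices of the three largest values
lin <- order(m, decreasing = TRUE)[1:3]

# arrayInd() maps linear indices to (row, column) pairs
rc <- arrayInd(lin, dim(m))

# Assemble the [row, column, value] layout requested in the question
top <- data.frame(row = rc[, 1], col = rc[, 2], value = m[lin])
top
```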

3 upvotes jblood94 11/15/2023 #4

A statistical approach should work well here:

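1. Take a relatively large but manageable sample of the matrix's elements (e.g., 1e6).
2. Calculate the value k that gives a high probability (e.g., 99%) that the kth largest sampled value is less than the nth largest value in the population.
3. Get the linear indices of the values in the matrix that are greater than the value found in step 2.
4. If the number of indices returned in step 3 is less than n, go back to step 1. Otherwise, return the linear indices of the n largest values among those whose indices were found in step 3.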

An example implementation that uses Rfast::nth in steps 2 and 4:

library(Rfast)

topn.idx <- function(x, n) {
  y <- nth(x[sample(length(x), 1e6)], qbinom(0.99, 1e6, n/length(x)), 1, TRUE)
  i <- which(x > y)
  if (length(i) < n) Recall(x, n) else i[nth(x[i], n, n, TRUE, TRUE)]
}
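The choice of k through qbinom() rests on this reasoning: the number of sampled elements that belong to the population's top n is approximately Binomial(sample size, n/length(x)), so with probability of about 0.99 fewer than k of them do, and the kth largest sampled value then lies below the nth largest value overall. The following toy-scale simulation is a sketch for checking that claim empirically. The sizes N, n and m are assumptions chosen to run quickly, and base sort() stands in for Rfast::nth.

```r
set.seed(42)
N <- 1e6    # population size (toy scale)
n <- 1e4    # number of top values wanted
m <- 1e4    # sample size
x <- rexp(N)

# k chosen as in topn.idx(): the 99% quantile of Binomial(m, n / N)
k <- qbinom(0.99, m, n / N)

# True n-th largest value, for checking only (affordable at toy scale)
nth_largest <- sort(x, decreasing = TRUE)[n]

# Fraction of resamples whose k-th largest value falls below the
# n-th largest value of the whole population
hits <- replicate(200, {
  s <- x[sample(N, m)]
  sort(s, decreasing = TRUE)[k] < nth_largest
})
mean(hits)
```

A fraction close to 0.99 means the retry branch through Recall() in topn.idx() should rarely be taken.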

Example usage on a 2^30-by-3 matrix on an aging laptop with sufficient RAM:

set.seed(28404888)
x <- matrix(0, 2^30, 3)
for (i in 1:3) x[,i] <- rexp(2^30)
system.time({idx <- topn.idx(x, 3e6L)})
#>    user  system elapsed 
#>    9.90  118.50  310.62
length(idx)
#> [1] 3000000
range(x[idx])
#> [1]  6.979972 22.252548

The desired data.table can be obtained with:

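library(data.table)

# Convert zero-based linear indices to one-based row (rI) and column (cI) indices
dtIdx <- data.table(rI = idx - 1L)[
  , `:=`(rI = rI%%nrow(x) + 1L, cI = rI%/%nrow(x) + 1L, v = x[idx])
]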

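Additional gains could probably be had by optimizing the sample size and the probability of capturing the nth largest value, but I suspect they would be fairly modest.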

Additional benchmarking

Interestingly, on a large matrix with fewer than 2^31 elements, topn.idx is faster than Rfast::nth (which in turn is faster than kit::topn).

library(kit)
set.seed(28404888)

x <- matrix(rexp(1e8), 1e4, 1e4)
idx <- sort(topn.idx(x, 1e6))
identical(idx, sort(as.integer(nth(x, 1e6, 1e6, TRUE, TRUE))))
#> [1] TRUE
identical(idx, sort(topn(x, 1e6, index = TRUE)))
#> [1] TRUE
bench::mark(
  topn.idx = topn.idx(x, 1e6),
  kit = topn(x, 1e6, index = TRUE),
  Rfast = nth(x, 1e6, 1e6, TRUE, TRUE),
  min_iterations = 10,
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 topn.idx   460.32ms 482.57ms     1.91      825MB   2.10  
#> 2 kit           5.93s     6.6s     0.153     385MB   0.0765
#> 3 Rfast         1.86s     1.9s     0.524     771MB   0.524
