在 Java 中存储一组非常大的数字-解网

问：

我正在尝试存储一组范围从 0 到 ~600 亿的数字，其中该集合开始时是空的，然后逐渐变得更密集，直到它包含该范围内的每个数字。该集合不必能够删除数字。目前，我的方法是将集合表示为一个非常长的布尔数组，并将该数组存储在文本文件中。我为此创建了一个类，并测试了 RandomAccessFile 和 FileChannel，数字范围限制在 0 到 20 亿之间，但在这两种情况下，该类在添加和查询数字方面都比使用常规布尔数组慢得多。这是我班级的当前状态：

import java.io.*;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.util.*;
public class FileSet {
    private static final int BLOCK=10_000_000;
    private final long U;
    private final String fname;
    private final FileChannel file;
    public FileSet(long u, String fn) throws IOException {
        U=u;
        fname=fn;
        BufferedOutputStream out=new BufferedOutputStream(new FileOutputStream(fname));
        long n=u/8+1;
        for (long rep=0; rep<n/BLOCK; rep++) out.write(new byte[BLOCK]);
        out.write(new byte[(int)(n%BLOCK)]);
        out.close();
        file=new RandomAccessFile(fn,"rw").getChannel();
    }
    public void add(long v) throws IOException {
        if (v<0||v>=U) throw new RuntimeException(v+" out of range [0,"+U+")");
        file.position(v/8);
        ByteBuffer b=ByteBuffer.allocate(1); file.read(b);
        file.position(v/8);
        file.write(ByteBuffer.wrap(new byte[] {(byte)(b.get(0)|(1<<(v%8)))}));
    }
    public boolean has(long v) throws IOException {
        if (v<0||v>=U) return false;
        file.position(v/8);
        ByteBuffer b=ByteBuffer.allocate(1); file.read(b);
        return ((b.get(0)>>(v%8))&1)!=0;
    }
    public static void main(String[] args) throws IOException {
        long U=2000_000_000;
        SplittableRandom rnd=new SplittableRandom(1);
        List<long[]> actions=new ArrayList<>();
        for (int i=0; i<1000000; i++) actions.add(new long[] {rnd.nextInt(2),rnd.nextLong(U)});

        StringBuilder ret=new StringBuilder(); {
            System.out.println("boolean[]:");
            long st=System.currentTimeMillis();
            boolean[] b=new boolean[(int)U];
            System.out.println("init time="+(System.currentTimeMillis()-st));
            st=System.currentTimeMillis();
            for (long[] act:actions)
                if (act[0]==0) b[(int)act[1]]=true;
                else ret.append(b[(int)act[1]]?"1":"0");
            System.out.println("query time="+(System.currentTimeMillis()-st));
        }

        StringBuilder ret2=new StringBuilder(); {
            System.out.println("FileSet:");
            long st=System.currentTimeMillis();
            FileSet fs=new FileSet(U,"FileSet/"+U+"div8.txt");
            System.out.println("init time="+(System.currentTimeMillis()-st));
            st=System.currentTimeMillis();
            for (long[] act:actions) {
                if (act[0]==0) fs.add(act[1]);
                else ret2.append(fs.has(act[1])?"1":"0");
            }
            System.out.println("query time="+(System.currentTimeMillis()-st));
            fs.file.close();
        }
        if (!ret.toString().equals(ret2.toString())) System.out.println("MISMATCH");
    }
}

和输出：

boolean[]:
init time=1248
query time=148
FileSet:
init time=269
query time=3014

此外，当将范围从 20 亿增加到 100 亿时，查询的总运行时间会大幅增加，尽管理论上总运行时间应大致保持不变。当我单独使用该类时（因为布尔数组不再适用于这么大的范围），查询时间从 ~3 秒到 ~50 秒。当我将范围增加到 600 亿时，时间增加到 ~240 秒。我的问题是：有没有一种更快的方法来访问和修改任意索引的非常大的文件？有没有一种完全不同的方法来存储大型整数集，比我目前的方法更快？

Java IO 设置 NIO

如果集合超过物理内存的大小，则虚拟内存用于存储映射的某些部分，并且在访问位时可能需要磁盘加载。除了购买更多内存或更快的磁盘驱动器外，没有什么可做的。600亿对于随机访问来说是一个巨大的数字，而你已经到了机器的极限。

0赞 AKX 11/15/2021

@markspace 600 亿字节是 60 GB，可以放入内存中。600 亿比特除以 8，即 7.5 GB，现在可以很容易地放入内存中。

1赞 AKX 11/15/2021

您可能希望将 en.wikipedia.org/wiki/Bloom_filter 视为检查集合成员资格的第一阶段......

0赞 user17415259 11/15/2021

是的，您可以存储非常多的集合数据。

答：

0赞 Torben 11/15/2021 #1

Boolean数组是一种非常低效的信息存储方式，因为每个布尔值占用 8 位。您应该改用 a。但 BitSet 也有 20 亿的限制，因为它使用原始 int 值作为参数（并限制内部长数组的大小）。BitSetInteger.MAX_VALUE

跨越超过 20 亿个条目的节省空间的内存中替代方法是创建您自己的 BitSet 包装器，该包装器将数据拆分为子集并为您执行索引：

public class LongBitSet {
    // TODO: Initialize array and add error checking.
    private final BitSet bitSets = new BitSet[64];
    public void set(long index) {
        bitSets[(int) (index / Integer.MAX_VALUE)]
            .set((int) (index % Integer.MAX_VALUE));
    }
}

但也有其他选择。如果数据非常密集，则使用运行长度编码是增加内存容量的廉价方法。但这可能涉及 B 树结构，以使其访问效率更高。这些只是指针。很多正确答案完全取决于你实际使用数据结构的方式。

在使用非常大文件的类中，我实际上使用每个字节来表示集合中 8 个数字的成员/非成员身份，因此当范围设置为 ~600 亿时，我使用的文件的实际大小仅为 ~7.5 GB（我在文件中使用扩展 ASCII）。此外，我的集合开始时是完全空的，然后逐渐变得更密集，直到它包含 0 到 ~600 亿范围内的每个数字，所以我不确定多少运行长度编码 + B 树会提高性能。

0赞 coolcomputery 11/16/2021 #2

事实证明，最简单的解决方案是使用 64 位 JVM 并通过在终端中运行我的 Java 程序来增加 Java 堆空间，并带有类似 .然后，我可以简单地使用一个 s 数组来隐式存储整个集合。-Xmx10glong

上一个：从目录中选择一个文件/文件夹

下一个：Java Nio ByteBuffer 在缓冲区达到其边界时截断 Unicode 字符

在 Java 中存储一组非常大的数字

Storing a very large set of numbers in Java

评论

评论