Gzip a single field in JSON in Ruby

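Q:

Suppose I have a Ruby Hash with one attribute that is a very large string. It is so large that compressing the string might make sense. Compressing the string with ActiveSupport::Gzip.compress is trivial, but turning that Hash into JSON is proving to be a problem because of encoding.

Basically, this code fails: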

{ compressed: ActiveSupport::Gzip.compress('asdf') }.to_json

It raises the following error:

JSON::GeneratorError: Invalid Unicode [8b 08 00 56 dd] at 1

Hashes that contain no compressed data are encoded as UTF-8 when converted with to_json, but calling ActiveSupport::Gzip.compress('asdf').encode('UTF-8') fails with the following error:

Encoding::UndefinedConversionError: "\x8B" from ASCII-8BIT to UTF-8

Is this a fool's errand? Can my goal be achieved?
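Both errors appear to come from the same cause: gzip output is raw binary (tagged ASCII-8BIT), and its second byte, 0x8b, cannot appear at that position in valid UTF-8. Here is a minimal sketch that shows this. It assumes the standard library's Zlib.gzip is an acceptable stand-in for ActiveSupport::Gzip.compress, so ActiveSupport is not needed to run it.

```ruby
require 'zlib'

# Assumption: Zlib.gzip stands in for ActiveSupport::Gzip.compress here.
gz = Zlib.gzip('asdf')

# Gzip output is tagged as binary (ASCII-8BIT), not as text.
binary_tagged = (gz.encoding == Encoding::BINARY)

# Every gzip stream starts with the magic bytes 0x1f 0x8b.
magic = gz.bytes.first(2)

# Relabeling the same bytes as UTF-8 does not make them valid UTF-8,
# because 0x8b is a continuation byte with no lead byte before it.
valid_as_utf8 = gz.dup.force_encoding(Encoding::UTF_8).valid_encoding?

puts binary_tagged, magic.inspect, valid_as_utf8
```

So the bytes need an extra text-safe encoding step before they can go into a JSON string. Changing the encoding label alone is not enough.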

json ruby gzip

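Compressing (and Base64-encoding) just one value makes the JSON harder to work with: the receiver now has to know to decompress that one value. It might even make everything bigger! I agree with Pascal: compress the whole thing, or reconsider what you are sending. We would need more context to know what is best, but this smells like an XY problem.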

A:

2 votes Mark Adler 10/24/2023 #1

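Base-64 encode the gzip output so that it consists of readable characters and is therefore valid UTF-8. That expands the data by about a third, cancelling some of the compression. You can also encode with more of the valid characters, e.g. Base-85, to reduce the impact; that expands the data by about a fourth. With some work, you should be able to get the increase down to close to 1/7.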
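In Ruby, the Base-64 route could look roughly like the sketch below. The helper names pack_field and unpack_field are made up for illustration, Zlib.gzip again stands in for ActiveSupport::Gzip.compress, and pack('m0') is the core-Ruby spelling of strict Base64 (the same output as Base64.strict_encode64, without the extra require).

```ruby
require 'json'
require 'zlib'

# Hypothetical helper: gzip a string, then Base64-encode it so the
# result is plain ASCII and safe to place in a JSON string.
def pack_field(text)
  [Zlib.gzip(text)].pack('m0')
end

# Hypothetical helper: reverse pack_field. Zlib.gunzip returns a binary
# string, so it is relabeled as UTF-8 (assuming the original was UTF-8).
def unpack_field(encoded)
  Zlib.gunzip(encoded.unpack1('m0')).force_encoding(Encoding::UTF_8)
end

big = 'lorem ipsum ' * 10_000
json = { compressed: pack_field(big) }.to_json
restored = unpack_field(JSON.parse(json)['compressed'])

puts restored == big
puts "original: #{big.bytesize} bytes, JSON: #{json.bytesize} bytes"
```

Whoever reads the JSON has to know that this one field needs unpack_field, which is the coupling the comment on the question warns about.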

Here is example code in C that encodes bytes as symbols in 1..127, all of which are valid UTF-8. (JSON does not permit null bytes in strings.) The resulting expansion factor is about 1.145.

#include <stddef.h>

// Encode, reading six or seven bits from bin[0..len-1] to encode each symbol
// to *enc, where a symbol is one byte in the range 1..127. enc must have room
// for at least ceil((len * 4) / 3) symbols. The average number of encoded
// symbols for random input is 1.1454 * len. The number of encoded symbols is
// returned.
size_t enc127(char *enc, unsigned char const *bin, size_t len) {
    unsigned buf = 0;
    int bits = 0;
    size_t i = 0, k = 0;
    for (;;) {
        if (bits < 7) {
            if (i == len)
                break;
            buf = (buf << 8) | bin[i++];
            bits += 8;
        }
        unsigned sym = ((buf >> (bits - 7)) & 0x7f) + 1;
        if (sym > 0x7e) {
            enc[k++] = 0x7f;
            bits -= 6;
        }
        else {
            enc[k++] = sym;
            bits -= 7;
        }
    }
    if (bits)
        enc[k++] = ((buf << (7 - bits)) & 0x7f) + 1;
    return k;
}

// Decode, converting each symbol from enc, which must be in the range 1..127,
// into 6 or 7 bits in the output, from which 8 bits at a time is written to
// bin. bin must have room for at least floor((len * 7) / 8) bytes. The number
// of decoded bytes is returned.
size_t dec127(unsigned char *bin, char const *enc, size_t len) {
    unsigned buf = 0;
    int bits = 0;
    size_t k = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned sym = enc[i];
        if (sym == 0x7f) {
            buf = (buf << 6) | 0x3f;
            bits += 6;
        }
        else {
            buf = (buf << 7) | (sym - 1);
            bits += 7;
        }
        if (bits >= 8) {
            bin[k++] = buf >> (bits - 8);
            bits -= 8;
        }
    }
    return k;
}
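Since the question is about Ruby, here is a rough Ruby port of the two C routines. It is a sketch, not tuned for speed. It follows the same logic as the C code, keeps the same function names, and masks buf explicitly because Ruby integers do not wrap around like a C unsigned.

```ruby
require 'json'

# Hypothetical Ruby port of the C enc127: maps bytes to symbols in 1..127.
def enc127(bin)
  out = []
  buf = 0
  bits = 0
  bytes = bin.bytes
  i = 0
  loop do
    if bits < 7
      break if i == bytes.length
      # Ruby integers never overflow, so mask to the low 16 bits by hand.
      buf = ((buf << 8) | bytes[i]) & 0xffff
      i += 1
      bits += 8
    end
    sym = ((buf >> (bits - 7)) & 0x7f) + 1
    if sym > 0x7e
      out << 0x7f # stands for six 1 bits
      bits -= 6
    else
      out << sym  # stands for seven bits
      bits -= 7
    end
  end
  # Flush leftover bits, padded with zeros (0 is truthy in Ruby, so compare).
  out << (((buf << (7 - bits)) & 0x7f) + 1) if bits > 0
  out.pack('C*').force_encoding(Encoding::US_ASCII)
end

# Hypothetical Ruby port of the C dec127: symbols in 1..127 back to bytes.
def dec127(enc)
  out = []
  buf = 0
  bits = 0
  enc.each_byte do |sym|
    if sym == 0x7f
      buf = ((buf << 6) | 0x3f) & 0xffff
      bits += 6
    else
      buf = ((buf << 7) | (sym - 1)) & 0xffff
      bits += 7
    end
    next if bits < 8
    out << ((buf >> (bits - 8)) & 0xff)
    bits -= 8
  end
  out.pack('C*') # binary (ASCII-8BIT) string
end

payload = Random.new(42).bytes(10_000)
encoded = enc127(payload)
json = { compressed: encoded }.to_json
decoded = dec127(JSON.parse(json)['compressed'])

puts decoded == payload
puts format('in-memory expansion: %.4f', encoded.bytesize.to_f / payload.bytesize)
```

The 1.1454 figure follows from the bit counts: for random input a symbol carries 7 bits unless its top seven bits are 1111110 or 1111111 (probability 2/128), in which case it carries 6, so the average is 7 - 1/64 bits per symbol and 8 / (7 - 1/64) ≈ 1.1454 symbols per byte. One caveat is that this factor describes the encoded string in memory. JSON has to escape control characters (1..31), the double quote, and the backslash, so the serialized JSON text will be larger than that.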

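@tadman I'm not sure why you say I can't. I did, and it works fine. I tested many inputs and could not find one that fails. I can gzip the content, Base64 the result, then reverse the process, and I always get the input back. The solution here is not to "label it as UTF-8"; it is to convert it into something else. As for why I need this, that is not really important, IMO. Yes, data can be compressed in transit or at rest, but some code needs to operate on these hashes without the memory overhead of huge strings.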

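0 votes tadman 10/26/2023

What I mean is that the encoding UTF-8 uses conflicts with the output of compression methods like Gzip. Without an additional encoding, you cannot have a string that is both valid UTF-8 and valid Gzip. I think Base64 is a huge step backwards here, and the real solution is to compress the whole document, but that's your call. I would just encourage you to benchmark this on many sample documents and see how it performs uncompressed, fully compressed, and in this odd partially compressed form.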
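A size comparison along those lines might start from something like this sketch. The sample document is invented for illustration, and real measurements should use representative documents and probably timing and memory as well as byte counts.

```ruby
require 'json'
require 'zlib'

# Hypothetical sample document: one large text field plus small metadata.
doc = { id: 1, title: 'example', body: 'The quick brown fox. ' * 5_000 }

# 1. Plain JSON, nothing compressed.
plain = doc.to_json

# 2. The whole JSON document gzipped (binary, e.g. for transport or storage).
whole = Zlib.gzip(plain)

# 3. Only the large field gzipped and Base64-encoded, then serialized.
partial = doc.merge(body: [Zlib.gzip(doc[:body])].pack('m0')).to_json

sizes = {
  'uncompressed' => plain.bytesize,
  'whole document gzipped' => whole.bytesize,
  'one field gzipped + Base64' => partial.bytesize
}
sizes.each { |label, bytes| puts format('%-28s %8d bytes', label, bytes) }
```

Highly repetitive text like this sample compresses far better than typical data, so the numbers only show the method, not what to expect in practice.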
