Asked by: Mike Snare  Asked: 10/24/2023  Updated: 10/25/2023  Views: 65
Gzip a single field in json in Ruby
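Q:
Suppose I have a Ruby Hash with one attribute that is a very large string. It is large enough that compressing the string may make sense. Compressing the string with ActiveSupport::Gzip.compress is trivial, but because of encoding, converting that hash to JSON is proving to be a problem.
Basically, this code fails: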
{ compressed: ActiveSupport::Gzip.compress('asdf') }.to_json
with the following error:
JSON::GeneratorError: Invalid Unicode [8b 08 00 56 dd] at 1
Hashes that do not contain any compressed data are encoded as UTF-8 when converted to JSON with to_json, but this call fails with the following error:
ActiveSupport::Gzip.compress('asdf').encode('UTF-8')
Encoding::UndefinedConversionError: "\x8B" from ASCII-8BIT to UTF-8
Am I on a fool's errand? Can my goal be achieved?
A:
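Base-64 encode the gzip output so that it consists of readable characters and is therefore valid UTF-8. That expands the data by about a third, which offsets some of the compression. You could also encode with more of the valid characters, for example Base-85, to reduce the impact; in that case the expansion is about a fourth. With some work, you should be able to get it down to close to a 1/7 increase.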
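As a minimal sketch of the Base-64 approach in Ruby: the helper names compress_field and decompress_field are hypothetical, and Zlib.gzip from the standard library stands in for ActiveSupport::Gzip.compress so that the example runs without Rails. It assumes the original string was UTF-8.

```ruby
require 'json'
require 'zlib'

# Hypothetical helpers, not part of ActiveSupport or the original answer.
# Zlib.gzip is used in place of ActiveSupport::Gzip.compress so the sketch
# has no Rails dependency.
def compress_field(text)
  # 'm0' packs as Base-64 without line breaks; the result is plain ASCII.
  [Zlib.gzip(text)].pack('m0')
end

def decompress_field(encoded)
  # Strict Base-64 decode, then gunzip. The result is tagged as UTF-8 on
  # the assumption that the original string was UTF-8.
  Zlib.gunzip(encoded.unpack1('m0')).force_encoding(Encoding::UTF_8)
end

big = 'asdf' * 10_000
json = { compressed: compress_field(big) }.to_json

restored = decompress_field(JSON.parse(json)['compressed'])
puts restored == big
puts "original: #{big.bytesize} bytes, JSON: #{json.bytesize} bytes"
```

Because the Base-64 alphabet contains no characters that JSON has to escape, the size of the encoded field in the JSON text equals the size of the Base-64 string itself.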
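Here is example code in C that encodes bytes into symbols in 1..127, all of which are valid UTF-8. (JSON does not allow a null byte in a string.) The resulting expansion factor is about 1.145.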
#include <stddef.h>

// Encode, reading six or seven bits from bin[0..len-1] to encode each symbol
// to *enc, where a symbol is one byte in the range 1..127. enc must have room
// for at least ceil((len * 4) / 3) symbols. The average number of encoded
// symbols for random input is 1.1454 * len. The number of encoded symbols is
// returned.
size_t enc127(char *enc, unsigned char const *bin, size_t len) {
    unsigned buf = 0;
    int bits = 0;
    size_t i = 0, k = 0;
    for (;;) {
        if (bits < 7) {
            if (i == len)
                break;
            buf = (buf << 8) | bin[i++];
            bits += 8;
        }
        unsigned sym = ((buf >> (bits - 7)) & 0x7f) + 1;
        if (sym > 0x7e) {
            enc[k++] = 0x7f;
            bits -= 6;
        }
        else {
            enc[k++] = sym;
            bits -= 7;
        }
    }
    if (bits)
        enc[k++] = ((buf << (7 - bits)) & 0x7f) + 1;
    return k;
}
// Decode, converting each symbol from enc, which must be in the range 1..127,
// into 6 or 7 bits in the output, from which 8 bits at a time is written to
// bin. bin must have room for at least floor((len * 7) / 8) bytes. The number
// of decoded bytes is returned.
size_t dec127(unsigned char *bin, char const *enc, size_t len) {
    unsigned buf = 0;
    int bits = 0;
    size_t k = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned sym = enc[i];
        if (sym == 0x7f) {
            buf = (buf << 6) | 0x3f;
            bits += 6;
        }
        else {
            buf = (buf << 7) | (sym - 1);
            bits += 7;
        }
        if (bits >= 8) {
            bin[k++] = buf >> (bits - 8);
            bits -= 8;
        }
    }
    return k;
}
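Since the question is about Ruby, here is a rough, unoptimized port of the two C routines, offered as a sketch rather than a tuned implementation. The method names simply mirror the C functions. The masking of buf is an addition to keep Ruby's arbitrary-precision integers small, and Zlib.gzip again stands in for ActiveSupport::Gzip.compress.

```ruby
require 'json'
require 'zlib'

# Sketch of a Ruby port of enc127: binary String in, String of bytes 1..127 out.
def enc127(bin)
  out = String.new(encoding: Encoding::US_ASCII)
  bytes = bin.bytes
  buf = 0
  bits = 0
  i = 0
  loop do
    if bits < 7
      break if i == bytes.length
      # At most 6 bits are pending here, so 16 bits of buffer is enough.
      buf = ((buf << 8) | bytes[i]) & 0xffff
      i += 1
      bits += 8
    end
    sym = ((buf >> (bits - 7)) & 0x7f) + 1
    if sym > 0x7e
      out << 0x7f          # six one bits are encoded as the symbol 127
      bits -= 6
    else
      out << sym           # seven bits are encoded as 1..126
      bits -= 7
    end
  end
  # Flush any remaining bits, padded on the right with zeros.
  out << (((buf << (7 - bits)) & 0x7f) + 1) if bits > 0
  out
end

# Sketch of a Ruby port of dec127: String of bytes 1..127 in, binary String out.
def dec127(enc)
  out = String.new(encoding: Encoding::BINARY)
  buf = 0
  bits = 0
  enc.each_byte do |sym|
    if sym == 0x7f
      buf = (buf << 6) | 0x3f
      bits += 6
    else
      buf = (buf << 7) | (sym - 1)
      bits += 7
    end
    if bits >= 8
      out << ((buf >> (bits - 8)) & 0xff)
      bits -= 8
    end
    buf &= 0xff            # fewer than 8 bits are pending at this point
  end
  out
end

text = 'asdf' * 10_000
json = { compressed: enc127(Zlib.gzip(text)) }.to_json
decoded = dec127(JSON.parse(json)['compressed'])
puts Zlib.gunzip(decoded) == text
```

One thing to check before adopting this for JSON: a JSON generator has to escape the control characters U+0001 through U+001F, the double quote, and the backslash, so the serialized JSON text is longer than the symbol count that enc127 produces. It is worth comparing json.bytesize against the Base-64 variant on real data.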