Fastest way of working out the uncompressed size of a large GZIPPED file

Question 1

I believe the fastest way is to modify gzip so that testing in verbose mode outputs the number of bytes decompressed; on my system, with a 7761108684-byte file, I get

% time gzip -tv test.gz
test.gz:     OK (7761108684 bytes)
gzip -tv test.gz  44.19s user 0.79s system 100% cpu 44.919 total

% time zcat test.gz| wc -c
7761108684
zcat test.gz  45.51s user 1.54s system 100% cpu 46.987 total
wc -c  0.09s user 1.46s system 3% cpu 46.987 total

To modify gzip (1.6, as available in Debian), the patch is as follows:

--- a/gzip.c
+++ b/gzip.c
@@ -61,6 +61,7 @@
 #include <stdbool.h>
 #include <sys/stat.h>
 #include <errno.h>
+#include <inttypes.h>
 
 #include "closein.h"
 #include "tailor.h"
@@ -694,7 +695,7 @@
 
     if (verbose) {
         if (test) {
-            fprintf(stderr, " OK\n");
+            fprintf(stderr, " OK (%jd bytes)\n", (intmax_t) bytes_out);
 
         } else if (!decompress) {
             display_ratio(bytes_in-(bytes_out-header_bytes), bytes_in, stderr);
@@ -901,7 +902,7 @@
     /* Display statistics */
     if(verbose) {
         if (test) {
-            fprintf(stderr, " OK");
+            fprintf(stderr, " OK (%jd bytes)", (intmax_t) bytes_out);
         } else if (decompress) {
             display_ratio(bytes_out-(bytes_in-header_bytes), bytes_out,stderr);
         } else {

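A similar approach has been implemented in gzip and will be included in the release following 1.11; gzip -l now decompresses the data to determine its size.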
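
For readers who cannot patch gzip, the same "decompress and count" idea can be sketched in Python with the standard gzip module. This is only an illustration of the technique, not part of the patch above: the function name `count_uncompressed_bytes` and the 1 MiB chunk size are assumptions, and it will be slower than the C tools timed above.

```python
import gzip


def count_uncompressed_bytes(path, chunk_size=1 << 20):
    """Return the exact uncompressed size by decompressing and discarding the data."""
    total = 0
    with gzip.open(path, "rb") as fobj:
        while True:
            # Read decompressed data in bounded chunks so memory use stays flat.
            chunk = fobj.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total
```

Like `zcat test.gz | wc -c`, this reads the whole file, so its cost grows with the uncompressed size.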

Question 2

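The gzip format stores the uncompressed size in only 4 bytes (the last 4 bytes of the file), so the stored value is actually the size modulo 2**32 (4 GiB). If the real uncompressed size is smaller than 4 GiB, the value is correct, but for files larger than 4 GiB uncompressed, the only way to get the correct value is to read the whole file. It can be estimated, though (Python code below)!

If the compressed size is larger than the stored uncompressed size, the uncompressed size is probably larger than 4 GiB, and we try to guess the correct size by adding a "1" bit to the left until the new size is larger than the compressed size and larger than 4 GiB. Note that this guess can be wrong, for two reasons:

In some cases the compressed size really can be larger than the uncompressed size (for example, when compressing an already compressed file); or
For really big files, we keep shifting the "1" bit to the left several times, which leaves a "hole" between the "1" and the original 32 bits (for example, shifting 5 times results in 10000X, where X is the original 32 bits). The returned value is the minimum expected size of the uncompressed file, since there is no way to "fill the hole" correctly without reading the whole file.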
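
The wrap-around and the "hole" can be made concrete with a little arithmetic. The sketch below uses the 7761108684-byte size quoted earlier on this page as the example; the names `ISIZE_MODULUS` and `candidate_sizes` are made up for illustration.

```python
ISIZE_MODULUS = 2**32  # the gzip trailer stores the size modulo 2**32


def candidate_sizes(stored_size, count=3):
    """List the smallest real sizes that are consistent with a stored 32-bit value."""
    return [stored_size + k * ISIZE_MODULUS for k in range(count)]


real_size = 7761108684
stored = real_size % ISIZE_MODULUS
print(stored)                   # 3466141388
print(candidate_sizes(stored))  # [3466141388, 7761108684, 12056075980]

# Setting a single higher bit (bit 32, then 33, then 34) only reaches the
# candidates with k = 1, 2, 4, so k = 3 falls into the "hole".
single_bit_guesses = [(1 << i) ^ stored for i in range(32, 35)]
skipped = stored + 3 * ISIZE_MODULUS
print(single_bit_guesses)             # [7761108684, 12056075980, 20646010572]
print(skipped in single_bit_guesses)  # False
```

Every candidate is equally consistent with the trailer, which is why the estimate can only be a lower bound.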

Here is the Python code that estimates the uncompressed size (from my rows project):

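import os
import struct

def estimate_gzip_uncompressed_size(filename):
    compressed_size = os.stat(filename).st_size
    with open(filename, mode="rb") as fobj:
        # The last 4 bytes hold the uncompressed size modulo 2**32 (little-endian).
        fobj.seek(-4, 2)
        uncompressed_size = struct.unpack("<I", fobj.read())[0]
    if compressed_size > uncompressed_size:
        # The stored value probably wrapped around: set a single higher "1" bit,
        # moving it left until the guess exceeds both 4 GiB and the compressed size.
        i, value = 32, uncompressed_size
        while value <= 2**32 or value < compressed_size:
            value = (1 << i) ^ uncompressed_size
            i += 1
        uncompressed_size = value
    return uncompressed_size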

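I was having trouble implementing a progress bar to report the import of gzipped CSV files into PostgreSQL (rows pgimport), so I wrote the function above to estimate the real size. The program will know if the estimate is wrong, since it reads the whole file anyway, and it then simply updates the progress bar with the correct "new" value.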
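
One way such a correction could work is sketched below. This is an assumption for illustration, not necessarily what rows pgimport does: `corrected_total` is a hypothetical helper that raises the estimate in 4 GiB steps (keeping it consistent with the stored 32-bit value) as soon as more bytes have been read than the estimate allows.

```python
def corrected_total(bytes_read, estimated_total):
    """Raise the estimate in 4 GiB steps until it covers what was already read."""
    total = estimated_total
    while bytes_read > total:
        # The real size is congruent to the stored value modulo 2**32,
        # so the next plausible total is exactly 4 GiB larger.
        total += 2**32
    return total


# After reading 5 GB from a file whose trailer says 3466141388 bytes:
print(corrected_total(5 * 10**9, 3466141388))  # 7761108684
```

A progress bar would call this on every update and use the returned value as its new maximum.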

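Note: using gzip --list <filename> to get the uncompressed size is not an option for me, because:

Prior to version 1.12, the command ran quickly but reported the wrong uncompressed size (it only read the last 4 bytes); and
Version 1.12 fixed this bug by reading the whole file (just to print the uncompressed size!), which is not an option because the files are big and it takes a lot of time. From the 1.12 release notes:

'gzip -l' no longer misreports file lengths 4 GiB and larger. Previously, 'gzip -l' output the 32-bit value stored in the gzip header even though that is the uncompressed length modulo 2**32. Now, 'gzip -l' calculates the uncompressed length by decompressing the data and counting the resulting bytes. Although this can take much more time, nowadays the correctness pros seem to outweigh the performance cons.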

Question 3

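As the other answers mention, this is not possible without reading the file. The only exception I can think of is when the compressed file itself is smaller than 4 GiB / 1032 = 3.97 MiB, because only then can we be sure that the uncompressed size does not overflow the 32-bit "size" stored in the gzip footer. 1032 is the maximum compression ratio.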
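
That bound is easy to check before trusting the footer. A minimal sketch, assuming the 1032:1 maximum ratio quoted above and a single-member gzip file; `MAX_COMPRESSION_RATIO` and `footer_size_cannot_overflow` are illustrative names.

```python
MAX_COMPRESSION_RATIO = 1032  # maximum compression ratio quoted above


def footer_size_cannot_overflow(compressed_size):
    """True if even maximal compression keeps the uncompressed size below 2**32."""
    return compressed_size * MAX_COMPRESSION_RATIO < 2**32


# The threshold in MiB, matching the 3.97 MiB figure above.
print(round(2**32 / MAX_COMPRESSION_RATIO / 2**20, 2))  # 3.97
```

In practice the argument would come from something like `os.path.getsize("file.gz")`; when the function returns True, the 4-byte value at the end of the file can be used as the exact size.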

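Alternatively, you can use multiple threads to speed up the decompression and thereby the counting. For this purpose, I have written rapidgzip. It is available on PyPI but can also be built from source: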

python3 -m pip install --user rapidgzip

Or, for the most recent version, which I used for these benchmarks:

git clone https://github.com/mxmlnkn/rapidgzip
cd rapidgzip && mkdir build && cd build
cmake .. && make rapidgzip

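Benchmark results on a Ryzen 3900X (12 physical cores / 24 virtual cores):

| Decoder | Lines | Runtime / s | Bandwidth / (MB/s) |
|---|---|---|---|
| rapidgzip --count | 4294967296 | 0.589 | 7292 |
| rapidgzip -c \| wc -c | 4294967296 | 1.279 | 3358 |
| igzip | 4294967296 | 9.088 | 473 |
| pigz | 4294967296 | 13.230 | 325 |
| gzip | 4294967296 | 22.167 | 194 |

This requires that the file resides on a very fast SSD or is cached in memory. If the file resides on a spinning disk, I/O becomes the bottleneck first. igzip is already faster than most HDDs without even being fully parallelized.

Benchmark script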

sudo apt install pigz isal
python3 -m pip install --user --upgrade rapidgzip
# Create a compressible random file
base64 /dev/urandom | head -c $(( 4 * 1024 * 1024 * 1024 )) > 4GiB-base64
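# Keep the original (its size is needed below) and write the compressed copy next to it
gzip -c 4GiB-base64 > 4GiB-base64.gz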

fileSize=$( stat -L --format=%s 4GiB-base64 )
printf '\n| %7s | %8s | %10s | %18s |\n' Decoder Lines 'Runtime / s' 'Bandwidth / (MB/s)'
printf -- '|---------|----------|-------------|--------------------|\n'

countedBytes=$( src/tools/rapidgzip --count "4GiB-base64.gz" )
runtime=$( ( time src/tools/rapidgzip --count "4GiB-base64.gz" ) 2>&1 | sed -n -E 's|real[ \t]*0m||p' | sed 's|[ \ts]||' )
bandwidth=$( python3 -c "print( int( round( $fileSize / 1e6 / $runtime ) ) )" )
printf '| %7s | %8s | %11s | %18s |\n' "rapidgzip --count" "$countedBytes" "$runtime" "$bandwidth"

for tool in src/tools/rapidgzip igzip pigz gzip; do
    countedBytes=$( $tool -d -c "4GiB-base64.gz" | wc -c )
    runtime=$( ( time $tool -d -c "4GiB-base64.gz" | wc -c ) 2>&1 | sed -n -E 's|real[ \t]*0m||p' | sed 's|[ \ts]||' )
    bandwidth=$( python3 -c "print( int( round( $fileSize / 1e6 / $runtime ) ) )" )
    printf '| %7s | %8s | %11s | %18s |\n' "$tool" "$countedBytes" "$runtime" "$bandwidth"
done
