Deconcatenating gzip-compressed files

Question 1

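When several files are gzipped into a single file, gzip creates a file containing multiple gzip streams, exactly as if you had compressed the files separately and then concatenated the results.

This behavior is documented in the man page:

-c --stdout --to-stdout

Write output on standard output; keep original files unchanged. If there are several input files, the output consists of a sequence of independently compressed members.

This means that each source file gets its own gzip header (which contains the original file name). So in principle the members can be separated again on decompression.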
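To make the "sequence of independently compressed members" concrete, here is a minimal Python sketch. It assumes that `gzip.compress` is an acceptable stand-in for the `gzip` command line tool (unlike the tool, it does not store a file name in the header), and the sample contents are made up for illustration.

```python
import gzip

# Compress two "files" separately, then concatenate the results.
first = gzip.compress(b"contents of file1\n")
second = gzip.compress(b"contents of file2\n")
combined = first + second  # same member layout as `gzip -c file1 file2`

# Every member starts with its own header (magic bytes 1f 8b).
assert combined.startswith(b"\x1f\x8b")
assert combined[len(first):].startswith(b"\x1f\x8b")

# A standard decompressor treats all members as one continuous stream,
# so the boundary between the two inputs is no longer visible.
print(gzip.decompress(combined))
```

The decompressed output is simply the two inputs glued together, which is why recovering the individual files needs extra work.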

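Unfortunately, the gzip developers chose not to support this in gunzip:

If you wish to create a single archive file with multiple members so that members can later be extracted independently, use an archiver such as tar or zip. [...] gzip is designed as a complement to tar, not as a replacement.

Deconcatenating the file is not trivial, because neither the gzip header nor the footer contains the length of the compressed data stream. This means that, to reliably find the start of the second stream, you need to decode the entire deflate data stream, which is already halfway to decompressing the whole thing.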
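The following sketch reads the fields that a gzip member does carry, following the layout in RFC 1952. The helper name `parse_header` and the sample name `log1` are assumptions made for illustration, not part of any tool mentioned here.

```python
import gzip
import io
import struct
import zlib

# FLG bits as defined in RFC 1952.
FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10

def parse_header(data, pos=0):
    """Return (name, mtime, offset of the deflate data) for the member at pos."""
    magic, method, flags, mtime, _xfl, _os = struct.unpack_from("<HBBIBB", data, pos)
    if magic != 0x8B1F or method != 8:
        raise ValueError("not a gzip member")
    pos += 10
    if flags & FEXTRA:                      # optional extra field: 2-byte length + data
        (xlen,) = struct.unpack_from("<H", data, pos)
        pos += 2 + xlen
    name = None
    if flags & FNAME:                       # optional zero-terminated original name
        end = data.index(b"\x00", pos)
        name = data[pos:end].decode("latin-1")
        pos = end + 1
    if flags & FCOMMENT:                    # optional zero-terminated comment
        pos = data.index(b"\x00", pos) + 1
    if flags & FHCRC:                       # optional 2-byte header checksum
        pos += 2
    return name, mtime, pos

# Build a one-member sample with a stored name and a fixed mtime.
payload = b"hello gzip\n" * 100
buf = io.BytesIO()
with gzip.GzipFile(filename="log1", mode="wb", fileobj=buf, mtime=1234) as f:
    f.write(payload)
member = buf.getvalue()

name, mtime, body_start = parse_header(member)
crc32, isize = struct.unpack("<II", member[-8:])  # footer: CRC-32 and ISIZE
print(name, mtime, body_start)
print(crc32 == zlib.crc32(payload), isize == len(payload))
```

The header yields the name and modification time, and the footer yields a checksum and ISIZE, the uncompressed size modulo 2^32. None of these says where the compressed data ends, so the footer can only be located by decoding the deflate stream.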

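As far as I know, there is no tool yet that merely skims over the data stream to find out where it ends, even though there is some research in this area aimed at quasi-random access to the contents of gzip files.

Fortunately, some programming libraries can decompress gzip streams individually, for example Perl's IO::Uncompress::Gunzip, as mentioned by Stéphane Chazelas in his answer, or Rust's flate2.

Finally, as a solution, I wrote the tool gunzip-split. It decompresses each file separately and can also deconcatenate the file. For the latter, it decompresses every member, recording the offsets at which the gzip streams start while discarding the decompressed output. This could be optimized further, but it works reasonably fast even for gigabyte-sized files.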

$ ./gunzip-split --help
gunzip-split 0.1.1
Uncompress concatenated gzip files back into separate files.

USAGE:
    gunzip-split [OPTIONS] <FILE>

ARGS:
    <FILE>    concatenated gzip input file

OPTIONS:
    -d, --decompress                      Decompressing all files (default)
    -f, --force                           Overwrite existing files
    -h, --help                            Print help information
    -l, --list-only                       List all contained files instead of decompressing
    -o, --output-directory <DIRECTORY>    Output directory for deconcatenated files
    -s, --split-only                      Split into multiple .gz files instead of decompressing
    -V, --version                         Print version information

$ ./gunzip-split -s -o ./out/ combined.gz
file_1: OK.
file_2: OK.

$ ls ./out
file_1.gz file_2.gz
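The offset-recording approach described above can be sketched in Python with `zlib.decompressobj`. This is not the gunzip-split implementation, only an illustration of the idea; the function name `member_offsets`, the chunk size, and the in-memory sample data are assumptions.

```python
import gzip
import zlib

def member_offsets(data, chunk_size=65536):
    """Return the byte offset at which each gzip member in data starts."""
    offsets = []
    pos = 0
    while pos < len(data):
        offsets.append(pos)
        d = zlib.decompressobj(wbits=31)   # 31 = expect a gzip header and footer
        while not d.eof:
            chunk = data[pos:pos + chunk_size]
            if not chunk:
                raise ValueError("truncated gzip member")
            d.decompress(chunk)            # decompressed output is discarded
            # Bytes after the end of the member are handed back in unused_data.
            pos += len(chunk) - len(d.unused_data)
    return offsets

# Two members of different sizes, concatenated as `gzip -c a b` would do.
first = gzip.compress(b"a" * 1000)
second = gzip.compress(b"bb" * 5000)
combined = first + second

offsets = member_offsets(combined)
bounds = offsets[1:] + [len(combined)]
parts = [combined[a:b] for a, b in zip(offsets, bounds)]
print(offsets, [len(p) for p in parts])
```

Each element of `parts` is a valid standalone .gz file, which corresponds to what the `--split-only` mode produces. For real files the data would be read from disk in chunks rather than held in memory.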

Answer

Question 2

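As it happens, in gzip -c file1 file2 > result, gzip does create two distinct compressed streams, one for each file, and even stores the file name and modification time of each file.

It doesn't let you use that information when decompressing, but you can use perl's IO::Uncompress::Gunzip module to do so. For example: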

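#! /usr/bin/perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);

# Read the concatenated gzip data from standard input
my $z = IO::Uncompress::Gunzip->new("-")
  or die "can't read gzip data: $GunzipError";

do {
  # Each member's header carries the original file name and mtime
  my $h = $z->getHeaderInfo() or die "can't get headerinfo";
  open my $out, ">", $h->{Name} or die "can't open $h->{Name} for writing";
  my $buf;
  print $out $buf while $z->read($buf) > 0;
  close $out;
  utime(undef, $h->{Time}, $h->{Name}) or warn "can't update $h->{Name}'s mtime";
} while $z->nextStream == 1;  # 1: another member follows, 0: end of input, -1: error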

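Call that script as that-script < exlogs.gz and it will restore the files in the current working directory with their original names and modification times (without the sub-second part, which gzip does not store).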

Answer

Question 3

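This is a bit convoluted, but it works when the following requirements are met:

merged.gz is a mix of plain ASCII data and gzipped files
It stems from an operation like cat log0 log1.gz log2.gz log3 log4.gz > merged.gz
Lines in the plain ASCII files consist of printable characters only
The magic bytes of the gzipped files are intact (1F 8B in hex)

Most of the programs used should be available; sponge (from moreutils) can be avoided by writing to a temporary file manually.

What is done:

Put lines consisting exclusively of printable characters into one file per consecutive block. Note that if you merged two plain ASCII files back to back, this will not separate them (use the logs' timestamps to split the files in that case), and the original file names are lost
Put the other lines into an intermediate gz_only.gz file
Use the magic bytes to separate the files

The last step uses csplit, which can only split where there is also a newline, so a newline is introduced before splitting and removed again afterwards. The script currently assumes that the merged file contains no more than 1000 gzipped files.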

#!/bin/bash

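#lines with printable characters go to separate files for each consecutive block
#(oldi remembers the index last written to, so i is bumped once per binary run)
awk '{ if ($0 ~ /^[[:print:]]+$/) { print > ("file_" (i+0)); oldi=i }
       else { if (oldi==i) { i++ } } }' merged.gz

#get lines with non-printables to other merged file
#(same pattern as in awk above, so each line ends up in exactly one place)
grep -avE '^[[:print:]]+$' merged.gz > gz_only.gz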

#split into files and remember their count
#sed introduces newline before magic bytes
#csplit splits on occurrence of magic bytes and returns info on file lengths
nfiles=$( sed "s/$(printf '\x1f\x8b')/\n&/g" gz_only.gz |
          csplit - -z "/$(printf '\x1f\x8b')/" '{*}' -b'%03d.gz' |
          wc -l )

#first file is empty, due to introduced newline
rm -fv xx000.gz

#for all other remove newline
#note: the above grep introduced a newline to the last file
#if splitting is done for a file only concatenated from
#gz-files (no previous grep), the last file would have to
#be excluded from this operation.
for (( i=1 ; i<nfiles ; i++ )) ; do
    name=xx$(printf '%03d.gz' "$i")
    head -c -1 "$name" | sponge "$name"
done

#retrieve original file name
for f in xx*gz ; do
    #this is ready for simple filenames like the suggested logs,
    #e.g. no " as file name character
    mv "$f" "$(file "$f" | awk -F'"' '{print $2}').gz"
done

#unzip files
find -name '*gz' ! -name gz_only.gz ! -name merged.gz -exec gunzip {} +

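I have a feeling that both the separation of ASCII from non-ASCII data and the splitting could be done more elegantly in perl, but I'm not familiar with it.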
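For comparison, here is a Python sketch of a different technique from the csplit-based script: instead of splitting on every occurrence of the magic bytes, it tries to decode a complete gzip member at each candidate position and accepts only positions where that succeeds. The function name `split_mixed` and the sample data are assumptions, and the whole input is held in memory, so this is only suitable as an illustration.

```python
import gzip
import zlib

MAGIC = b"\x1f\x8b\x08"  # gzip magic bytes followed by the deflate method byte

def split_mixed(data):
    """Split data into ("text", bytes) and ("gzip", bytes) segments, in order."""
    segments = []
    pos = text_start = 0
    while True:
        hit = data.find(MAGIC, pos)
        if hit < 0:
            break
        d = zlib.decompressobj(wbits=31)   # 31 = expect a gzip header and footer
        try:
            d.decompress(data[hit:])
        except zlib.error:
            pos = hit + 1                  # false positive, keep scanning
            continue
        if not d.eof:
            pos = hit + 1                  # incomplete member, treat as data
            continue
        end = len(data) - len(d.unused_data)
        if hit > text_start:
            segments.append(("text", data[text_start:hit]))
        segments.append(("gzip", data[hit:end]))
        pos = text_start = end
    if text_start < len(data):
        segments.append(("text", data[text_start:]))
    return segments

# Mimics: cat log0 log1.gz log2.gz log3 log4.gz > merged.gz
merged = (b"log0 line\n" + gzip.compress(b"log1\n") + gzip.compress(b"log2\n")
          + b"log3 line\n" + gzip.compress(b"log4\n"))
segments = split_mixed(merged)
for kind, chunk in segments:
    print(kind, gzip.decompress(chunk) if kind == "gzip" else chunk)
```

Because every gzip segment is verified by its own checksum while decoding, a stray 1F 8B pair inside compressed data does not cause a wrong split, and no newline has to be added and removed. As with the shell script, two plain text files that were concatenated back to back still come out as a single text segment.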

Answer