如何重新压缩 200 万个 gzip 文件而不存储两次？

Question 1

一个选择可能是使用avfs（这里假设是 GNU 系统）：

mkdir ~/AVFS &&
avfsd ~/AVFS &&
cd ~/AVFS/where/your/gz/files/are/ &&
find . -name '*.gz' -type f -printf '%p#\0' |
  tar --null -T - --transform='s/.gz#$//' -cf - | pigz > /dest/file.tar.gz

Answer

一个选择可能是使用avfs（这里假设是 GNU 系统）：

mkdir ~/AVFS &&
avfsd ~/AVFS &&
cd ~/AVFS/where/your/gz/files/are/ &&
find . -name '*.gz' -type f -printf '%p#\0' |
  tar --null -T - --transform='s/.gz#$//' -cf - | pigz > /dest/file.tar.gz

Question 2

请注意，当涉及到令人讨厌的文件名时，这是脆弱的。

dir_with_small_files=/home/john/files
tmpdir=/tmp/ul/dst
tarfile=/tmp/ul.tar
mkfifo "${tarfile}"

gzip <"${tarfile}" >"${tarfile}.gz" &

find "$dir_with_small_files" -type f | \
while read src; do
    dstdir="${tmpdir}/$(dirname $src)"
    dst="$(basename $src .gz)"
    mkdir -p "$dstdir"
    gunzip <"$src" >"${dstdir}/${dst}"
    # rm "$src" # uncomment to remove the original files
    echo "${dstdir}/${dst}"
done | \
cpio --create --format=ustar -v --quiet 2>&1 >"${tarfile}" | \
while read x; do
    rm "$x"
done

# clean-up
rm "$tarfile"
rm -r "$tmpdir"

文件在下暂时解压缩，一旦添加到存档中，就会立即$tmpdir传递到then 并删除。cpio

Answer

请注意，当涉及到令人讨厌的文件名时，这是脆弱的。

dir_with_small_files=/home/john/files
tmpdir=/tmp/ul/dst
tarfile=/tmp/ul.tar
mkfifo "${tarfile}"

gzip <"${tarfile}" >"${tarfile}.gz" &

find "$dir_with_small_files" -type f | \
while read src; do
    dstdir="${tmpdir}/$(dirname $src)"
    dst="$(basename $src .gz)"
    mkdir -p "$dstdir"
    gunzip <"$src" >"${dstdir}/${dst}"
    # rm "$src" # uncomment to remove the original files
    echo "${dstdir}/${dst}"
done | \
cpio --create --format=ustar -v --quiet 2>&1 >"${tarfile}" | \
while read x; do
    rm "$x"
done

# clean-up
rm "$tarfile"
rm -r "$tmpdir"

文件在下暂时解压缩，一旦添加到存档中，就会立即$tmpdir传递到then 并删除。cpio

Question 3

这是我到目前为止所尝试的 - 它似乎有效，但即使使用 PyPy 也非常慢：

#!/usr/bin/python

import tarfile
import os
import gzip
import sys
import cStringIO

tar = tarfile.open("/dev/stdout", "w|")
for name in sys.stdin:
    name = name[:-1]  # remove the trailing newline
    try:
        f = gzip.open(name)
        b = f.read()
        f.close()
    except IOError:
        f = open(name)
        b = f.read()
        f.close()
    # the [2:] there is to remove ./ from "find" output
    ti = tarfile.TarInfo(name[2:])
    ti.size = len(b)
    io = cStringIO.StringIO(b)
    tar.addfile(ti, io)
tar.close()

用法：find . | script.py | gzip > file.tar.gz

Answer

这是我到目前为止所尝试的 - 它似乎有效，但即使使用 PyPy 也非常慢：

#!/usr/bin/python

import tarfile
import os
import gzip
import sys
import cStringIO

tar = tarfile.open("/dev/stdout", "w|")
for name in sys.stdin:
    name = name[:-1]  # remove the trailing newline
    try:
        f = gzip.open(name)
        b = f.read()
        f.close()
    except IOError:
        f = open(name)
        b = f.read()
        f.close()
    # the [2:] there is to remove ./ from "find" output
    ti = tarfile.TarInfo(name[2:])
    ti.size = len(b)
    io = cStringIO.StringIO(b)
    tar.addfile(ti, io)
tar.close()

用法：find . | script.py | gzip > file.tar.gz

如何重新压缩 200 万个 gzip 文件而不存储两次？

答案1

答案2

答案3

相关内容