How to add a large file to an archive and delete it in parallel

Question 1

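An uncompressed tar archive of a single file consists of a header, the file data, and a trailer. So your main problem is how to prepend the 512-byte header to the start of the file. You can start by creating the desired result with only the header: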

tar cf - bigfile | dd count=1 >bigarchive.tar

Then copy the first 10GiB of the file. For simplicity, we assume your dd can read/write 1GiB at a time:

dd count=10 bs=1G if=bigfile >>bigarchive.tar

We now deallocate the copied data from the original file:

fallocate --punch-hole -o 0 -l 10GiB bigfile

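This replaces the data with sparse zeros that take up no room on the filesystem. Continue in this way, adding skip=10 to the next dd and then incrementing the fallocate starting offset to -o 10GiB. At the very end, append some NUL characters to pad out the final tar file.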
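The steps above can be sketched as a loop. This is only a sketch under assumptions that are not part of the answer: the function names tar_padding and archive_in_place are made up for illustration, it relies on GNU tar, GNU dd (for the skip_bytes and count_bytes flags), GNU stat and util-linux fallocate, the file name is short enough to fit in a single 512-byte header, and the final padding follows GNU tar's defaults (data rounded up to 512 bytes, two zero blocks, total rounded up to a 10240-byte record). Try it on a scratch copy first.

```shell
# Hypothetical helper: number of NUL bytes to append after the file data so
# that the archive ends the way GNU tar would end it by default.
tar_padding() {
    local size=$1 block=512 record=10240
    local data_pad=$(( (block - size % block) % block ))
    local total=$(( block + size + data_pad + 2 * block ))
    local record_pad=$(( (record - total % record) % record ))
    echo $(( data_pad + 2 * block + record_pad ))
}

# Hypothetical helper: move SRC into the single-member archive DST in pieces
# of CHUNK bytes (default 1 GiB), freeing each piece of SRC once it is copied.
archive_in_place() {
    local src=$1 dst=$2 chunk=${3:-1073741824}
    local size offset=0 len
    size=$(stat -c %s "$src") || return 1

    # Keep only the 512-byte header that tar would write for this file.
    tar cf - "$src" | dd count=1 2>/dev/null >"$dst"

    while [ "$offset" -lt "$size" ]; do
        len=$(( size - offset < chunk ? size - offset : chunk ))
        # Copy bytes [offset, offset + len) to the end of the archive.
        dd if="$src" bs=1M iflag=skip_bytes,count_bytes \
           skip="$offset" count="$len" 2>/dev/null >>"$dst" || return 1
        # Release the blocks that were just copied; the file size stays the same.
        fallocate --punch-hole -o "$offset" -l "$len" "$src" || return 1
        offset=$(( offset + len ))
    done

    # Trailer: NUL bytes to finish the archive.
    head -c "$(tar_padding "$size")" /dev/zero >>"$dst"
}
```

Usage would be something like archive_in_place bigfile bigarchive.tar, after which tar tvf bigarchive.tar should list the file and the now fully sparse bigfile can be removed. Nothing here recovers from an interruption halfway through.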

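If your filesystem does not support fallocate, you can do something similar, but starting at the end of the file. First copy the last 10GiB of the file to a file called, say, part8. Then use the truncate command to reduce the size of the original file. Continue similarly until you have 8 files, each of 10GiB. You can then concatenate the header and part1 to bigarchive.tar, then remove part1, then concatenate part2 and remove it, and so on.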
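A rough sketch of this fallback, assuming GNU stat and truncate. The function names split_from_end and join_parts and the part prefix written to the current directory are illustrative choices, not anything prescribed above. Because the source is truncated after every copy, "from this offset to the end" is always exactly one chunk.

```shell
# Hypothetical helper: split SRC into part1..partN of CHUNK bytes each,
# working backwards from the end so SRC shrinks as the parts appear.
# Prints N.
split_from_end() {
    local src=$1 chunk=$2
    local size n i offset
    size=$(stat -c %s "$src") || return 1
    n=$(( (size + chunk - 1) / chunk ))
    i=$n
    while [ "$i" -ge 1 ]; do
        offset=$(( (i - 1) * chunk ))
        # tail -c +K starts output at byte K (1-based), so this copies the
        # last remaining chunk of the file.
        tail -c +"$(( offset + 1 ))" "$src" >"part$i" || return 1
        # Drop the chunk that was just copied.
        truncate -s "$offset" "$src" || return 1
        i=$(( i - 1 ))
    done
    echo "$n"
}

# Hypothetical helper: append part1..partN to DST, deleting each part as
# soon as it has been appended.
join_parts() {
    local dst=$1 n=$2 i=1
    while [ "$i" -le "$n" ]; do
        cat "part$i" >>"$dst" && rm "part$i" || return 1
        i=$(( i + 1 ))
    done
}
```

With the header already written to bigarchive.tar, something like n=$(split_from_end bigfile 10737418240) followed by join_parts bigarchive.tar "$n" would rebuild the data portion, leaving only the trailing NUL padding to append.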

Question 2

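Deleting a file does not necessarily do what you think it does. That is why, on UNIX-like systems, the system call is named unlink and not delete. From the manual page: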

unlink() deletes a name from the filesystem.  If that name was the last
link to a file and no processes have the file open, the file is deleted
and the space it was using is made available for reuse.

If the name was the last link to a file but any processes still have
the file open, the file will remain in existence until the last file
descriptor referring to it is closed.

So as long as the data compressor/archiver is reading from the file, the file remains in existence and keeps occupying space in the filesystem.
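The behaviour described in the manual page is easy to observe from a shell. In this small sketch (the temporary file and the choice of descriptor 3 are arbitrary), the name is removed while a descriptor is still open, and the data stays readable through that descriptor:

```shell
tmp=$(mktemp)
echo "still here" >"$tmp"

exec 3<"$tmp"   # keep the file open on descriptor 3
rm "$tmp"       # unlink the only name

if [ ! -e "$tmp" ]; then
    echo "name is gone"
fi

# The data is still reachable through the open descriptor, so its blocks
# have not been released yet.
content=$(cat <&3)
echo "read via fd 3: $content"

exec 3<&-       # closing the last descriptor finally frees the space
```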

Question 3

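How do I delete a file while it is being added to an archive?

Given the context, I will interpret this question as:

How to remove data from disk immediately after it has been read, before the whole file has been read, so that there is enough space for the transformed file.

The transformation can be anything you want to do with the data: compressing, encrypting, and so on.

The answer is this: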

<$file gzip | dd bs=$buffer iflag=fullblock of=$file conv=notrunc

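In short: read the data, throw it into gzip (or whatever you want to do with it), buffer the output so we are sure to read more than we write, and write it back into the file. Here is a prettier version that also shows output while running: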

cat "$file" \
| pv -cN 'bytes read from file' \
| gzip \
| pv -cN 'bytes received from compressor' \
| dd bs=$buffer iflag=fullblock 2>/dev/null \
| pv -cN 'bytes written back to file' \
| dd of="$file" conv=notrunc 2>/dev/null

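I will go through it line by line:

cat "$file" reads the file you want to compress. It is a useless use of cat (UUOC), since the next part, pv, can also read the file, but I find this prettier.

It pipes that into pv, which shows progress information (-cN tells it "use some kind of [c]ursor" and gives it a [N]ame).

That pipes into gzip, which obviously does the compression (reading from stdin, writing to stdout).

That pipes into another pv (pipe view).

That pipes into dd bs=$buffer iflag=fullblock. The $buffer variable is a number, something like 50 megabytes. It is however much RAM you want to dedicate to handling your file safely (as a data point, a 50MB buffer was fine for a 2GB file). The iflag=fullblock tells dd to read up to $buffer bytes before passing them on. At the beginning, gzip writes a header, so gzip's output lands in this dd line first. dd then waits until it has enough data before passing it on, which lets the input side read further ahead. Additionally, if there are incompressible parts, the output may be larger than the input. The buffer makes sure that, up to $buffer bytes, this is not a problem.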
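To see what iflag=fullblock changes, here is a toy sketch with GNU dd; the 6-byte block size and the 1-second pause are arbitrary values picked for the demonstration. The producer writes in two bursts, and only the fullblock variant waits for the whole block before passing anything on.

```shell
# The producer writes 3 bytes, pauses, then writes 3 more.
# dd is asked for exactly one 6-byte block.

# Without fullblock, dd passes on whatever the first read() returned,
# which is typically just the first burst.
without=$( { printf 'abc'; sleep 1; printf 'def'; } \
           | dd bs=6 count=1 2>/dev/null )

# With fullblock, dd keeps reading until the 6-byte block is complete.
with=$( { printf 'abc'; sleep 1; printf 'def'; } \
        | dd bs=6 count=1 iflag=fullblock 2>/dev/null )

echo "without fullblock: $without"
echo "with fullblock:    $with"
```

In the pipeline above the same mechanism is what holds back gzip's output until $buffer bytes have accumulated.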

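Then we go into another pipe view line and finally onto our output dd line. This line has of (output file) and conv=notrunc specified, where notrunc tells dd not to truncate (delete) the output file before writing. So if you have 500 bytes of A and you write 3 bytes of B, the file will be BBBAAAAA... (instead of being replaced by BBB).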
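The A/B example can be reproduced on a scratch file. This is a scaled-down sketch with 8 bytes of A instead of 500:

```shell
f=$(mktemp)

# With conv=notrunc only the first 3 bytes are overwritten.
printf 'AAAAAAAA' >"$f"
printf 'BBB' | dd of="$f" conv=notrunc 2>/dev/null
kept=$(cat "$f")

# Without it, dd truncates the file before writing.
printf 'AAAAAAAA' >"$f"
printf 'BBB' | dd of="$f" 2>/dev/null
truncated=$(cat "$f")

rm -f "$f"
echo "with notrunc:    $kept"        # BBBAAAAA
echo "without notrunc: $truncated"   # BBB
```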

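I did not cover the 2>/dev/null parts, and they are unnecessary. They just tidy up the output a little by suppressing dd's "I'm done and wrote this many bytes" message. The backslashes at the end of each line ( \) make bash treat the whole thing as one big command whose parts pipe into each other.

Here is a complete script for easier use. Anecdotally, I put it in a folder called "gz-in-place". I then realised the acronym I had made: GZIP: gnu zip in-place. So hereby I present GZIP.sh: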

#!/usr/bin/env bash

### Settings

# Buffer is how many bytes to buffer before writing back to the original file.
# It is meant to prevent the gzip header from overwriting data, and in case
# there are parts that are uncompressible where the compressor might exceed
# the original filesize. In these cases, the buffer will help prevent damage.
buffer=$((1024*1024*50)) # 50 MiB

# You will need something that can work in stream mode from stdin to stdout.
compressor="gzip"

# For gzip, you might want to pass -9 for better compression. The default is
# (typically?) 6.
compressorargs=""

### End of settings

# FYI I'm aware of the UUOC but it's prettier this way

if [ $# -ne 1 ] || [ "x$1" == "x-h" ] || [ "x$1" == "x--help" ]; then
    cat << EOF
Usage: $0 filename
Where 'filename' is the file to compress in-place.

NO GUARANTEES ARE GIVEN THAT THIS WILL WORK!
Only operate on data that you have backups of.
(But you always back up important data anyway, right?)

See the source for more settings, such as buffer size (more is safer) and
compression level.

The only non-standard dependency is pv, though you could take it out
with no adverse effects, other than having no info about progress.
EOF
    exit 1;
fi;

b=$(($buffer/1024/1024));
echo "Processing '$1' with ${b}MiB buffer...";
echo "Note: I have no means of detecting this, but if you see 'bytes written back to";
echo "file' exceed 'bytes read from file', your file is now garbage.";
echo "";

cat "$1" \
| pv -cN 'bytes read from file' \
| $compressor $compressorargs \
| pv -cN 'bytes received from compressor' \
| dd bs=$buffer iflag=fullblock 2>/dev/null \
| pv -cN 'bytes written back to file' \
| dd of="$1" conv=notrunc 2>/dev/null

echo "Done!";

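I feel like adding another buffering line before gzip, to prevent it from writing too far when the buffering dd line flushes, but with only a 50MiB buffer and 1900MB of /dev/urandom data it already seems to work anyway (the md5sums matched after decompressing). Good enough ratio for me.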
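A scaled-down version of that md5sum check could look like the sketch below. It is not the test described above: to stay small and predictable it uses compressible seq output instead of /dev/urandom data and a 1 MiB buffer, both arbitrary choices. Because of conv=notrunc the file keeps its original length, so the compressed stream is followed by leftover old bytes; gzip -dc will most likely warn about trailing garbage on stderr, which is silenced here.

```shell
f=$(mktemp)
seq 1 200000 >"$f"                # about 1.3 MB of compressible text
before=$(md5sum <"$f")

buffer=$((1024*1024))             # 1 MiB, larger than the compressed output
<"$f" gzip | dd bs=$buffer iflag=fullblock of="$f" conv=notrunc 2>/dev/null

# Decompress what is now at the start of the file and compare checksums.
after=$(gzip -dc <"$f" 2>/dev/null | md5sum)
rm -f "$f"

if [ "$before" = "$after" ]; then
    echo "round trip OK"
else
    echo "MISMATCH: data was damaged"
fi
```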

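Another improvement would be detecting when it writes too far, but I do not see how to do that without removing the beauty of the thing and creating a lot of complexity. At that point, you might as well turn it into a full-fledged python program that does everything properly (with failsafes to prevent data destruction).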
