Split a text file every 10,000 lines using awk

Answer 1

I would suggest doing all the housekeeping inside awk. This works with GNU awk:

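BEGIN { file = "1" }

# Pipe every input line into a gzip process that writes the current chunk.
{ print | "gzip -9 > " file ".gz" }

# After every 10000th line, close that pipe and move on to the next file.
NR % 10000 == 0 {
  close("gzip -9 > " file ".gz")
  file = file + 1
}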

This saves the first 10000 lines to 1.gz, the next 10000 lines to 2.gz, and so on. Use sprintf if you want more flexibility in generating the file names.
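As a rough sketch of the sprintf suggestion, the variant below zero-pads the chunk number so that the output files sort correctly by name. The part-%04d.gz pattern, the start_chunk helper, and the seq-generated sample input are assumptions made for illustration; they are not part of the original answer.

```shell
#!/bin/sh
# Sketch: same splitting idea, but the output name is built with sprintf.
# The "part-%04d.gz" pattern and the sample data are illustrative assumptions.
set -e
workdir=$(mktemp -d)
cd "$workdir"

seq 1 25000 > input.txt    # 25000 sample lines, one number per line

awk '
function start_chunk() {
  n++
  # Build the gzip command once so that print and close() use the same string.
  cmd = sprintf("gzip -9 > part-%04d.gz", n)
}
BEGIN { start_chunk() }
{ print | cmd }
NR % 10000 == 0 { close(cmd); start_chunk() }
' input.txt

ls part-*.gz
```

Keeping the command in one variable also avoids the easy mistake of passing close() a string that differs slightly from the one used with print, in which case the pipe would not be closed.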

Updated, with tests

The test data is a file containing the prime numbers below 300k.

wc -lc primes; md5sum primes

Output:

25997 196958 primes
547d527ec50c2799fa6ce96dba3c26c0  primes

Now, if you save the awk program above as split.awk and run it like this (using GNU awk):

awk -f split.awk primes

Three files are generated (1.gz, 2.gz and 3.gz). To test them, decompress the chunks into a single file:

for f in {1..3}; do gzip -dc $f.gz >> foo; done

Then compare it with the original:

diff primes foo

If the files are identical, the output should be empty.

The same check as at the start (line count, byte count and checksum), this time on the decompressed chunks:

gzip -dc [1-3].gz | tee >(wc -lc) >(md5sum) > /dev/null

Output:

25997  196958
547d527ec50c2799fa6ce96dba3c26c0  -

This shows that the contents are identical and that the file was split as expected.
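The checks above rely on there being only three chunks. With ten or more, a glob such as *.gz would list 10.gz before 2.gz, so a rough sketch of a round-trip test that walks the chunks in numeric order might look like this. The sample data from seq and the names input.txt and rejoined.txt are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: split sample data, then reassemble the chunks in numeric order and compare.
set -e
workdir=$(mktemp -d)
cd "$workdir"

seq 1 120000 > input.txt     # sample data: 12 chunks of 10000 lines

awk '
BEGIN { file = 1 }
{ print | "gzip -9 > " file ".gz" }
NR % 10000 == 0 { close("gzip -9 > " file ".gz"); file = file + 1 }
' input.txt

# Walk 1.gz, 2.gz, ... in numeric order until a chunk is missing.
i=1
while [ -f "$i.gz" ]; do
  gzip -dc "$i.gz"
  i=$((i + 1))
done > rejoined.txt

# cmp is silent and exits 0 when the two files are byte-identical.
cmp input.txt rejoined.txt && echo "round trip OK"
```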

Answer 2

The shorter (and more useful) answer is: have you looked at the Unix split command?

Answer 3

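The short answer is that awk reads its input (in this case, a pipe fed by zcat) one block at a time, where a block is 512 bytes or some multiple of that, depending on your operating system. So by the time the 10000th newline (the end-of-line marker) is in memory, line 10001, line 10002, and quite possibly many more (or possibly fewer) lines are in memory as well. This is a problem because those characters have already been read out of the pipe and are no longer available to the next awk invocation.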
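A small experiment makes the effect visible. In this sketch (the input size and the "stop after 3 lines" condition are arbitrary assumptions), awk exits after processing three lines and wc then counts whatever is still left in the same pipe. The exact figure depends on the awk implementation and its buffer size, but it should come out lower than the 99997 lines that a line-exact reader would have left behind.

```shell
#!/bin/sh
# Sketch: show that awk consumes more of a pipe than the lines it actually processes.
# awk stops after line 3; wc -l then counts what remains unread in the same pipe.
remaining=$(seq 1 100000 | { awk 'NR == 3 { exit }'; wc -l; } | tr -d ' ')

echo "lines left in the pipe for the next reader: $remaining"
echo "lines a line-exact reader would have left:  99997"
```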

Answer 4

As an alternative to awk, here is how to do this with GNU split or GNU parallel.

GNU split has a --filter option, and the manual describes something very close to what you are trying to do:

`--filter=COMMAND'
     With this option, rather than simply writing to each output file,
     write through a pipe to the specified shell COMMAND for each
     output file.  COMMAND should use the $FILE environment variable,
     which is set to a different output file name for each invocation
     of the command.  For example, imagine that you have a 1TiB
     compressed file that, if uncompressed, would be too large to
     reside on disk, yet you must split it into individually-compressed
     pieces of a more manageable size.  To do that, you might run this
     command:

          xz -dc BIG.xz | split -b200G --filter='xz > $FILE.xz' - big-

     Assuming a 10:1 compression ratio, that would create about fifty
     20GiB files with names `big-xaa.xz', `big-xab.xz', `big-xac.xz',
     etc.

So, in your case, you could do:

zcat bigfile.gz | split -l 10000 --filter='gzip -9 > $FILE.gz' - big-
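To try the --filter approach on something small first, a sketch along these lines may help. It assumes GNU coreutils split (the --filter option is not available in BSD or macOS split); the seq-generated input and the extra -d flag for numeric suffixes are illustrative choices rather than part of the command above.

```shell
#!/bin/sh
# Sketch: GNU split --filter on sample data (requires GNU coreutils split).
set -e
workdir=$(mktemp -d)
cd "$workdir"

seq 1 25000 | gzip > bigfile.gz     # stand-in for the real compressed input

# -l 10000: 10000 lines per chunk; -d: numeric suffixes (big-00, big-01, ...).
# Single quotes keep $FILE unexpanded so that the shell started by split sees it.
gzip -dc bigfile.gz | split -l 10000 -d --filter='gzip -9 > $FILE.gz' - big-

# Report how many lines ended up in each compressed chunk.
for f in big-*.gz; do
  printf '%s: %s lines\n' "$f" "$(gzip -dc "$f" | wc -l | tr -d ' ')"
done
```

Here gzip -dc is used in place of zcat only because it behaves the same way on more systems.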

A good alternative to split is GNU parallel, which lets you parallelize the compression:

zcat bigfile.gz | parallel --pipe -N 10000 'gzip > {#}.gz'
