Efficiently remove the header of a large file in place using sed?

Question 1

Try ed instead:

ed <<< $'1d\nwq' large_file

If "large" means about 10 million lines or more, better use tail. It cannot edit in place, but its performance makes that shortcoming forgivable:

tail -n +2 large_file > large_file.new
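A minimal sketch of wrapping the tail approach so that the original is only replaced once the copy has succeeded. The function name `strip_first_line` is made up for illustration; it assumes bash, `mktemp`, and enough free space for a second copy next to the file. The replacement ends up with the temporary file's owner and mode, not the original's.

```shell
#!/usr/bin/env bash
# strip_first_line FILE: hypothetical helper, not part of any existing tool.
strip_first_line() {
    local file=$1 tmp
    # Create the temp file next to the original so mv is a rename, not a copy.
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    if tail -n +2 -- "$file" > "$tmp"; then
        mv -f -- "$tmp" "$file"
    else
        rm -f -- "$tmp"
        return 1
    fi
}

# Usage on a throwaway file:
demo=$(mktemp)
printf 'header\nrow1\nrow2\n' > "$demo"
strip_first_line "$demo"
cat "$demo"
```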

Edit, to show some timing differences:

(awk code by Jaypal added so that all execution times come from the same machine (CPU 2.2GHz).)

bash-4.2$ seq 1000000 > bigfile.txt # further file creations skipped

bash-4.2$ time sed -i 1d bigfile.txt
time 0m4.318s

bash-4.2$ time ed -s <<< $'1d\nwq' bigfile.txt
time 0m0.533s

bash-4.2$ time perl -pi -e 'undef$_ if$.==1' bigfile.txt
time 0m0.626s

bash-4.2$ time { tail -n +2 bigfile.txt > bigfile.new && mv -f bigfile.new bigfile.txt; }
time 0m0.034s

bash-4.2$ time { awk 'NR>1 {print}' bigfile.txt > newfile.txt && mv -f newfile.txt bigfile.txt; }
time 0m0.328s


Question 2

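There is no way to efficiently remove content from the beginning of a file. Removing data from the start requires rewriting the whole file.

Truncating from the end of a file can be very quick, though (the OS only has to adjust the file size information and possibly free the now-unused blocks). That is generally not possible when you remove from the head of a file.

It could in theory be "fast" if you remove exactly a whole block/extent, but there is no portable system call for that, so you would have to rely on filesystem-specific semantics (if any exist). (Or on some kind of offset inside the first block/extent that marks the real start of the file, I guess. I have never heard of that either.)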
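To make "rewriting the whole file" concrete, here is a sketch that shifts the contents forward inside the same file and then cuts off the leftover bytes at the end. It still reads and writes every byte; it only avoids a second copy. Assumptions: bash, GNU `truncate`, and a file nobody else is writing to. An interruption midway leaves the file corrupted, so treat this as an illustration rather than a recommendation.

```shell
#!/usr/bin/env bash
f=$(mktemp)
printf 'header\nrow1\nrow2\n' > "$f"

first=$(head -n 1 "$f" | wc -c)   # bytes in the first line, newline included
size=$(wc -c < "$f")              # total bytes before the rewrite

# 1<> opens the file read-write WITHOUT truncating it, so tail overwrites
# the file from offset 0 while always reading ahead of its own writes.
tail -n +2 "$f" 1<> "$f"

# The last $first bytes are now stale leftovers; cut them off.
truncate -s "$((size - first))" "$f"
cat "$f"
```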
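For contrast, a sketch of the cheap direction: dropping the last line only needs the new length, not a rewrite. It assumes GNU coreutils `truncate`; the file contents are made up.

```shell
#!/usr/bin/env bash
f=$(mktemp)
printf 'row1\nrow2\ntrailer\n' > "$f"

last=$(tail -n 1 "$f" | wc -c)   # bytes in the last line, newline included

# A leading "-" tells GNU truncate to shrink the file by that many bytes.
truncate -s "-$((last))" "$f"
cat "$f"
```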

Question 3

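The most efficient method: don't do it! If you do, you need twice the "large" space on disk in any case, and you waste IO.

If you are stuck with a large file that you want to read without its first line, wait until you actually need to read it before removing that line. If you need to feed the file to a program on stdin, use tail to do it:

tail -n +2 large_file | your_program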

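When you need to read the file, you can take the opportunity to remove the first line, but only if you have the needed space on disk:

tail -n +2 large_file | tee large_file2 | your_program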
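A small sketch of the tee variant, with `wc -l` standing in for `your_program` and made-up file names in a temporary directory. The consumer sees the stream while the header-less copy is written in the same pass.

```shell
#!/usr/bin/env bash
dir=$(mktemp -d)
printf 'header\nrow1\nrow2\n' > "$dir/large_file"

# tee writes the header-less stream to large_file2 and passes it on unchanged;
# wc -l is only a stand-in for your_program.
count=$(tail -n +2 "$dir/large_file" | tee "$dir/large_file2" | wc -l)

echo "lines seen by the consumer: $((count))"
cat "$dir/large_file2"
```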

If the program cannot read from stdin, use a fifo:

mkfifo large_file_wo_1st_line
tail -n +2 large_file > large_file_wo_1st_line&
your_program -i large_file_wo_1st_line

Even better, if you are using bash, take advantage of process substitution:

your_program -i <(tail -n +2 large_file)
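The two variants side by side, as a sketch. `your_program` here is a hypothetical shell function that only accepts a file name via `-i`; the real program and its option are whatever you actually have. The fifo version needs explicit cleanup, which the process substitution version does for you.

```shell
#!/usr/bin/env bash
# Hypothetical consumer that only accepts a file name via -i.
your_program() { [ "$1" = "-i" ] && wc -l < "$2"; }

dir=$(mktemp -d)
printf 'header\nrow1\nrow2\n' > "$dir/large_file"

# FIFO variant: the writer runs in the background and blocks until the
# consumer opens the pipe for reading.
mkfifo "$dir/pipe"
tail -n +2 "$dir/large_file" > "$dir/pipe" &
via_fifo=$(your_program -i "$dir/pipe")
wait                 # reap the background tail
rm -f "$dir/pipe"    # the fifo is a real directory entry and must be removed

# Process substitution variant: bash creates and tears down the pipe itself.
via_procsub=$(your_program -i <(tail -n +2 "$dir/large_file"))

echo "$((via_fifo)) $((via_procsub))"
```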

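If you need to seek in the file, I do not see a better solution than not getting stuck with the file in the first place. If the file is produced on a program's stdout: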

large_file_generator | tail -n +2 > large_file

Otherwise, there is always the fifo or process substitution solution:

mkfifo large_file_with_1st_file
large_file_generator -o large_file_with_1st_file&
tail -n +2 large_file_with_1st_file > large_file_wo_1st_file

large_file_generator -o >(tail -n +2 > large_file_wo_1st_file)
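A sketch of the generator-side fifo, using a hypothetical `large_file_generator` function that insists on writing to a file named with `-o`. Running tail in the foreground means the output is complete as soon as tail returns.

```shell
#!/usr/bin/env bash
# Hypothetical generator that only writes to a file given with -o.
large_file_generator() { [ "$1" = "-o" ] && seq 1 5 > "$2"; }

dir=$(mktemp -d)
mkfifo "$dir/with_1st_line"

# The generator writes into the fifo in the background; tail drops line 1
# in the foreground, so the result is complete once tail returns.
large_file_generator -o "$dir/with_1st_line" &
tail -n +2 "$dir/with_1st_line" > "$dir/without_1st_line"
wait
rm -f "$dir/with_1st_line"

cat "$dir/without_1st_line"
```

With the `>(...)` form, bash does not wait for the substituted process when the generator exits, so the output file may still be growing for a moment after the command returns.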

Answer