Removing specific lines and duplicates from an 11 GB wordlist text file

Answer 1

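As you have already found out, you need sort -u to remove all duplicate lines. However, sort has no option to display progress.

You can work around that with a small script that reads the input file and writes it to standard output while printing progress to standard error. Here is an example: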

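#!/bin/bash
# Copy the file given as $1 to stdout while reporting progress on stderr.

set -e

export LC_ALL=C   # make read -N and ${#chunk} count bytes, not multibyte characters
bytes_read=0
byte_count=$(( $(wc -c < "$1") ))
chunk_size=500000

# read returns non-zero on the final short chunk, so keep looping while chunk is non-empty.
# -r keeps backslashes intact. Note that bash drops NUL bytes from the data it reads.
while IFS= read -r -N "$chunk_size" chunk || [ -n "$chunk" ]
do
    printf '%s' "$chunk"
    bytes_read=$(( bytes_read + ${#chunk} ))
    printf '\rRead %d of %d bytes [%d%%]' "$bytes_read" "$byte_count" "$(( 100 * bytes_read / byte_count ))" >&2
done < "$1"

echo >&2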

You can use this script as follows:

./script-name input-file | sort -u > output-file

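The progress covers only the reading of the input file. It does not account for the time sort needs to actually write the output, but that should be much less than the time needed to read the input. This should be the most efficient shell-based solution.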

Answer 2

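tee and tail -f are both good commands for following a file while it is being written, but neither will give you an ETA for the sort command. They also do not show what sort -u is doing behind the scenes; you only see the final output, which appears once most of the work is done.

Pipe the output through tee (which writes to both output_file and standard output):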

sort -u input_file | tee output_file

Or use tail -f:

sort -u input_file -o output_file &
tail -f output_file
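
One practical wrinkle with the tail -f variant: tail exits with an error if output_file does not exist yet, and it keeps running after sort has finished. A minimal sketch of one way around both, assuming GNU tail (the --pid option is a GNU extension) and assuming the input and output are different files. The function name sort_and_follow is made up for illustration:

```shell
#!/bin/bash
# Hypothetical helper: sort_and_follow INPUT OUTPUT
# Assumes INPUT and OUTPUT are different files and that tail is GNU tail.
sort_and_follow() {
    local input=$1 output=$2
    : > "$output"                      # create the output file so tail can open it
    sort -u "$input" -o "$output" &    # run sort in the background
    local sort_pid=$!
    # -n +1 prints from the first line; --pid makes tail stop once sort has exited
    tail -n +1 --pid="$sort_pid" -f "$output"
    wait "$sort_pid"                   # propagate sort's exit status
}
```

Called as sort_and_follow input_file output_file, it prints the sorted lines as they land in output_file and returns when sort is done.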

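Also: if your input is already sorted (as your question implies) and all you want is to remove adjacent duplicate lines, uniq is much faster than sort -u, and because uniq writes its output as it goes, tee or tail then really is an effective way of monitoring progress.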

uniq input_file | tee output_file
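
The difference between the two commands only shows up on unsorted input. A small sketch on a made-up four-line sample (the function names are for illustration only):

```shell
#!/bin/bash
# Made-up sample: "apple" appears three times, but only two of them are adjacent.
sample() { printf '%s\n' apple apple banana apple; }

# uniq drops only adjacent repeats, so the later "apple" survives.
uniq_only() { sample | uniq; }

# sort -u sorts first, so every repeat is removed (and the order may change).
sort_unique() { sample | LC_ALL=C sort -u; }

uniq_only     # prints apple, banana, apple (one per line)
sort_unique   # prints apple, banana (one per line)
```

So uniq alone is a safe replacement for sort -u only when the wordlist really is sorted already.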
