How to improve the performance of cat and xargs in a bash script

Answer 1

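So your current logic is: "For each line in 1m.txt, check whether it is already in advance.txt. If it is not, process it and add it to out.txt. When the job starts, update advance.txt with all the lines in out.txt."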

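The problem with this is that the more lines are added to advance.txt, the more lines each new line has to be compared against. In the worst case, when every line has already been processed, each of the million lines in 1m.txt has to be checked against advance.txt. On average each lookup scans half of the lines in advance.txt, so this takes roughly 1,000,000 * 500,000 = 500,000,000,000 (500 billion) comparisons.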
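The estimate can be checked directly in the shell. This is only a rough sketch: it assumes 64-bit shell arithmetic (as in bash on common platforms), and the variable names are illustrative rather than taken from the original script.

```shell
# Back-of-the-envelope cost of scanning a flat list once per input line.
# total_lines is the figure from the text; the names are illustrative.
total_lines=1000000
avg_scanned=$((total_lines / 2))             # half of the list on average
comparisons=$((total_lines * avg_scanned))   # one scan per input line
echo "$comparisons"
```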

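If you were not processing things in parallel, the straightforward fix would be to find the last line of out.txt and skip every line of 1m.txt up to and including that host. For example: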

 # Pipe the output of the if/then/else/fi construct to xargs;
 # the if/then/else/fi selects the input.
 # Use '-s' to check that the file exists and has non-zero size.
 if [ -s out.txt ]; then
    # We have some existing data.
    # Get the host from the last line:
    #   $!d                    delete every line except the last
    #   s/^\(DIE\|OK\) //      strip the DIE/OK prefix (\| is a GNU sed extension)
    #   s/[^0-9a-zA-Z]/\\&/g   backslash-escape anything that is not alphanumeric
    lasthost="$(sed '$!d;s/^\(DIE\|OK\) //;s/[^0-9a-zA-Z]/\\&/g' out.txt)"
    # Print the lines of 1m.txt that come after the matched host.
    # Uses the GNU sed extension that lets a range start at line "0".
    sed "0,/^$lasthost\$/d" 1m.txt
 else
    # No existing data, so just copy 1m.txt using cat.
    cat 1m.txt
 fi | xargs -I {} sh -c 'if host "$1" >/dev/null; then echo "OK $1"; else echo "DIE $1"; fi' sh {} >> out.txt
 # The host is passed to sh as "$1" instead of being spliced into the command string.
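Where GNU sed is not available, the same skip can be sketched with tail, cut, and awk, comparing the host as a literal string so that no regex escaping is needed. This is an alternative offered as an assumption, not the method shown above; skip_done is a hypothetical helper name, and the sample hosts and temporary files exist only for the demonstration.

```shell
# skip_done (hypothetical helper): print the lines of the input list that
# come after the last host recorded in the results file.
#   $1: results file with lines like "OK host" or "DIE host"
#   $2: input list, one host per line
skip_done() {
    # Take the last result line and drop the leading OK/DIE word.
    last_host=$(tail -n 1 "$1" | cut -d ' ' -f 2-)
    awk -v last="$last_host" '
        found { print }                        # already past the last host
        ($0 "") == (last "") { found = 1 }     # string comparison, no regex
    ' "$2"
}

# Demo with made-up data: a and b are done, so c and d remain.
demo_dir=$(mktemp -d)
printf '%s\n' a.example b.example c.example d.example > "$demo_dir/1m.txt"
printf '%s\n' 'OK a.example' 'DIE b.example' > "$demo_dir/out.txt"
skip_done "$demo_dir/out.txt" "$demo_dir/1m.txt"
rm -rf "$demo_dir"
```

Like the sed version, this only works when results are written in input order, which is why it does not suit a parallel run.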

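However, you are processing things in parallel. Because host can take a varying amount of time to return, the results can end up in a significantly different order from the input. You need a faster way to check whether a host has already been seen. The standard approach is some kind of hash table, and one way to get one is awk: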

 if [ -s out.txt ]; then
    # We have some existing data. Process the two files given:
    # for the first file (out.txt), set the seen entry for each host (field 2) to 1;
    # for the second file (1m.txt), print the hosts that have not been seen.
    awk 'FNR==NR {seen[$2]=1;next} seen[$1]!=1' out.txt 1m.txt
 else
    # No existing data, so just copy 1m.txt using cat.
    cat 1m.txt
 fi | xargs -I {} -P 100 sh -c 'if host "$1" >/dev/null; then echo "OK $1"; else echo "DIE $1"; fi' sh {} >> out.txt
 # The host is passed to sh as "$1" instead of being spliced into the command string.
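To see the filter on its own, here is a minimal sketch on made-up data where the results arrived out of order. pending_hosts is a hypothetical helper name and the hosts are placeholders. It tests membership with awk's in operator, a small variation on the command above that avoids creating an array entry for every host in the input list.

```shell
# pending_hosts (hypothetical helper): print hosts from the input list that
# have no result line yet, regardless of the order of the results.
#   $1: results file with lines like "OK host" or "DIE host"; it must be
#       non-empty, otherwise FNR == NR is also true for the second file
#   $2: input list, one host per line
pending_hosts() {
    awk '
        FNR == NR { seen[$2]; next }   # first file: remember each finished host
        !($1 in seen)                  # second file: print hosts not yet seen
    ' "$1" "$2"
}

# Demo: c finished before a, and b and d have not been processed yet.
demo_dir=$(mktemp -d)
printf '%s\n' a.example b.example c.example d.example > "$demo_dir/1m.txt"
printf '%s\n' 'OK c.example' 'DIE a.example' > "$demo_dir/out.txt"
pending_hosts "$demo_dir/out.txt" "$demo_dir/1m.txt"
rm -rf "$demo_dir"
```

Each lookup is a hash probe rather than a scan, so the whole pass reads out.txt and 1m.txt once each.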
