Running grep via GNU parallel

Answer 1

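grep is one of the most refined and time-tested tools where performance is concerned. For a comparison of grep with other text-processing tools on a very large file (1G+, more than 8 million lines), see https://askubuntu.com/a/1420653. Also, proper text processing (i.e. keeping a separate output per file with the correct line order) is not, in my opinion, a suitable task for parallel because, as you noticed, it mixes results from different files and changes their line order. Although you used parallel's -k option to keep the output in the same order as the input, that might only work as expected if: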

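You limit the parallel jobs to one, i.e. -j 1 (also known as --max-procs 1 or -P 1).
You make sure the text is passed in the correct order, e.g. by piping the actual text (in the correct order) to parallel and using its --pipe option to pass the text on to grep afterwards.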

However, that defeats your intended purpose of running multiple jobs in parallel, so the speed gain (if any) would be negligible.

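Moreover, a for loop requires a complete grep run for every argument/file in the loop head, with almost the same match pattern for each file. So it is probably not the best approach when you are trying to speed things up. In that case, you might be better off using e.g. grep's --recursive option.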
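
As a rough illustration of that suggestion, here is a minimal sketch of a single recursive grep replacing a per-file loop. The directory layout, file names, and the pattern 'error' are made up for the demo; only the grep options (--recursive, -H, -n) are real.

```shell
#!/usr/bin/env bash
# Hypothetical demo tree: directory names, file names and contents are invented.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/logs/sub"
printf 'alpha\nerror: disk full\n' > "$demo_dir/logs/a.log"
printf 'error: timeout\nbeta\n'    > "$demo_dir/logs/sub/b.log"

# One grep process walks the whole tree instead of one process per file.
# -H prefixes each match with its file name, -n with its line number.
recursive_matches=$(grep --recursive -H -n 'error' "$demo_dir/logs" | sort)
printf '%s\n' "$recursive_matches"

# Clean up the temporary tree; the matches stay available in the variable.
rm -rf "$demo_dir"
```

Each match line carries its own file name and line number, so the per-file line order can still be recovered from the single combined output.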

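However, you can run multiple jobs in the background by sending each grep call inside the for loop to the background and redirecting its output to a separate file, i.e. grep ... > file1 &, and later, if needed, combining the resulting output files into one. This runs multiple instances of grep in the background and can speed up the loop considerably. See the demonstration below.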

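For demonstration purposes, I will use (sleep N; echo "something" > fileN) & in place of grep ... > file1 &. The subshell syntax (...;...) is needed when you send several nested commands to the background, but not for a single command.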

$ # Creating some background jobs/processes
i=0
# Start from an empty PID array in case one is left over from an earlier run.
pids=()
for f in file1 file2 file3
  do
  # Start incrementing a counter to use in filenames and calculating sleep seconds.
  ((i++))
  # Send command/s to background
  (sleep $((10*i)); echo "$f $(date)" > "${f}_${i}") &
  # Add background PID to array
  pids+=( "$!" )
  done

# Output:
[1] 31335
[2] 31336
[3] 31338

$ # Monitoring and controlling the background jobs/processes
while sleep 5;
  do
  echo "Background PIDs are: ${pids[*]}"
  for index in "${!pids[@]}"
    do
    if kill -0 "${pids[index]}" &> /dev/null;
      then
      echo "${pids[index]} is running"
      # Do whatever you want here if the process is running ... e.g. kill "${pids[index]}" to kill that process.
      else
      echo "${pids[index]} is not running"
      unset 'pids[index]'
      # Do whatever you want here if the process is not running.
      fi
    done
  if [[ "${#pids[@]}" -eq 0 ]]
    then
    echo "Combined output files contents:"
    cat file*
    unset i
    unset pids
    break
    fi
  done

# Output:
Background PIDs are: 31335 31336 31338
31335 is running
31336 is running
31338 is running
[1]   Done                    ( sleep $((10*i)); echo "$f $(date)" > "${f}_${i}" )
Background PIDs are: 31335 31336 31338
31335 is not running
31336 is running
31338 is running
Background PIDs are: 31336 31338
31336 is running
31338 is running
[2]-  Done                    ( sleep $((10*i)); echo "$f $(date)" > "${f}_${i}" )
Background PIDs are: 31336 31338
31336 is not running
31338 is running
Background PIDs are: 31338
31338 is running
[3]+  Done                    ( sleep $((10*i)); echo "$f $(date)" > "${f}_${i}" )
Background PIDs are: 31338
31338 is not running
Combined output files contents:
file1 Fri Mar 31 12:20:47 AM +03 2023
file2 Fri Mar 31 12:20:57 AM +03 2023
file3 Fri Mar 31 12:21:07 AM +03 2023
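
The same pattern applied to grep itself might look like the following minimal sketch. The three input files, their contents, and the pattern 'needle' are invented for the demo, and the polling loop is swapped for the shell's wait builtin, which simply blocks until each background job has finished.

```shell
#!/usr/bin/env bash
# Hypothetical inputs: three small files and the pattern 'needle' are invented.
work_dir=$(mktemp -d)
printf 'hay\nneedle one\n'        > "$work_dir/file1"
printf 'needle two\nhay\n'        > "$work_dir/file2"
printf 'hay\nhay\nneedle three\n' > "$work_dir/file3"

pids=()
for f in "$work_dir"/file1 "$work_dir"/file2 "$work_dir"/file3
  do
  # A single command needs no subshell: redirect to a per-file output, then background it.
  grep -n 'needle' "$f" > "$f.out" &
  pids+=( "$!" )
  done

# Block until every background grep has exited instead of polling with kill -0.
# grep exits with 1 when nothing matched, so only statuses above 1 count as errors.
failed=0
for pid in "${pids[@]}"
  do
  status=0
  wait "$pid" || status=$?
  if [[ "$status" -gt 1 ]]
    then
    failed=1
    fi
  done

# Concatenate in a fixed file order, so each file's matches stay together and in line order.
combined=$(cat "$work_dir/file1.out" "$work_dir/file2.out" "$work_dir/file3.out")
printf '%s\n' "$combined"
rm -rf "$work_dir"
```

Because every job writes to its own output file, nothing is interleaved, and the final order is decided only by the order of the arguments to cat.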

See also Bash Job Control.

Answer 2

This is one of the GNU Parallel examples:

https://www.gnu.org/software/parallel/parallel_examples.html#example-parallel-grep

If you grep the same files again and again, this may also be useful: https://stackoverflow.com/a/11913999/363028
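
For orientation, a minimal sketch in the spirit of that linked example follows; the exact command on the linked page may differ. The demo files and the pattern STRING are invented, and the xargs branch is only a fallback assumption so that the sketch still runs on machines without GNU parallel.

```shell
#!/usr/bin/env bash
# Hypothetical demo files and pattern; only the parallel, xargs and grep options are real.
demo_dir=$(mktemp -d)
printf 'foo\nSTRING here\n'  > "$demo_dir/one.txt"
printf 'STRING there\nbar\n' > "$demo_dir/two.txt"

if parallel --version 2>/dev/null | grep -q 'GNU parallel'
  then
  # -k keeps the output in input order, -m packs many file names into each grep call,
  # -H and -n make grep prefix every match with its file name and line number.
  parallel_matches=$(find "$demo_dir" -type f | sort | parallel -k -m grep -H -n STRING {})
  else
  # Fallback (an assumption for this demo): a plain, non-parallel grep via xargs.
  parallel_matches=$(find "$demo_dir" -type f | sort | xargs grep -H -n STRING)
  fi
printf '%s\n' "$parallel_matches"
rm -rf "$demo_dir"
```

Passing -H matters here because a grep call that receives only one file name would otherwise omit the file name from its output.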
