How do I run sed on more than 10 million files in a directory?

Question 1

Give this a try:

find -name '*.txt' -print0 | xargs -0 -I {} -P 0 sed -i -e 's/blah/blee/g' {}

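It feeds only one filename to each invocation of sed, which solves the "too many args for sed" problem. The -P option should allow multiple processes to be forked at the same time. If 0 doesn't work (it is supposed to run as many processes as possible), try other numbers (10? 100? the number of cores you have?) to limit the count.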
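With -I {}, xargs starts one sed process per file. As a rough sketch of a variation (not part of the command above), dropping -I {} lets xargs pack many filenames into each sed call, while a fixed -P value caps the parallelism. This assumes GNU find, xargs and sed; the demo directory, the batch size of 1000 and the 4 workers are made-up values for illustration only.

```shell
# Demo in a throwaway directory; file names and counts are hypothetical.
demo_dir=$(mktemp -d)
for i in 1 2 3 4 5 6; do
    printf 'blah\n' > "$demo_dir/file $i.txt"
done

# -n 1000: at most 1000 filenames per sed invocation (illustrative value)
# -P 4:    at most 4 sed processes running at once (illustrative value)
find "$demo_dir" -name '*.txt' -print0 |
    xargs -0 -n 1000 -P 4 sed -i -e 's/blah/blee/g' --

cat "$demo_dir/file 3.txt"
```

Batching like this trades per-file process startup for fewer, larger sed runs; whether it helps in practice depends on the data and the disk.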

Question 2

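I've tested this method (and all the others) on 10 million (empty) files, named "hello 00000001" to "hello 10000000" (14 bytes per name).

Update: I've now included a quad-core run of the 'find | xargs' method (still without 'sed'; just echo >/dev/null).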

# Step 1. Build an array for 10 million files
#   * RAM usage approx:  1.5 GiB 
#   * Elapsed Time:  2 min 29 sec 
  names=( hello\ * )

# Step 2. Process the array.
#   * Elapsed Time:  7 min 43 sec
  for (( ix=0, cnt=${#names[@]} ; ix<$cnt; ix++ )) ; do echo "${names[ix]}" >/dev/null ; done
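The loop above only echoes each name. If the array approach were used for the real substitution, one possible way to avoid starting sed once per file would be to hand it slices of the array. The following is only a sketch under assumptions: it needs bash (for arrays) and GNU sed (for -i), and the demo directory and the batch size of 2 are hypothetical values chosen to keep the example small.

```shell
# Demo in a throwaway directory with five small files (hypothetical data).
array_demo=$(mktemp -d)
for i in 1 2 3 4 5; do
    printf 'blah blah\n' > "$array_demo/hello 0000000$i"
done

names=( "$array_demo"/hello\ * )   # same globbing step as above
batch=2                            # illustrative batch size; tune for real data

for (( ix=0, cnt=${#names[@]}; ix<cnt; ix+=batch )); do
    # "${names[@]:ix:batch}" expands to at most $batch array elements,
    # so sed is started once per batch instead of once per file.
    sed -i -e 's/blah/blee/g' -- "${names[@]:ix:batch}"
done
```

The batch size must stay small enough that one slice still fits on a single command line.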

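Here is a summary of how the provided answers fared when run against the test data described above. These results cover only the basic overhead; that is, 'sed' was not called. The sed step will almost certainly be the most time-consuming part, but I thought it would be interesting to see how the bare methods compare.

Dennis's 'find | xargs' method, using a single core, took 4 hours 21 minutes longer than the bash array method on a no-sed run. However, the multi-core advantage offered by the 'find | xargs' approach should outweigh the time differences shown here once sed is actually called to process the files.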

           | Time    | RAM GiB | Per loop action(s). / The command line. / Notes
-----------+---------+---------+----------------------------------------------------- 
Dennis     | 271 min | 1.7 GiB | * echo FILENAME >/dev/null
Williamson   cores: 1x2.66 GHz | $ time find -name 'hello *' -print0 | xargs -0 -I {} echo >/dev/null {}
                               | Note: I'm very surprised at how long this took to run the 10 million file gauntlet
                               |       It started processing almost immediately (because of xargs I suppose),
                               |       but it runs **significantly slower** than the only other working answer
                               |       (again, probably because of xargs), but if the multi-core feature works
                               |       and I would think that it does, then it could make up the deficit in a 'sed' run.
           |  76 min | 1.7 GiB | * echo FILENAME >/dev/null
             cores: 4x2.66 GHz | $ time find -name 'hello *' -print0 | xargs -0 -I {} -P 0 echo >/dev/null {}
                               |  
-----------+---------+---------+----------------------------------------------------- 
fred.bear  | 10m 12s | 1.5 GiB | * echo FILENAME >/dev/null
                               | $ time names=( hello\ * ) ; time for (( ix=0, cnt=${#names[@]} ; ix<$cnt; ix++ )) ; do echo "${names[ix]}" >/dev/null ; done
-----------+---------+---------+----------------------------------------------------- 
l0b0       | ?@#!!#  | 1.7 GiB | * echo FILENAME >/dev/null 
                               | $ time  while IFS= read -rd $'\0' path ; do echo "$path" >/dev/null ; done < <( find "$HOME/junkd" -type f -print0 )
                               | Note: It started processing filenames after 7 minutes.. at this point it  
                               |       started lots of disk thrashing.  'find' was using a lot of memory, 
                               |       but in its basic form, there was no obvious advantage... 
                               |       I pulled the plug after 20 minutes.. (my poor disk drive :(
-----------+---------+---------+----------------------------------------------------- 
intuited   | ?@#!!#  |         | * print line (to see when it actually starts processing, but it never got there!)
                               | $ ls -f hello * | xargs python -c '
                               |   import fileinput
                               |   for line in fileinput.input(inplace=True):
                               |       print line ' 
                               | Note: It failed at 11 min and approx 0.9 GiB
                               |       ERROR message: bash: /bin/ls: Argument list too long  
-----------+---------+---------+----------------------------------------------------- 
Reuben L.  | ?@#!!#  |         | * One var assignment per file
                               | $ ls | while read file; do x="$file" ; done 
                               | Note: It bombed out after 6min 44sec and approx 0.8 GiB
                               |       ERROR message: ls: memory exhausted
-----------+---------+---------+-----------------------------------------------------

Question 3

Another option is the completely safe find:

while IFS= read -rd $'\0' path
do
    file_path="$(readlink -fn -- "$path"; echo x)"
    file_path="${file_path%x}"
    sed -i -e 's/blah/blee/g' -- "$file_path"
done < <( find "$absolute_dir_path" -type f -print0 )
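The `; echo x` followed by `${file_path%x}` in the loop above may look odd. It appears to guard against command substitution stripping trailing newlines, which would corrupt a path that really ends in a newline. A small sketch of that effect, using a made-up filename rather than readlink:

```shell
# A hypothetical filename that ends in a newline character.
odd_name='odd name
'

# Plain command substitution strips the trailing newline.
plain="$(printf '%s' "$odd_name")"

# Appending a sentinel "x" protects the newline; the sentinel is removed afterwards.
guarded="$(printf '%s' "$odd_name"; echo x)"
guarded="${guarded%x}"

echo "plain length:   ${#plain}"
echo "guarded length: ${#guarded}"
```

The guarded variable keeps the full name, so the later sed call receives the path exactly as it exists on disk.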

Question 4

This is mostly off-topic, but you could use

find -maxdepth 1 -type f -name '*.txt' -print0 | xargs -0 python3 -c '
import fileinput
# Python 3 syntax; end="" keeps the newline that each line already carries.
for line in fileinput.input(inplace=True):
    print(line.replace("blah", "blee"), end="")
'

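The main benefit here (over ... xargs ... -I {} ... sed ...) is speed: you avoid invoking sed 10 million times. It would be faster still if you could avoid Python (since Python is relatively slow), so perl might be a better choice for this task. I'm not sure how to do the equivalent conveniently with perl.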

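The way this works is that xargs invokes Python with as many arguments as will fit on a single command line, and keeps doing so until it runs out of arguments (which are supplied by the find command). The number of arguments per invocation depends on the length of the filenames and on other factors, such as the size of the environment. The fileinput.input function yields successive lines from the files named in each invocation's arguments, and the inplace option tells it to "catch" the printed output and use it to replace each line.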
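The batching behaviour can be made visible on a tiny input. This sketch is only an illustration: it uses -n 4 to force small batches, whereas in the command above xargs chooses the batch size from the command-line size limit instead. Each invocation of the inner sh prints how many arguments it received.

```shell
# Ten arguments, at most four per invocation.
# The trailing "sh" fills $0 of the inner shell, so "$#" counts only the real arguments.
batches=$(printf '%s\n' 1 2 3 4 5 6 7 8 9 10 |
    xargs -n 4 sh -c 'echo "$#"' sh)

# One line per invocation: 4, 4, then 2
echo "$batches"
```

Three invocations handle ten arguments here; with 10 million filenames and the default limits, the same mechanism yields far fewer invocations than one per file.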

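Note that Python's string replace method doesn't use regular expressions; if you need those, you have to import re and print the result of re.sub("blah", "blee", line) instead. Those are Perl-style regular expressions, a considerably more powerful version of what you get with sed -r.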

Edit

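As akira mentions in the comments, the original version, which used a glob (ls -f *.txt) in place of the find command, wouldn't work because globs are processed by the shell (bash) itself. This means that before the command is even run, 10 million filenames are substituted into the command line. That is pretty much guaranteed to exceed the maximum size of a command's argument list. You can use xargs --show-limits for system-specific information on this.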
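To get a feel for the numbers, here is a rough sketch that compares the system limit with the space the test filenames mentioned earlier would need. It assumes getconf reports a numeric ARG_MAX, and it ignores per-argument pointer overhead and the environment, so the real requirement would be somewhat larger.

```shell
# Upper bound, in bytes, on the arguments plus environment of a new process.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX: $arg_max bytes"

# Rough estimate: 10 million names of 14 bytes each, plus a terminating NUL per name.
needed=$(( 10000000 * (14 + 1) ))
echo "Bytes needed for 10 million 14-byte names: $needed"

if [ "$needed" -gt "$arg_max" ]; then
    echo "A glob expanding to all names would not fit on one command line."
fi
```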

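xargs also takes the maximum size of the argument list into account and limits the number of arguments it passes to each invocation of Python accordingly. Since xargs will still have to invoke Python quite a few times, akira's suggestion to use os.path.walk (os.walk in Python 3, where os.path.walk no longer exists) to get the file listing will probably save you some time.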

Answer