Delete files that are not in a list of patterns

Answer 1

I suspect it would be simpler and faster to use GLOBIGNORE (assuming your shell is bash):

   GLOBIGNORE
          A colon-separated list of patterns defining the set of filenames
          to be ignored by pathname expansion.  If a filename matched by a
          pathname expansion pattern also matches one of the  patterns  in
          GLOBIGNORE, it is removed from the list of matches.

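So you only need to read the desired patterns from the database, append a * to each one to turn it into a glob, and convert the result to a colon-separated list: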

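# paste joins the rows, so the list stays colon-separated when the query returns several rows
GLOBIGNORE=$(sqlite3 database.sqlite3 'select images from cars_car;' |
             sed 's/|/*:/g; s/$/*/' | paste -sd: -)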
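
As a rough illustration of the string that pipeline is meant to produce, here is a small Python sketch. The function name build_globignore and the sample row are made up for the example, and it assumes each row is a |-separated list of image stubs, as the sed expression implies.

```python
def build_globignore(rows):
    """Turn rows like 'a|b' into a GLOBIGNORE-style string 'a*:b*'."""
    patterns = []
    for row in rows:
        for stub in row.split("|"):
            if stub:  # skip empty fields so a bare "*" is never emitted
                patterns.append(stub + "*")
    # GLOBIGNORE separates patterns with colons
    return ":".join(patterns)


# Hypothetical sample row, shaped like the output of the sqlite3 query
rows = ["5e1adcf7c9c1bcf8842c24f3bacbf169|5e2497180424aa0d5a61c42162b03fef"]
print(build_globignore(rows))
```

Skipping empty fields is a choice made for this sketch: a bare * in GLOBIGNORE would hide every file from the expansion.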

Then you can simply rm everything and reset GLOBIGNORE (or just close the current terminal):

rm * && GLOBIGNORE=""

Because GLOBIGNORE now looks like this:

$ echo $GLOBIGNORE 
5e1adcf7c9c1bcf8842c24f3bacbf169*:5e2497180424aa0d5a61c42162b03fef*

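Any file matching these globs is left out of the expansion of *. This has the added benefit of working with any kind of file name, including names with spaces, newlines or other odd characters.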
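
To see which names would be left for rm, the filtering can be approximated in Python with fnmatch. This is only a sketch: fnmatch is not bash's matcher, the helper name surviving_matches is invented here, and the last two file names are made-up examples.

```python
from fnmatch import fnmatchcase


def surviving_matches(names, globignore):
    """Return the names that '*' would still expand to for a given GLOBIGNORE value."""
    ignore = [p for p in globignore.split(":") if p]
    return [n for n in names if not any(fnmatchcase(n, p) for p in ignore)]


names = [
    "5e1adcf7c9c1bcf8842c24f3bacbf169.jpg",
    "5e1adcf7c9c1bcf8842c24f3bacbf169_tn.jpg",
    "old image.jpg",    # hypothetical name with a space
    "stale\nname.jpg",  # hypothetical name with a newline
]
globignore = "5e1adcf7c9c1bcf8842c24f3bacbf169*:5e2497180424aa0d5a61c42162b03fef*"
to_delete = surviving_matches(names, globignore)
print(to_delete)
```

Only the two names that do not start with a listed stub remain, and the space and newline cause no trouble because nothing is split on whitespace.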

Answer 2

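While writing this question I started playing around with grep. Part of the performance problem is that grep was running a large number of regular expression searches against every file name, and those are expensive.

With the -F option we can do fixed-string searches instead of regular expression searches.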

find | grep -vFf <(
    sqlite3 database.sqlite3 'select replace(images, CHAR(124), CHAR(10)) from cars_car'
) ### | xargs rm

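The output is the same and the run time is 0.045 seconds.
The old version took 14.211 seconds.

One of the problems with parsing ls is awkward file names. muru's comment below highlights a rather nice way of using null characters throughout the pipeline.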
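
For readers less familiar with -F, the filter amounts to a plain substring test with no regular expression engine involved. A minimal Python sketch of that idea follows; the function name grep_v_fixed and the sample paths are invented for illustration.

```python
def grep_v_fixed(lines, needles):
    """Keep the lines that contain none of the needles as a plain substring."""
    return [line for line in lines if not any(n in line for n in needles)]


# Hypothetical find output and one stub to keep
paths = ["./aaa.jpg", "./aaa_tn.jpg", "./bbb.jpg", "./bbb_tn.jpg"]
keep_stubs = ["aaa"]
candidates = grep_v_fixed(paths, keep_stubs)
print(candidates)
```

Here the two bbb files are the deletion candidates, since neither contains a stub from the keep list.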

find -print0 | grep -vzFf <(
    sqlite3 database.sqlite3 'select replace(images, CHAR(124), CHAR(10)) from cars_car'
) ### | xargs -0 rm

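The reason I haven't changed my main answer is that I know my file names are always clean, and I have been piping this through wc -l to make sure I see the right number of files to delete.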

Answer 3

If you use bash as your shell, shopt -s extglob enables more features in glob patterns. For example:

!(5e1adcf7c9c1bcf8842c24f3bacbf169*|5e2497180424aa0d5a61c42162b03fef*)

will match all names that do not start with either of the two strings.
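
With a long list of stubs, such a pattern would normally be generated rather than typed. One possible way to assemble it is sketched below in Python; the helper name extglob_negation is an assumption made for this example, and it assumes the stubs contain no glob metacharacters.

```python
def extglob_negation(stubs):
    """Build a bash extglob pattern matching names that start with none of the stubs."""
    # Each stub becomes "stub*"; alternatives inside !( ) are separated by "|"
    return "!(" + "|".join(stub + "*" for stub in stubs) + ")"


stubs = ["5e1adcf7c9c1bcf8842c24f3bacbf169", "5e2497180424aa0d5a61c42162b03fef"]
pattern = extglob_negation(stubs)
print(pattern)
```

The resulting string only has the intended meaning in a bash session where extglob is enabled.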

Answer 4

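The long-term solution I'm leaning towards is something tacked onto the end of the update script (Python/Django). I already have a list of Car objects there, so the database query is no longer needed, which makes it faster. It also runs at the exact moment the old images stop being useful.

I use a Python set because it is probably the fastest way to check membership. I add all the image stubs I want to keep to it, then iterate over the thumbnails (which are easier to iterate over) and delete the files that are not in the set.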

# Generate a Python set of the image stubs to keep
import itertools
imagehashes = set(itertools.chain(*map(lambda c: c.images.split('|'), cars)))

# Check which files aren't in the set and delete them
import glob, os
# basename drops the directory and [:-7] drops "_tn.jpg", so the slice does not depend on the path length
for imhash in map(lambda i: os.path.basename(i)[:-7], glob.glob('/path/to/images/*_tn.jpg')):
    if imhash in imagehashes:
        continue

    os.remove('/path/to/images/%s_tn.jpg' % imhash)
    os.remove('/path/to/images/%s.jpg' % imhash)

There are a few tricks with map and itertools to save some time, but it is mostly self-explanatory.
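
Before letting a loop like this delete anything, it can help to list what it would remove. The following is a self-contained dry-run sketch under assumptions: the function name stale_files and the file names are hypothetical, and it works in a temporary directory instead of the real image folder.

```python
import os
import tempfile


def stale_files(image_dir, keep_stubs):
    """List image files whose stub is not in keep_stubs, without deleting anything."""
    suffix = "_tn.jpg"
    stale = []
    for name in sorted(os.listdir(image_dir)):
        if not name.endswith(suffix):
            continue  # only thumbnails drive the loop
        stub = name[:-len(suffix)]
        if stub in keep_stubs:
            continue
        stale.append(name)
        full = stub + ".jpg"
        if os.path.exists(os.path.join(image_dir, full)):
            stale.append(full)
    return stale


with tempfile.TemporaryDirectory() as d:
    # Hypothetical files: "aaa" is still referenced, "bbb" is not
    for name in ["aaa.jpg", "aaa_tn.jpg", "bbb.jpg", "bbb_tn.jpg"]:
        open(os.path.join(d, name), "w").close()
    result = stale_files(d, {"aaa"})
    print(result)
```

Checking that the full-size image exists before listing it is an addition in this sketch, so a missing file would not raise an error once the list is handed to os.remove.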
