Running cmp from a bash script: "No such file or directory" for existing files

Question

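Your main issue is that you add literal double quotes to the filenames when calling cmp (while the filenames themselves are in fact left unquoted). This is why the files are not found: their names do not contain quotes. You also loop over the output of find, which is not ideal.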
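The effect of literal quotes can be reproduced in isolation. The following is a minimal sketch, not the script from the question: the directory, the file names, and the variable names are all made up for illustration. It assumes a cmp that returns an exit status greater than 1 when a file cannot be opened, which is what POSIX specifies.

```shell
#!/bin/bash

# Hypothetical demo files in a scratch directory.
workdir=$(mktemp -d)
printf 'same\n' >"$workdir/a file.txt"
printf 'same\n' >"$workdir/b file.txt"

# Wrong: the escaped quotes become part of the filename itself,
# and no file with such a name exists.
bad_a="\"$workdir/a file.txt\""
bad_b="\"$workdir/b file.txt\""
cmp -s "$bad_a" "$bad_b" 2>/dev/null
bad_status=$?      # greater than 1: cmp could not open the files

# Right: store the bare name and quote the expansion instead.
good_a="$workdir/a file.txt"
good_b="$workdir/b file.txt"
cmp -s "$good_a" "$good_b"
good_status=$?     # 0: the files are identical

printf 'literal quotes: %s, quoted expansion: %s\n' "$bad_status" "$good_status"

rm -rf "$workdir"
```

The quotes in `"$good_a"` are shell syntax and are removed before cmp sees the argument, whereas the escaped quotes stored in `bad_a` are data and are passed along as part of the name.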

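If you really don't want to use fdupes, you could do something like the following to try to make your approach a bit more efficient (speed-wise):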

#!/bin/bash

# enable the "**" globbing pattern,
# remove non-matched patterns rather than keeping them unexpanded, and
# allow the matching of hidden names:
shopt -s globstar nullglob dotglob

pathnames=("${1:-.}"/**)  # all pathnames beneath "$1" (or beneath "." if "$1" is empty)

# loop over the indexes of the list of pathnames
for i in "${!pathnames[@]}"; do
    this=${pathnames[i]}  # the current pathname

    # skip this if it's not a regular file (or a symbolic link to one)
    [[ ! -f "$this" ]] && continue

    # loop over the remainder of the list
    for that in "${pathnames[@]:i+1}"; do

        # skip that if it's not a regular file (or a symbolic link to one)
        [[ ! -f "$that" ]] && continue

        # compare and print if equal
        if [[ "$this" -ef "$that" ]] || cmp -s "$this" "$that"; then
            printf '"%s" and "%s" contain the same thing\n' "$this" "$that"
        fi
    done
done

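This avoids walking the whole directory structure once for each file (which you do in your inner loop) and also avoids comparing any pair more than once. It is still very slow, as it needs to run cmp on every combination of files in the whole directory hierarchy.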
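To get a feel for how quickly the number of comparisons grows, here is a small arithmetic sketch. The file count is a hypothetical figure, and the result is an upper bound, since hard-linked pairs are caught by the -ef test before cmp is run.

```shell
#!/bin/bash

# Number of unordered pairs among n files: n * (n - 1) / 2.
n=1000    # hypothetical number of regular files
pairs=$(( n * (n - 1) / 2 ))

printf '%d files -> up to %d cmp invocations\n' "$n" "$pairs"
```

For 1000 files this gives 499500 pairs, so the work grows roughly with the square of the file count.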

Instead, you may want to try a simpler approach:

#!/bin/bash

tmpfile=$(mktemp)

find "${1:-.}" -type f -exec md5sum {} + | sort -o "$tmpfile"
awk 'FNR == NR && seen[$1]++ { next } seen[$1] > 1' "$tmpfile" "$tmpfile"

rm -f "$tmpfile"

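This computes the MD5 checksums of all files, sorts that list, and saves it to a temporary file. awk is then used to extract the md5sum output for all files that have duplicates.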

The output will look like

$ bash ~/script.sh
01b1688f97f94776baae85d77b06048b  ./QA/StackExchange/.git/hooks/pre-commit.sample
01b1688f97f94776baae85d77b06048b  ./Repositories/password-store.git/hooks/pre-commit.sample
036208b4a1ab4a235d75c181e685e5a3  ./QA/StackExchange/.git/info/exclude
036208b4a1ab4a235d75c181e685e5a3  ./Repositories/password-store.git/info/exclude
054f9ffb8bfe04a599751cc757226dda  ./QA/StackExchange/.git/hooks/pre-applypatch.sample
054f9ffb8bfe04a599751cc757226dda  ./Repositories/password-store.git/hooks/pre-applypatch.sample
2b7ea5cee3c49ff53d41e00785eb974c  ./QA/StackExchange/.git/hooks/post-update.sample
2b7ea5cee3c49ff53d41e00785eb974c  ./Repositories/password-store.git/hooks/post-update.sample
3c5989301dd4b949dfa1f43738a22819  ./QA/StackExchange/.git/hooks/pre-push.sample
3c5989301dd4b949dfa1f43738a22819  ./Repositories/password-store.git/hooks/pre-push.sample

In the output above, there happen to be a number of pairs of duplicate files.
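To see how the two-pass awk command picks out the duplicates, it can be run on a handful of made-up lines. This is only an illustrative sketch: the "checksums" aaa, bbb and ccc and the pathnames are invented stand-ins for real md5sum output.

```shell
#!/bin/bash

tmpfile=$(mktemp)

# Invented stand-ins for md5sum output: checksum, two spaces, pathname.
printf '%s\n' \
    'aaa  ./one' \
    'aaa  ./two' \
    'bbb  ./three' \
    'ccc  ./four' \
    'ccc  ./five' | sort -o "$tmpfile"

# First pass (FNR == NR): count how often each checksum occurs.
# Second pass: print the lines whose checksum was counted more than once.
duplicates=$(awk 'FNR == NR && seen[$1]++ { next } seen[$1] > 1' "$tmpfile" "$tmpfile")

printf '%s\n' "$duplicates"

rm -f "$tmpfile"
```

The file is given to awk twice so that the counts from the first pass are complete before any line is printed in the second. The line for ./three is dropped because its checksum occurs only once.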

If your filenames contain embedded newlines, md5sum will output the line prefixed with a \ character:

$ touch $'hello\nworld'
$ md5sum *
\d41d8cd98f00b204e9800998ecf8427e  hello\nworld

To handle this properly (by removing the backslash at the start of the line), you may want to modify the first pipeline in the script to

find "${1:-.}" -type f -exec md5sum {} + | sed 's/^\\//' | sort -o "$tmpfile"
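The effect of the added sed step can be checked on a scratch directory. This is a small sketch that assumes GNU md5sum, which is the implementation that produces the backslash prefix shown above. The directory and the variable names are hypothetical.

```shell
#!/bin/bash

workdir=$(mktemp -d)
touch "$workdir"/$'hello\nworld'    # empty file with a newline in its name

# Without the sed step the checksum line starts with a backslash.
raw=$(find "$workdir" -type f -exec md5sum {} +)

# With the sed step the line starts with the checksum itself.
fixed=$(find "$workdir" -type f -exec md5sum {} + | sed 's/^\\//')

printf 'raw:   %s\nfixed: %s\n' "$raw" "$fixed"

rm -rf "$workdir"
```

Only the leading backslash is removed. The pathname part of the line still shows the newline in its escaped form, \n, which keeps each file on a single line for sort and awk.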

Answer 1