Given a file path, how do I find every copy of that file?

Answer 1

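A simple approach:

Get the md5sum of the target file and store it in a variable
Get the size of the target file and store it in a variable
Use find to run md5sum on all files of the same size
grep the output of find for our target MD5 hash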

target_hash=$(md5sum needle.file | awk '{ print $1 }')
target_size=$(du -b needle.file | awk '{ print $1 }')
find haystack/ -type f -size "$target_size"c -exec md5sum {} \; | grep "$target_hash"
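
As a rough sketch, the four steps above can be wrapped in a function that prints only the matching paths. The name find_copies_md5 and its two arguments are assumptions made for illustration, and wc -c is swapped in for du -b to read the size. It assumes GNU md5sum and file names without newlines or backslashes.

```shell
# find_copies_md5 NEEDLE HAYSTACK_DIR  (hypothetical helper, not part of the original answer)
find_copies_md5() {
    needle=$1
    haystack=$2

    # Step 1: hash of the target; reading from stdin keeps the file name out of the output
    target_hash=$(md5sum < "$needle" | awk '{ print $1 }')

    # Step 2: size of the target in bytes (wc -c used here instead of du -b)
    target_size=$(wc -c < "$needle" | tr -d ' ')

    # Steps 3 and 4: hash only files of the same size, keep lines that start
    # with the target hash, then drop the 32-character hash and the two
    # separator characters so that only the path remains
    find "$haystack" -type f -size "${target_size}c" -exec md5sum {} \; |
        grep "^$target_hash" |
        cut -c35-
}
```

It would be called as find_copies_md5 needle.file haystack/. If the needle itself lives under the searched directory, it is listed too.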

Answer 2

You can use Czkawka (available on Flathub and GitHub).

It is a nice GUI tool with advanced features such as checksum comparison.

Answer 3

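If the number of files is not large, say fewer than 1000, a bash script may be a good fit. Otherwise, executing binaries (md5sum, stat) in a loop will incur noticeable overhead.

File size matters too: with 1000 files of 1 GB each, the cost of launching the binaries is negligible compared with the time spent reading the data. With 1,000,000 files of 1 KB each, it is a different story.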
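
To get a feel for how many processes are started, here is a small sketch that counts launches for the same set of files. The helper names and the 50-file demo directory are made up for illustration; the point is only the difference between find's "-exec ... \;" (one process per file) and "-exec ... +" (file names passed in batches).

```shell
# Hypothetical helpers: count how many child processes find starts for one directory tree.
count_runs_per_file() {
    # "\;" runs the command once for every file found
    find "$1" -type f -exec sh -c 'echo started' sh {} \; | wc -l
}

count_runs_batched() {
    # "+" appends as many file names as fit onto a single command line
    find "$1" -type f -exec sh -c 'echo started' sh {} + | wc -l
}

# Demo directory with 50 small files
demo_dir=$(mktemp -d)
i=1
while [ "$i" -le 50 ]; do
    echo "$i" > "$demo_dir/file_$i"
    i=$((i + 1))
done

count_runs_per_file "$demo_dir"   # one launch per file
count_runs_batched "$demo_dir"    # a single launch covers all 50 short names
rm -r "$demo_dir"
```

With a million tiny files the first form pays the process start-up cost a million times, which is the overhead described above.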

Variant #1

Using md5sum.

find_dups_by_md5.sh

#!/bin/bash

get_size() {
    stat -c"%s" "$1"
}

get_hash() {
    md5sum "$1" | cut -d' ' -f1 
}

needle=$1
needle_size=$(get_size "$needle")
needle_hash=$(get_hash "$needle")

shopt -s globstar
GLOBIGNORE=$needle

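for f in **; do
    # The glob also matches directories; only regular files can be copies
    [[ -f "$f" ]] || continue

    # Cheap check first: a file of a different size cannot be a duplicate
    cur_file_size=$(get_size "$f")
    if [[ "$needle_size" != "$cur_file_size" ]]; then
        continue
    fi

    # Hash only the files that passed the size check
    cur_file_hash=$(get_hash "$f")
    if [[ "$needle_hash" != "$cur_file_hash" ]]; then
        continue
    fi

    # printf does not interpret backslashes inside the file name
    printf 'duplicate:\t%s\n' "$f"
done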

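Variant #2

Using cmp.

A simple byte-by-byte comparison is even better: less code, the same result, and it is even a bit faster. Computing a hash is redundant here because each hash is used only once. We run md5sum on every candidate file (and on the needle file), and md5sum by definition processes the whole file. So if we have 100 files of 1 gigabyte each, md5sum will read all 100 GB, even if the files already differ within the first kilobyte.

So, when each file is compared with the target exactly once, a byte-by-byte comparison takes the same time in the worst case (all files have identical content) and is faster when the contents differ, since it can stop at the first differing byte (assuming MD5 hashing takes about as long as a byte-by-byte comparison).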

find_dups_by_cmp.sh

#!/bin/bash

get_size() {
    stat -c"%s" "$1"
}

needle=$1
needle_size=$(get_size "$needle")

shopt -s globstar
GLOBIGNORE=$needle

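for f in **; do
    # The glob also matches directories; only regular files can be copies
    [[ -f "$f" ]] || continue

    # Cheap check first: a file of a different size cannot be a duplicate
    cur_file_size=$(get_size "$f")
    if [[ "$needle_size" != "$cur_file_size" ]]; then
        continue
    fi

    # cmp -s is silent and stops at the first differing byte
    if ! cmp -s "$needle" "$f"; then
        continue
    fi

    # printf does not interpret backslashes inside the file name
    printf 'duplicate:\t%s\n' "$f"
done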
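
The same size-then-cmp idea can probably also be expressed with find alone, which avoids the stat call per file. This is only a sketch: the function name find_copies_cmp and its arguments are assumptions, and wc -c is used to read the size. It relies on the fact that "-exec ... \;" acts as a test that succeeds when the command exits with status 0, so -print fires only for files that cmp reports as identical.

```shell
# find_copies_cmp NEEDLE HAYSTACK_DIR  (hypothetical helper)
find_copies_cmp() {
    needle=$1
    haystack=$2

    # Size of the needle in bytes, used as a cheap pre-filter
    needle_size=$(wc -c < "$needle" | tr -d ' ')

    # For every regular file of the same size, run cmp against the needle;
    # the path is printed only when cmp -s exits with status 0 (identical)
    find "$haystack" -type f -size "${needle_size}c" \
        -exec cmp -s "$needle" {} \; -print
}
```

It still starts one cmp process per same-size file, and it lists the needle itself if the needle is located under the searched directory.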

Testing

Generating the test files

###Generate test files
echo_random_bytes () {
    openssl rand -base64 100000;
}

shopt -s globstar

mkdir -p {a..d}/{e..g}/{m..o}

# Fill the directories with files containing random content
touch {a..d}/{e..g}/{m..o}/file_{1..5}.txt
for f in **; do
    [ -f "$f" ] && echo_random_bytes > "$f"
done

# Create the duplicates
same_string=$(echo_random_bytes)

touch {a..d}/{e..g}/m/dup_file.txt
for f in {a..d}/{e..g}/m/dup_file.txt; do
    echo "$same_string" > "$f"
done

# Create the target (needle) file
echo "$same_string" > needle_file.txt

Searching for duplicates

$ ./find_dups_by_md5.sh needle_file.txt
duplicate:  a/e/m/dup_file.txt
duplicate:  a/f/m/dup_file.txt
duplicate:  a/g/m/dup_file.txt
duplicate:  b/e/m/dup_file.txt
duplicate:  b/f/m/dup_file.txt
duplicate:  b/g/m/dup_file.txt
duplicate:  c/e/m/dup_file.txt
duplicate:  c/f/m/dup_file.txt
duplicate:  c/g/m/dup_file.txt
duplicate:  d/e/m/dup_file.txt
duplicate:  d/f/m/dup_file.txt
duplicate:  d/g/m/dup_file.txt

Performance comparison

$ time ./find_dups_by_md5.sh needle_file.txt > /dev/null

real    0m0,761s
user    0m0,809s
sys     0m0,169s

$ time ./find_dups_by_cmp.sh needle_file.txt > /dev/null

real    0m0,645s
user    0m0,526s
sys     0m0,162s


Answer 4

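Building on Panki's answer, this should reduce the number of md5sum invocations, which improves performance when there are thousands of files to check.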

target_hash="$(md5sum needle.file | awk '{ print $1 }')"
target_size="$(du -b needle.file | awk '{ print $1 }')"
find haystack/ -type f -size "$target_size"c -print0 | xargs -0 md5sum | grep "^$target_hash"

Note: as with the original, the output may be displayed incorrectly if file names contain newlines.
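
If newline-safe output matters, one possible workaround is to keep everything NUL-delimited. The sketch below is an assumption-laden variant, not the method from the answer: the function name find_copies_nul is made up, it needs bash, and it relies on the -z (--zero) option of GNU md5sum (coreutils 8.25 or later), which ends each record with NUL and disables file name escaping.

```shell
#!/usr/bin/env bash
# find_copies_nul NEEDLE HAYSTACK_DIR  (hypothetical helper)
# Prints the matching paths, each terminated by a NUL byte.
find_copies_nul() {
    local needle=$1 haystack=$2 target_hash target_size record

    target_hash=$(md5sum < "$needle" | awk '{ print $1 }')
    target_size=$(wc -c < "$needle" | tr -d ' ')

    find "$haystack" -type f -size "${target_size}c" -print0 |
        xargs -0 -r md5sum -z |
        while IFS= read -r -d '' record; do
            # A record is the 32-character hash, two separator characters, then the path
            if [[ ${record:0:32} == "$target_hash" ]]; then
                printf '%s\0' "${record:34}"
            fi
        done
}
```

The result can be fed to another NUL-aware tool, for example find_copies_nul needle.file haystack/ | xargs -0 -r ls -l.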

