Find and list duplicate directories

Answer 1

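I had the same problem with my music collection... Most tools/scripts are either noisy (they list every file name) or checksum the file contents, which is too slow...

Special characters, spaces, and symbols make this challenging... The strategy is to compute an MD5 sum over the sorted file names together with the parent directory name; the script can then sort the hashes to find duplicates. The child file names must be sorted because find does not guarantee the same file order in two different directories.

Bash script (Debian 10):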

#!/bin/bash

# usage: ./find_duplicates tunes_dir
# output: c547c3bcf85b9c578a1a52dd20665343 - /mnt/tunes/soul brothers/Motherlode
# The MD5 is generated from all child file names + the album folder name.
# Sort the list by MD5, then print entries whose 32-character hash repeats;
# these represent duplicate albums.
# Album/CD1/... Album/CD2/... will show (3) results if Album is duplicated
# The CD1/2 layout is indistinguishable from Discography/Album/Song.mp3

if [ $# -eq 0 ]; then
    echo "Please supply tunes directory as first arg" >&2
    exit 1
fi

# Disable globbing: the unquoted expansions below should only be word-split
# (joining the names with single spaces), never expanded as patterns.
set -f

# Using absolute path of tunes_dir param
find "$(readlink -f "$1")" -type d | while IFS= read -r line
do
    cd "$line" || continue
    children=$(find ./ -type f | sort)
    base=$(basename "$line")
    sum=$(echo $children $base | md5sum)
    # md5sum prints "<hash>  -"; keep only the hash and quote the path
    printf '%s - %s\n' "${sum%% *}" "$line"
done | sort | uniq -D -w 32

Directory structure:

user@pc:~/test# find . -type d
./super soul brothers
./super soul brothers/Stritch's Brew
./super soul brothers/Fireball!
./super soul brothers/Motherlode
./car_tunes
./car_tunes/Fireball!

Example output:

user@pc:~# ./find_duplicates  test/
07b0f79429663685f4005486af20247a - /root/test/car_tunes/Fireball!
07b0f79429663685f4005486af20247a - /root/test/super soul brothers/Fireball!
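
The signature idea can also be tried in isolation. The following is only a minimal sketch, not the script above: `dir_signature` is a hypothetical helper, the demo tree and its names are made up, and because it joins names with newlines its hashes will differ from the ones the script prints.

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not part of the original script):
# print the MD5 of a directory's sorted relative file names plus the
# directory's own name. File contents are never read.
dir_signature() {
    local dir=$1 children
    # Sort so the result does not depend on the order find returns.
    children=$(cd "$dir" && find . -type f | LC_ALL=C sort) || return 1
    printf '%s\n%s\n' "$children" "$(basename "$dir")" | md5sum | cut -d' ' -f1
}

# Throwaway demo tree: two albums with the same name and the same file
# names (but different contents), plus one album with a different name.
demo=$(mktemp -d)
mkdir -p "$demo/a/Fireball" "$demo/b/Fireball" "$demo/a/Motherlode"
echo one   > "$demo/a/Fireball/01 intro.mp3"
echo two   > "$demo/b/Fireball/01 intro.mp3"
echo three > "$demo/a/Motherlode/01 intro.mp3"

sig_a=$(dir_signature "$demo/a/Fireball")
sig_b=$(dir_signature "$demo/b/Fireball")
sig_c=$(dir_signature "$demo/a/Motherlode")
printf '%s\n' "$sig_a" "$sig_b" "$sig_c"

rm -rf "$demo"
```

The two Fireball directories should get the same signature even though their files differ in content, while Motherlode gets a different one because the folder name is part of the hashed text.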

Answer 2

Use bash version 4 or later. On macOS, this can be installed with the Homebrew package manager, since the default bash is too old.

# Make glob patterns disappear rather than remain unexpanded
# if they don't match (nullglob).
# Make glob patterns also match hidden names (dotglob).
shopt -s nullglob dotglob

# Create an associative array that holds the number of times
# a directory's name has been seen (the basename of the directory's
# pathname is the key into this array).
declare -A count

# Set the positional parameters ($1, $2, etc.) to the pathnames
# of the directories that we're interested in.
set -- Top_Dir/*/*/

# Loop over our directory paths,
# and count how many times each basename occurs.
for dirpath do
        name=$( basename "$dirpath" )
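        # Expand the current value first so the key is not evaluated
        # a second time inside the arithmetic expression.
        count["$name"]=$(( ${count["$name"]:-0} + 1 ))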
done

# Loop over the directory paths again, but this time
# output each directory whose basename occurs more than once.
for dirpath do
        name=$( basename "$dirpath" )
        [[ ${count["$name"]} -gt 1 ]] && printf '%s\n' "$dirpath"
done

Testing:

$ tree -F
.
|-- Top_Dir/
|   |-- Level_1_Dir/
|   |   |-- standard_cat/
|   |   |-- standard_dog/
|   |   `-- standard_snake/
|   |-- Level_2_Dir/
|   |   |-- standard_cat/
|   |   |-- standard_moon/
|   |   `-- standard_sun/
|   `-- Level_3_Dir/
|       |-- standard_man/
|       |-- standard_moon/
|       `-- standard_woman/
`-- script

13 directories, 1 file

$ bash script
Top_Dir/Level_1_Dir/standard_cat/
Top_Dir/Level_2_Dir/standard_cat/
Top_Dir/Level_2_Dir/standard_moon/
Top_Dir/Level_3_Dir/standard_moon/
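
To see what the two shopt settings change, here is a small sketch on a made-up tree (the directory names and the `no_such_*` pattern are assumptions chosen only for the demo). It counts how many words the same glob expands to with and without nullglob and dotglob.

```shell
#!/bin/bash
demo=$(mktemp -d)
mkdir -p "$demo/top/one/.hidden" "$demo/top/one/visible"

# Default settings: hidden names are skipped, and a pattern that matches
# nothing is left in place as the literal pattern string.
shopt -u nullglob dotglob
set -- "$demo"/top/*/*/
default_count=$#
set -- "$demo"/top/no_such_*/*/
unmatched_default_count=$#

# With nullglob and dotglob: hidden directories are included, and a
# pattern that matches nothing expands to zero words.
shopt -s nullglob dotglob
set -- "$demo"/top/*/*/
dotglob_count=$#
set -- "$demo"/top/no_such_*/*/
unmatched_nullglob_count=$#

printf 'default:          %d match(es), unmatched pattern gives %d word(s)\n' \
    "$default_count" "$unmatched_default_count"
printf 'nullglob+dotglob: %d match(es), unmatched pattern gives %d word(s)\n' \
    "$dotglob_count" "$unmatched_nullglob_count"

rm -rf "$demo"
```

Without nullglob, the counting loop would run once on the literal pattern when nothing matches, which is why the script enables it before calling set.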

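To support older bash versions, you can instead store the unique basenames of the directories and the number of times each basename occurs in two separate ordinary arrays. This requires a linear search in each loop: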

shopt -s nullglob dotglob

set -- Top_Dir/*/*/

names=()
counts=()
for dirpath do
        name=$( basename "$dirpath" )

        found=false
        for i in "${!names[@]}"; do
                if [[ ${names[i]} == "$name" ]]; then
                        found=true
                        break
                fi
        done

        if "$found"; then
                counts[i]=$(( counts[i] + 1 ))
        else
                names+=( "$name" )
                counts+=( 1 )
        fi
done

for dirpath do
        name=$( basename "$dirpath" )

        for i in "${!names[@]}"; do
                if [[ ${names[i]} == "$name" ]]; then
                        [[ ${counts[i]} -gt 1 ]] && printf '%s\n' "$dirpath"
                        break
                fi
        done
done
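
A different way to avoid associative arrays, shown here only as a hedged sketch and not as the method used above, is to let sort and uniq -d find the repeated basenames. `list_dup_dirs` is a hypothetical helper name; the sketch assumes directory names contain no newlines and, since it does not set dotglob, it skips hidden directories.

```shell
#!/bin/bash
# Hypothetical helper: list second-level directories under "$1" whose
# basename occurs more than once. Assumes names contain no newlines.
list_dup_dirs() {
    local top=$1 dirpath dups
    # Basenames that occur at least twice, one per line.
    dups=$(for dirpath in "$top"/*/*/; do
               [ -d "$dirpath" ] && basename "$dirpath"
           done | sort | uniq -d)
    [ -n "$dups" ] || return 0
    for dirpath in "$top"/*/*/; do
        [ -d "$dirpath" ] || continue
        # -x: match the whole line, -F: fixed string, -q: no output
        if printf '%s\n' "$dups" | grep -qxF -- "$(basename "$dirpath")"; then
            printf '%s\n' "$dirpath"
        fi
    done
}

# Demo on a throwaway tree (names are made up for illustration).
demo=$(mktemp -d)
mkdir -p "$demo"/L1/cat "$demo"/L1/dog "$demo"/L2/cat "$demo"/L2/moon "$demo"/L3/moon
result=$(list_dup_dirs "$demo")
printf '%s\n' "$result"
rm -rf "$demo"
```

The `[ -d ... ]` checks stand in for nullglob here: if the pattern matches nothing, the literal pattern is not a directory and is skipped.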

Answer 3

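This can be done with bash on Ubuntu. It matches duplicate directories only, regardless of their depth in the tree. The part inside $() builds a pipe-separated list of directory names by finding the duplicates in the last column of the ls -l output. That pipe-separated list is then used as the grep pattern to filter the listing of all directories. Note that other files are not taken into account (no whole-word matching is used, etc.).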

> ls -lR Top_Dir/ | grep -E "$(ls -lR Top_Dir/ | grep ^d | rev | cut -d" " -f1 | rev | sort | uniq -d | head -c -1 | tr '\n' '|')" | grep -v ^d | sed 's/://'

Top_Dir/Level_1_Dir/standard_cat
Top_Dir/Level_2_Dir/standard_cat
Top_Dir/Level_2_Dir/standard_moon
Top_Dir/Level_3_Dir/standard_moon
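
The pattern-building step inside $() can be looked at on its own. This sketch feeds it a hard-coded list of names (an assumption standing in for the last column of the ls -lR output on the Top_Dir example tree); `standard_catfish` is a made-up name used to show the effect of the unanchored match. Note that head -c -1 is a GNU coreutils feature.

```shell
#!/bin/bash
# Stand-in for the directory names taken from the last column of the
# "d" lines of ls -lR; hard-coded here for illustration.
names='standard_cat
standard_dog
standard_moon
standard_cat
standard_moon
standard_sun'

# sort | uniq -d keeps one copy of each name that occurs more than once.
# head -c -1 drops the trailing newline so that tr does not leave a
# dangling "|" at the end of the pattern.
pattern=$(printf '%s\n' "$names" | sort | uniq -d | head -c -1 | tr '\n' '|')
printf '%s\n' "$pattern"

# The pattern is an unanchored alternation, so longer names that merely
# contain a duplicate name are matched as well.
matched=$(printf '%s\n' standard_cat standard_catfish standard_dog | grep -E "$pattern")
printf '%s\n' "$matched"
```

This is the limitation the answer mentions: without whole-word or whole-line matching, a name such as `standard_catfish` would pass the filter along with `standard_cat`.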
