Find duplicate files and replace them with symlinks

Question 1

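If you'd rather not do much scripting yourself, I can recommend rdfind. It scans the given directories for duplicate files and replaces them with hard links or symlinks. I have used it to deduplicate my Ruby gems directory with great success. It is available in Debian/Ubuntu.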

Question 2

I had a similar situation, but in my case the symlinks had to point to relative paths, so I wrote this Python script to do the trick:

#!/usr/bin/env python3
# Reads `fdupes -r -1` output and creates a relative symbolic link for each duplicate.
# Usage: fdupes -r1 . | ./lndupes.py
# Note: each line is split on whitespace, so file names containing spaces are not supported.

import os
import sys
from os.path import basename, dirname, join, relpath

for line in sys.stdin:
    files = line.split()
    if len(files) < 2:
        continue  # skip blank lines and groups without a duplicate
    first = files[0]
    print("First: %s" % first)
    for dup in files[1:]:
        # Link target: path from the duplicate's directory to the first file
        rel = relpath(dirname(first) or ".", dirname(dup) or ".")
        target = join(rel, basename(first))
        print("Linking duplicate: %s to %s" % (dup, target))
        os.unlink(dup)
        os.symlink(target, dup)

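For each input line (a list of files), the script splits the list on spaces, computes the relative path from each file to the first one, and then creates the symlink.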
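The relative-path step is the least obvious part, so here is a small sketch of it in isolation. The helper name `relative_target` and the sample paths are made up for illustration, and POSIX-style paths are assumed; only standard `os.path` functions are used.

```python
from os.path import basename, dirname, join, relpath


def relative_target(first, dup):
    """Return the link target to store in `dup` so that it resolves to `first`.

    Hypothetical helper for illustration; it only manipulates path strings
    and does not touch the file system.
    """
    # A symlink target is resolved relative to the directory that holds the
    # link, so measure from dup's directory to first's directory.
    rel_dir = relpath(dirname(first) or ".", dirname(dup) or ".")
    return join(rel_dir, basename(first))


print(relative_target("./a/x.txt", "./b/c/y.txt"))  # ../../a/x.txt
print(relative_target("./a/x.txt", "./a/y.txt"))    # ./x.txt
```

The second call shows that two files in the same directory get a `./`-prefixed target, which is still a valid relative link.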

Question 3

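First: is there a reason you need symlinks rather than the usual hard links? I have a hard time seeing why symlinks with relative paths would be necessary. Here is how I would solve this problem:

I think the Debian (Ubuntu) version of fdupes can replace duplicates with hard links using the -L option, but I don't have a Debian installation to verify this.

If you do not have a version with the -L option, you can use this tiny bash script I found on commandlinefu.
Note that this syntax only works in bash.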

fdupes -r -1 path | while read line; do master=""; for file in ${line[*]}; do if [ "x${master}" == "x" ]; then master=$file; else ln -f "${master}" "${file}"; fi; done; done

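The command above finds all duplicate files in "path" and replaces them with hard links. You can verify this by running ls -ilR and looking at the inode numbers. Here is a sample with ten identical files: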

$ ls -ilR

total 20
3094308 -rw------- 1 username group  5 Sep 14 17:21 file
3094311 -rw------- 1 username group  5 Sep 14 17:21 file2
3094312 -rw------- 1 username group  5 Sep 14 17:21 file3
3094313 -rw------- 1 username group  5 Sep 14 17:21 file4
3094314 -rw------- 1 username group  5 Sep 14 17:21 file5
3094315 drwx------ 1 username group 48 Sep 14 17:22 subdirectory

./subdirectory:
total 20
3094316 -rw------- 1 username group 5 Sep 14 17:22 file
3094332 -rw------- 1 username group 5 Sep 14 17:22 file2
3094345 -rw------- 1 username group 5 Sep 14 17:22 file3
3094346 -rw------- 1 username group 5 Sep 14 17:22 file4
3094347 -rw------- 1 username group 5 Sep 14 17:22 file5

All the files have separate inode numbers, which makes them separate files. Now let's deduplicate them:

$ fdupes -r -1 . | while read line; do j="0"; for file in ${line[*]}; do if [ "$j" == "0" ]; then j="1"; else ln -f ${line// .*/} $file; fi; done; done
$ ls -ilR
.:
total 20
3094308 -rw------- 10 username group  5 Sep 14 17:21 file
3094308 -rw------- 10 username group  5 Sep 14 17:21 file2
3094308 -rw------- 10 username group  5 Sep 14 17:21 file3
3094308 -rw------- 10 username group  5 Sep 14 17:21 file4
3094308 -rw------- 10 username group  5 Sep 14 17:21 file5
3094315 drwx------  1 username group 48 Sep 14 17:24 subdirectory

./subdirectory:
total 20
3094308 -rw------- 10 username group 5 Sep 14 17:21 file
3094308 -rw------- 10 username group 5 Sep 14 17:21 file2
3094308 -rw------- 10 username group 5 Sep 14 17:21 file3
3094308 -rw------- 10 username group 5 Sep 14 17:21 file4
3094308 -rw------- 10 username group 5 Sep 14 17:21 file5

The files now all have the same inode number, meaning they all point to the same physical data on disk.
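The same check can be scripted instead of read off the ls output. The following is a minimal sketch, assuming a file system that supports hard links; the helper name `same_inode` and the file names are illustrative, and the demo runs in a throwaway temporary directory.

```python
import os
import tempfile


def same_inode(a, b):
    """True if both paths refer to the same inode on the same device."""
    sa, sb = os.stat(a), os.stat(b)
    return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "file")
    copy = os.path.join(tmp, "file2")
    for path in (original, copy):
        with open(path, "w") as fh:
            fh.write("data\n")

    before = same_inode(original, copy)  # two independent files

    # Roughly what `ln -f original copy` does: drop the copy, then hard-link.
    os.unlink(copy)
    os.link(original, copy)

    after = same_inode(original, copy)
    link_count = os.stat(original).st_nlink

print(before, after, link_count)  # False True 2
```

The link count of 2 here corresponds to the 10 shown in the second column of the listing above, where ten names share one inode.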

I hope this solves your problem, or at least points you in the right direction!

Question 4

A few caveats up front:

Bash-specific
No spaces in file names
Assumes each line contains at most 2 files

fdupes -1r common/base/dir | while read -r -a line ; do ln -sf "$(realpath --relative-to "$(dirname "${line[1]}")" "${line[0]}")" "${line[1]}"; done

If more than 2 files are duplicates (e.g. file1 file2 file3), we need a symlink for each pair, treating file1,file2 and file1,file3 as 2 separate cases:

if [[ ${#line[@]} -gt 2 ]] ;then
  ln -sf "$(realpath --relative-to "$(dirname "${line[1]}")" "${line[0]}")" "${line[1]}"
  ln -sf "$(realpath --relative-to "$(dirname "${line[2]}")" "${line[0]}")" "${line[2]}"
  ...
fi

Extending this to automatically handle an arbitrary number of duplicates per line would take a bit more effort.
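For comparison, one way to handle any number of duplicates per line is to loop over everything after the first entry. This is only a sketch in Python rather than bash, under the same assumption that file names contain no spaces; `link_group` is a hypothetical helper, and the demo runs in a temporary directory rather than on real data.

```python
import os
import tempfile
from os.path import basename, dirname, join, relpath


def link_group(paths):
    """Replace every path after the first with a relative symlink to the first."""
    first = paths[0]
    for dup in paths[1:]:
        rel_dir = relpath(dirname(first) or ".", dirname(dup) or ".")
        target = join(rel_dir, basename(first))
        os.unlink(dup)
        os.symlink(target, dup)


# Demo: three identical files, one of them in a subdirectory.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(join(tmp, "sub"))
    group = [join(tmp, "file1"), join(tmp, "file2"), join(tmp, "sub", "file3")]
    for path in group:
        with open(path, "w") as fh:
            fh.write("same\n")

    link_group(group)

    targets = [os.readlink(p) for p in group[1:]]
    resolved_ok = all(os.path.samefile(p, group[0]) for p in group[1:])

print(targets, resolved_ok)  # ['./file1', '../file1'] True
```

With real fdupes output, each line would be passed in as `link_group(line.split())`.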

Another approach is to first create symlinks with absolute paths, then convert them:

fdupes -1r /absolute/path/common/base/dir | while read -r -a line ; do ln -sf ${line[0]} ${line[1]}; done
chroot /absolute/path/common/base/dir ; symlinks -cr .

This is based on @Gilles's answer: https://unix.stackexchange.com/a/100955/77319

Answer