I have two directories, each containing hundreds of thousands of files. Most of the files are duplicated between the two locations, but some are not, and I would like to know which files are not duplicated (meaning, for example, that they have no backup, so I can then decide whether to back them up or delete them). The catch is that each file's path relative to its parent directory may be completely different. Some files may be identical but have different names, so the tool should compare checksums to eliminate those from the output.
Answer 1
There is a wonderful little program called fdupes that can help with this. Be careful, though: it can also be set to delete all duplicates, among other fun things.
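For example (option names from the fdupes man page, so double-check them against your installed version: -r recurses into subdirectories, -1 prints each set of duplicates on one line, -m summarizes, and -d interactively deletes):

fdupes -r dir1 dir2        # list every group of duplicate files
fdupes -r -1 dir1 dir2     # one group per line (the format Answer 2 parses)
fdupes -r -m dir1 dir2     # just report how much space the duplicates take
fdupes -r -d dir1 dir2     # DANGER: prompts you to delete duplicates

fdupes matches files by content (size, then checksum, then a byte-by-byte comparison), so identical files with different names are caught, which is exactly what the question asks for.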
Answer 2
I haven't tested this extensively yet, but here is my fdupes-based solution so far:
#!/usr/bin/env python3
# Load the lists of files in dir1 and dir2, plus the list of
# duplicates produced by fdupes, and print every file that is
# not duplicated anywhere.
#
# Prepare the inputs like this (-r makes fdupes recurse the way
# find does):
#   find dir1 -type f > dir1.txt
#   find dir2 -type f > dir2.txt
#   fdupes -r -1 dir1 dir2 > dupes.txt
#
# Caveat: fdupes -1 separates paths with spaces, so paths that
# themselves contain spaces will be mis-parsed below.
import sys

dir1_file = sys.argv[1]
dir2_file = sys.argv[2]
dupes_file = sys.argv[3]

# Collect every path from each directory listing into a set.
dir1 = set()
with open(dir1_file) as f:
    for line in f:
        dir1.add(line.strip())

dir2 = set()
with open(dir2_file) as f:
    for line in f:
        dir2.add(line.strip())

# Each line of the fdupes -1 output is one group of identical
# files. A group can hold more than two paths, so walk the whole
# line instead of unpacking exactly two names, and drop every
# member from both sets.
with open(dupes_file) as f:
    for line in f:
        for dupe in line.split():
            dir1.discard(dupe)
            dir2.discard(dupe)

# Whatever survived has no duplicate in either tree.
for path in sorted(dir1):
    print(path)
for path in sorted(dir2):
    print(path)
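Assuming the script is saved as non_dupes.py (the filename is mine), the last step is to run it against the three listings and redirect the result:

python3 non_dupes.py dir1.txt dir2.txt dupes.txt > non_dupes.txt

Every path left in non_dupes.txt exists only once across both trees, so those are the candidates for backing up or deleting.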