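When I try to compare two very similar 27 GB files on a Linux machine running CentOS 5 with 4 GB of RAM, diff fails with a "diff: memory exhausted" error. This appears to be a known issue.
I would expect there to be an alternative to such an essential utility, but I can't find one. I imagine a solution would have to store the information it needs in temporary files rather than in memory.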
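- I tried rdiff and xdelta, but they are better suited to producing the changes between two files, like a patch, and less useful for inspecting the differences between two files.
- I tried a binary diff tool, but it is a visual tool better suited to comparing binary files. I need something that can pipe the differences to STDOUT like regular diff.
- There are many other utilities, such as vimdiff, but they only work with smaller files.
- I have also read about Solaris bdiff, but I could not find a port for Linux.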
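Any ideas other than splitting the files into smaller pieces? I have 40 of these files, so I would like to avoid splitting them.
Answer 1
cmp works byte by byte, so it probably won't run out of memory (I just tested it on two 7 GB files). You may, however, be looking for more detail than a list of "files X and Y differ at byte x, line z". If the similar parts of your files are offset (for example, file Y contains an identical block of text, but not at the same position), you can pass offsets to cmp; with a small script you could probably turn that into a resynchronizing compare.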
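As a rough illustration of the two cmp features mentioned above, the sketch below builds three tiny throwaway files (their names and contents are made up for the example), lists the differing bytes with -l, and then compares two files whose shared content starts at different offsets. It assumes GNU cmp, whose -i SKIP1:SKIP2 option may not exist in other implementations.

```shell
# Throwaway sample files; names and contents are illustrative only.
tmp=$(mktemp -d)
printf 'alpha\nbravo\ncharlie\n' > "$tmp/a.txt"
printf 'alpha\nbrXvo\ncharlie\n' > "$tmp/b.txt"
printf 'HEADER\nalpha\nbravo\ncharlie\n' > "$tmp/c.txt"

# -l lists every differing byte: 1-based offset, then both byte values in octal.
# cmp exits with status 1 when the files differ, hence the "|| true".
byte_diffs=$(cmp -l "$tmp/a.txt" "$tmp/b.txt") || true
echo "$byte_diffs"

# c.txt is a.txt with 7 extra bytes in front, so skip 0 bytes of a.txt
# and 7 bytes of c.txt before comparing (GNU cmp syntax, an assumption here).
if cmp -i 0:7 "$tmp/a.txt" "$tmp/c.txt"; then
    shifted_result="identical after offset"
else
    shifted_result="different"
fi
echo "$shifted_result"

rm -rf "$tmp"
```

Because cmp only ever holds a small buffer of each file, the same commands should behave the same way on multi-gigabyte inputs, just more slowly.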
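P.S. In case anyone else lands here looking for a way to confirm that two directory structures (containing very large files) are identical: diff --recursive --brief (or diff -r -q for short, or even diff -rq) will work and will not run out of memory.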
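A minimal sketch of that directory check, using two small throwaway trees (all paths and file names below are invented for the example):

```shell
# Build two small directory trees that differ in one file and one extra file.
tmp=$(mktemp -d)
mkdir -p "$tmp/left/sub" "$tmp/right/sub"
printf 'same\n'  > "$tmp/left/sub/same.txt"
printf 'same\n'  > "$tmp/right/sub/same.txt"
printf 'old\n'   > "$tmp/left/changed.txt"
printf 'new\n'   > "$tmp/right/changed.txt"
printf 'extra\n' > "$tmp/left/only-left.txt"

# -r descends into subdirectories; -q only reports whether files differ,
# without computing a line-by-line diff.
# diff exits with status 1 when differences exist, hence the "|| true".
report=$(diff -rq "$tmp/left" "$tmp/right") || true
echo "$report"

rm -rf "$tmp"
```

The report names the changed file and the file that exists on only one side; identical files produce no output.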
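Answer 2
I found this link:
diff -H might help, or you can try installing the textproc/2bsd-diff port, which apparently doesn't try to load the files into RAM, so it can handle large files more easily.
I'm not sure whether you have tried either of these options, or whether they will work for you. Good luck.
Answer 3
If the files are identical (same length) except for a few byte values, you can use a script like the following (w is the number of bytes per line in the hex dump; adjust it to your display width):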
w=12;
while read -ru7 x && read -ru8 y;
do
[ ".$x" = ".$y" ] || echo "$x | $y";
done 7< <(od -vw$w -tx1z FILE1) 8< <(od -vw$w -tx1z FILE2) > DIFF-FILE1-FILE2 &
less DIFF-FILE1-FILE2
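It's not very fast, but it gets the job done.
Answer 4
So this isn't exactly the OP's problem, but a related one: you have two large database dumps in which every insert/record is on its own line, but differences between floating-point implementations cause the numbers to be off by some IEEE error. Thanks to the answer provided by @Diomidis and the giant awk one-liner shown below, we have a fully functional, efficient fuzzy diff.
Save the text below as fuzzy-compare.awk in some script directory, adjust the parameters in the BEGIN section as needed (locale-specific settings, debug mode, etc.), then pipe the output of paste into it: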
paste -d $'\a' file1 file2 | awk -f fuzzy-compare.awk
Sample output:
Line 1 diffs found so far: 1 here at field: 4
75747358 1 53 2011-03-29 23:00:00+00 7.428
75747358 1 53 2011-03-28 23:00:00+00 7.428
Line 2 diffs found so far: 2 here at field: 4
75747359 1 53 2011-03-29 23:30:00+00 5.757
75747359 1 53 2011-03-29 23:30:00+01 5.757
Line 3 diffs found so far: 3 here at field: 3
75747360 1 53 2011-03-30 00:00:00+00 6.739
75747360 1 54 2011-03-30 00:00:00+00 6.74
Line 5 diffs found so far: 4
75747362 1 53 2011-03-30 01:00:00+00 6.736 extra-field
75747362 1 53 2011-03-30 01:00:00+00 6.73599999999999977
What plain diff shows:
# diff sample.sql sample2.sql
1,3c1,3
< 75747358 1 53 2011-03-29 23:00:00+00 7.428
< 75747359 1 53 2011-03-29 23:30:00+00 5.757
< 75747360 1 53 2011-03-30 00:00:00+00 6.739
---
> 75747358 1 53 2011-03-28 23:00:00+00 7.428
> 75747359 1 53 2011-03-29 23:30:00+01 5.757
> 75747360 1 54 2011-03-30 00:00:00+00 6.74
5,13c5,13
< 75747362 1 53 2011-03-30 01:00:00+00 6.736 extra-field
< 75747363 1 53 2011-03-30 01:30:00+00 7.576
< 75747364 1 53 2011-03-30 02:00:00+00 6.789
< 75747365 1 53 2011-03-30 02:30:00+00 6.386e+2
< 75747366 1 53 2011-03-30 03:00:00+00 6.016E-2
< 75747367 1 53 2011-03-30 03:30:00+00 6.336
< 75747368 1 53 2011-03-30 04:00:00+00 6.1
< 75747374 1 53 2011-03-30 07:00:00+00 5.9412
< 75747375 1 53 2011-03-30 07:30:00+00 6.137803249
---
> 75747362 1 53 2011-03-30 01:00:00+00 6.73599999999999977
> 75747363 1 53 2011-03-30 01:30:00+00 7.576e+10
> 75747364 1 53 2011-03-30 02:00:00+00 6.789e-10
> 75747365 1 53 2011-03-30 02:30:00+00 6.38600000000000012e+2
> 75747366 1 53 2011-03-30 03:00:00+00 6.01600000000000001E-2
> 75747367 1 53 2011-03-30 03:30:00+00 6.3360000000000003
> 75747368 1 53 2011-03-30 04:00:00+00 6.0999999999999993
> 75747374 1 53 2011-03-30 07:00:00+00 5.94099999999999984
> 75747375 1 53 2011-03-30 07:30:00+00 6.13780324900000007
The code is below (also copied to a GitHub gist: https://gist.github.com/otheus/92162e3a764d2697c3272b98b2663a94).
#!/bin/awk -f
## Awk script to compare two SQL (postgres) dumps for which each line of input is a row
## and has been preprocessed by
## paste -d $'\a' file1 file2
## The BEL symbol is used by this program to quickly split the input
##
## Sometimes, numbers differ by some kind of rounding error / floating-point implementation
## Ignore that error by subtracting the two values and seeing if they are < maxdiff,
## maxdiff = 1 / (10 ^ (length-after-decimal-point(shortest-value))
## Consider:
## 4.2 vs 4.19998
## The shortest number is 4.2, its length is
## Notes:
## d is the global *d*iff counter
## p is the *p*osition / field that first had a difference
## i is a loop variable,usually current field
## L is the array of fields from the current line of the *L*eft-file
## R is " " " " " " " " " " " *R*ight-file
## clhs is the number of fields in L
## crhs is the number of fields in R
BEGIN {
    FS="\a";
    DECIMAL_SEP=".";
    FIELD_SEP="\t"; # for postgresql; for mysql, maybe ", ";
    MAX_DIFFS=10;
    DEBUG=0;
    # Efficiently fill out our table of maximum tolerances of values
    Maxdiffs[1] = 0.1;
    for (i=2; i<31; ++i)
        Maxdiffs[i] = Maxdiffs[i-1] / 10;
    p=-1; # everything starts out fine.
}
# if -v start=...., skip until that line
NR < (0 + start) { next }
# When pairs don't match, investigate further...
("_" $1) != ("_" $2) {
    if (DEBUG>1) print "Line",NR ": Input lines differed somehow. Investigating...";
    p=0; # p is field# where difference was found; 0 means whole line
    # split each half into tab-delimited fields
    clhs=split($1,L,FIELD_SEP);
    crhs=split($2,R,FIELD_SEP);
    if (clhs == crhs) {
        if (DEBUG>1) print "Line",NR ": Same number of tokens in each line, delimited by '" FIELD_SEP "'";
        ## compare field by field
        p = -1; # if we don't set p in the loop below, no real differences
        # Compare each field, until a difference is found
        for (i=1; i<=clhs && p<0; ++i) {
            # Hint: force this compare to be string-based
            if (("_" L[i]) != ("_" R[i])) {
                if (DEBUG>1) print "Line",NR ": Field",i,"differs somehow";
                ## They differ... but are they numbers?
                if ( \
                    L[i] ~ /^-?[0-9]+\.[0-9]+([eE][-+][0-9]+)?$/ && \
                    R[i] ~ /^-?[0-9]+\.[0-9]+([eE][-+][0-9]+)?$/ \
                ) {
                    # both fields are floating-point numbers, compare loosely
                    # strip exponent part
                    sub(/[eE].*/,"",L[i]);sub(/[eE].*/,"",R[i]);
                    # determine precision of shortest value
                    precision=( \
                        length(L[i]) < length(R[i]) ? \
                        length(L[i]) - index(L[i],DECIMAL_SEP) : \
                        length(R[i]) - index(R[i],DECIMAL_SEP) \
                    );
                    # look up the maxdiff from our table
                    maxdiff=Maxdiffs[precision];
                    diff=(L[i] - R[i]); # compare the current field, not field 1
                    if (diff > maxdiff || diff < -maxdiff) {
                        if (DEBUG) print "Line",NR ": Numbers differed at",i,"between",L[i],"and",R[i],"differing more than",maxdiff;
                        p=i;
                    }
                    else {
                        if (DEBUG) print "Line",NR ": Numbers differed at",i,"between",L[i],"and",R[i],"but differed less than",maxdiff;
                    }
                }
                else {
                    if (DEBUG) print "Line",NR ": Strings or ints differed at",i,"between",L[i],"and",R[i];
                    p=i;
                }
            }
            else {
                if (DEBUG) print "Line",NR ": No differences found";
            }
        }
    }
    # else, field count is different, so whole line is.
    else {
        if (DEBUG) print "Line",NR ": Number of fields in line differ";
    }
}
p>=0 {
    ++d; # bump total diffs count
    # Output a little header for each non-matching records
    print "Line",NR,"diffs found so far:",d,(p ? "here at field: " p : "" );
    # Output the lines that didnt match
    print $1; print $2; print "";
    p=-1;
}
# Progress counter
NR % 100000 == 0 { print "Line",NR }
d > MAX_DIFFS { exit(1);}
Note that the code above was a single line before it was posted.
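To make the tolerance rule in the script's header comments concrete, here is a small standalone sketch. The function name fuzzy_equal is hypothetical and not part of the script; it only mirrors the idea of allowing a difference of up to 1 / 10^n, where n is the number of digits after the decimal point in the shorter of the two values. Unlike the full script, it does not handle exponents.

```shell
results=$(awk '
# Hypothetical helper (not part of the script above): returns 1 when two
# decimal strings agree within 1 / 10^n, where n is the number of digits
# after the decimal point in the shorter string.
function fuzzy_equal(a, b,    precision, tolerance, delta) {
    precision = (length(a) < length(b)) ? length(a) - index(a, ".") : length(b) - index(b, ".")
    tolerance = 1 / (10 ^ precision)
    delta = a - b
    return (delta <= tolerance && delta >= -tolerance)
}
BEGIN {
    # Values taken from the sample dumps shown earlier.
    print fuzzy_equal("6.736", "6.73599999999999977")   # precision 3
    print fuzzy_equal("6.739", "6.74")                  # precision 2
    print fuzzy_equal("5.9412", "5.94099999999999984")  # precision 4
}')
echo "$results"
```

The first two pairs fall within the tolerance and print 1; the last pair differs by about 0.0002 against a tolerance of 0.0001 and prints 0.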