How to compare large files on Linux

Question 1

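cmp works byte by byte, so it is unlikely to run out of memory (I just tested it on two 7 GB files). You may, however, be looking for more detail than a list of "files X and Y differ at byte x, line z". If the similar regions of your files are offset from each other (for example, file Y contains an identical block of text, but at a different position), you can pass offsets to cmp; a small script could probably turn that into a resynchronizing compare.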
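Passing offsets to cmp could look roughly like the sketch below. It assumes GNU cmp, whose -i SKIP1:SKIP2 and -n LIMIT options may be missing or behave differently in other implementations; the function name cmp_at_offsets is made up for illustration.

```shell
# Hypothetical helper: cmp_at_offsets FILE1 OFF1 FILE2 OFF2 NBYTES
# Compares NBYTES bytes of FILE1 starting at byte OFF1 with NBYTES bytes
# of FILE2 starting at byte OFF2. Assumes GNU cmp:
#   -i SKIP1:SKIP2  skip that many leading bytes of each file
#   -n LIMIT        compare at most LIMIT bytes
#   -s              print nothing, report via exit status only
cmp_at_offsets() {
    cmp -s -i "$2:$4" -n "$5" "$1" "$3"
}
```

For instance, cmp_at_offsets a.bin 4096 b.bin 0 1048576 returns success when the first mebibyte of b.bin matches a.bin from byte 4096 onward. A resynchronizing compare would call something like this repeatedly with candidate offsets.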

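P.S. In case anyone else lands here looking for a way to confirm that two directory structures (containing very large files) are identical: diff --recursive --brief (diff -r -q for short, or even diff -rq) will work and will not run out of memory.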
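As a small sketch of the directory check, the wrapper below turns diff -rq into a yes/no answer. The name trees_identical is hypothetical, and -q is assumed to be available (it is in GNU and BSD diff, but is not required by POSIX).

```shell
# Hypothetical helper: trees_identical DIR1 DIR2
# Succeeds (exit status 0) only if both directory trees contain the same
# files with the same contents. diff -q reports only whether files
# differ, so no line-by-line diff of the large files is produced.
trees_identical() {
    diff -rq "$1" "$2" >/dev/null
}
```

Usage: trees_identical /backup/a /backup/b && echo "same". Drop the redirection to see which files differ or exist on only one side.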

Question 2

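I found this link.

diff -H might help, or you could try installing the textproc/2bsd-diff port, which apparently does not try to load the files into RAM and so copes with large files more easily.

I'm not sure whether you have tried these two options, or whether they will work for you. Good luck.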
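For reference, a minimal sketch of the first option, assuming GNU diff, where -H is the short form of --speed-large-files (other diff implementations may not accept it). The wrapper name large_diff is made up for illustration.

```shell
# Hypothetical helper: large_diff OLD NEW
# Assumes GNU diff. --speed-large-files (short form: -H) switches to a
# heuristic intended for large files with many scattered small changes.
large_diff() {
    diff --speed-large-files "$1" "$2"
}
```

The output format is the same as plain diff; only the matching strategy changes, so the result may not be the smallest possible diff.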

Question 3

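If the files are identical (and the same length) apart from a few byte values, you can use a script like the following (w is the number of bytes to hex-dump per line; adjust it to your display width):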

w=12;
while read -ru7 x && read -ru8 y;
do
  [ ".$x" = ".$y" ] || echo "$x | $y";
done 7< <(od -vw$w -tx1z FILE1) 8< <(od -vw$w -tx1z FILE2) > DIFF-FILE1-FILE2 &

less DIFF-FILE1-FILE2

It isn't very fast, but it gets the job done.
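A different route to the same information, not part of the script above, is to let cmp -l list the differing bytes and reformat its output. The sketch below assumes cmp -l prints a 1-based decimal byte number followed by the two byte values in octal, as GNU and POSIX cmp do; the function name hex_byte_diffs is hypothetical. How printf "%x" handles very large offsets may vary between awk implementations.

```shell
# Hypothetical helper: hex_byte_diffs FILE1 FILE2
# Prints one line per differing byte: 0-based offset in hex, then the
# byte value from FILE1 and from FILE2 in hex.
hex_byte_diffs() {
    cmp -l "$1" "$2" | awk '
        # convert an octal string such as 143 to a number
        function oct(s,    n, k) {
            n = 0
            for (k = 1; k <= length(s); k++) n = n * 8 + substr(s, k, 1)
            return n
        }
        # $1 is the 1-based byte number, $2 and $3 are octal byte values
        { printf "%08x  %02x  %02x\n", $1 - 1, oct($2), oct($3) }
    '
}
```

Unlike the od loop, this prints only the bytes that differ rather than whole dump lines, so there is no surrounding context.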

Question 4

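So this isn't exactly the OP's question, but a related problem is this: you have two large database dumps with each insert/record on its own line, but differences between floating-point implementations cause the numbers to be off by some IEEE rounding error. Thanks to the answer provided by @Diomidis and the humongous awk script shown below, we get a fully functional, efficient fuzzy diff.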

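Save the text below as fuzzy-compare.awk in some script directory, adjust the parameters in the BEGIN section as needed (locale-specific settings, debug mode, and so on), then pipe the output of paste into it: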

paste -d $'\a' file1 file2 | awk -f fuzzy-compare.awk
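To show just the pairing mechanism, here is a stripped-down sketch that prints the numbers of the lines that differ exactly, with no fuzzy matching. The function name differing_lines is hypothetical; like the command above, it assumes neither file contains a BEL character.

```shell
# Hypothetical helper: differing_lines FILE1 FILE2
# paste joins line N of FILE1 and line N of FILE2 with a BEL character,
# so awk sees the left line as $1 and the right line as $2.
differing_lines() {
    bel=$(printf '\a')
    # Prefixing "_" forces a string comparison, so that values such as
    # 1.0 and 1.00 are not treated as equal numbers.
    paste -d "$bel" "$1" "$2" | awk -F "$bel" '("_" $1) != ("_" $2) { print NR }'
}
```

Because paste streams both files in step, memory use stays flat regardless of file size, but an inserted or deleted line makes every following pair mismatch.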

Sample output:

Line 1 diffs found so far: 1 here at field: 4
75747358        1       53      2011-03-29 23:00:00+00  7.428
75747358        1       53      2011-03-28 23:00:00+00  7.428

Line 2 diffs found so far: 2 here at field: 4
75747359        1       53      2011-03-29 23:30:00+00  5.757
75747359        1       53      2011-03-29 23:30:00+01  5.757

Line 3 diffs found so far: 3 here at field: 3
75747360        1       53      2011-03-30 00:00:00+00  6.739
75747360        1       54      2011-03-30 00:00:00+00  6.74

Line 5 diffs found so far: 4
75747362        1       53      2011-03-30 01:00:00+00  6.736   extra-field
75747362        1       53      2011-03-30 01:00:00+00  6.73599999999999977

Whereas diff shows:

# diff sample.sql sample2.sql
1,3c1,3
< 75747358      1       53      2011-03-29 23:00:00+00  7.428
< 75747359      1       53      2011-03-29 23:30:00+00  5.757
< 75747360      1       53      2011-03-30 00:00:00+00  6.739
---
> 75747358      1       53      2011-03-28 23:00:00+00  7.428
> 75747359      1       53      2011-03-29 23:30:00+01  5.757
> 75747360      1       54      2011-03-30 00:00:00+00  6.74
5,13c5,13
< 75747362      1       53      2011-03-30 01:00:00+00  6.736   extra-field
< 75747363      1       53      2011-03-30 01:30:00+00  7.576
< 75747364      1       53      2011-03-30 02:00:00+00  6.789
< 75747365      1       53      2011-03-30 02:30:00+00  6.386e+2
< 75747366      1       53      2011-03-30 03:00:00+00  6.016E-2
< 75747367      1       53      2011-03-30 03:30:00+00  6.336
< 75747368      1       53      2011-03-30 04:00:00+00  6.1
< 75747374      1       53      2011-03-30 07:00:00+00  5.9412
< 75747375      1       53      2011-03-30 07:30:00+00  6.137803249
---
> 75747362      1       53      2011-03-30 01:00:00+00  6.73599999999999977
> 75747363      1       53      2011-03-30 01:30:00+00  7.576e+10
> 75747364      1       53      2011-03-30 02:00:00+00  6.789e-10
> 75747365      1       53      2011-03-30 02:30:00+00  6.38600000000000012e+2
> 75747366      1       53      2011-03-30 03:00:00+00  6.01600000000000001E-2
> 75747367      1       53      2011-03-30 03:30:00+00  6.3360000000000003
> 75747368      1       53      2011-03-30 04:00:00+00  6.0999999999999993
> 75747374      1       53      2011-03-30 07:00:00+00  5.94099999999999984
> 75747375      1       53      2011-03-30 07:30:00+00  6.13780324900000007

The code follows (also copied to a GitHub gist: https://gist.github.com/otheus/92162e3a764d2697c3272b98b2663a94).

#!/bin/awk -f 
## Awk script to compare two SQL (postgres) dumps for which each line of input is a row
## and has been preprocessed by 
##   paste -d $'\a' file1 file2 
## The BEL symbol is used by this program to quickly split the input
##   
## Sometimes, numbers differ by some kind of rounding error / floating-point implementation
## Ignore that error by subtracting the two values and seeing if they are < maxdiff,
##     maxdiff = 1 / (10 ^ (length-after-decimal-point(shortest-value)) 
## Consider:
##   4.2  vs 4.19998
## The shortest number is 4.2, its length is 

## Notes:
##   d is the global *d*iff counter
##   p is the *p*osition / field that first had a difference
##   i is a loop variable,usually current field
##   L is the array of fields from the current line of the *L*eft-file
##   R is  "    "    "    "     "   "    "  "    "   "  "  *R*ight-file
##   clhs is the number of fields in L
##   crhs is the number of fields in R

BEGIN { 
  FS="\a";
  DECIMAL_SEP=".";
  FIELD_SEP="\t";  # for postgresql; for mysql, maybe ", ";
  MAX_DIFFS=10;
  DEBUG=0;
  # Efficiently fill out our table of maximum tolerances of values
  Maxdiffs[1] = 0.1;
  for (i=2; i<31; ++i)
    Maxdiffs[i] = Maxdiffs[i-1] / 10;
  p=-1; # everything starts out fine.
}

# if -v start=...., skip until that line
NR < (0 + start) { next } 

# When pairs don't match, investigate further...
("_" $1) != ("_" $2) {
    if (DEBUG>1) print "Line",NR ": Input lines differed somehow. Investigating...";
    p=0;  # p is field# where difference was found; 0 means whole line
    # split each half into tab-delimited fields
    clhs=split($1,L,FIELD_SEP);
    crhs=split($2,R,FIELD_SEP);

    if (clhs == crhs) {
        if (DEBUG>1) print "Line",NR ": Same number of tokens in each line, delimited by '" FIELD_SEP "'";
        ## compare field by field
        p = -1;  # if we don't set p in the loop below, no real differences

        # Compare each field, until a difference is found
        for (i=1; i<=clhs && p<0; ++i) {
            # Hint: force this compare to be string-based
            if (("_" L[i]) != ("_" R[i])) {
                if (DEBUG>1) print "Line",NR ": Field",i,"differs somehow";

                ## They differ... but are they numbers?
                if ( \
                  L[i] ~ /^-?[0-9]+\.[0-9]+([eE][-+][0-9]+)?$/ && \
                  R[i] ~ /^-?[0-9]+\.[0-9]+([eE][-+][0-9]+)?$/ \
                ) {
                    # both fields are floating-point numbers, compare loosely

                    # strip exponent part
                    sub(/[eE].*/,"",L[i]);sub(/[eE].*/,"",R[i]);
                    # determine precision of shortest value
                    precision=( \
                        length(L[i]) < length(R[i]) ?  \
                        length(L[i]) - index(L[i],DECIMAL_SEP) :  \
                        length(R[i]) - index(R[i],DECIMAL_SEP)    \
                    );
                    # look up the maxdiff from our table
                    maxdiff=Maxdiffs[precision];

                    # subtract the current field's values (was L[1] - R[1], which
                    # always compared field 1); exponents were stripped above,
                    # so only the mantissas are compared
                    diff=(L[i] - R[i]);
                    if (diff > maxdiff || diff < -maxdiff) {
                        if (DEBUG) print "Line",NR ": Numbers differed at",i,"between",L[i],"and",R[i],"differing more than",maxdiff;
                        p=i;
                    }
                    else {
                        if (DEBUG) print "Line",NR ": Numbers differed at",i,"between",L[i],"and",R[i],"but differed less than",maxdiff;
                    }
                }
                else {
                    if (DEBUG) print "Line",NR ": Strings or ints differed at",i,"between",L[i],"and",R[i];
                    p=i;
                }
            }
            else {
                if (DEBUG) print "Line",NR ": No differences found";
            }
        }
    }
    # else, field count is different, so whole line is.
    else {
        if (DEBUG) print "Line",NR ": Number of fields in line differ";
    }
}

p>=0 { 
    ++d;  # bump total diffs count
    # Output a little header for each non-matching records
    print "Line",NR,"diffs found so far:",d,(p ? "here at field: "  p : "" ); 
    # Output the lines that didnt match
    print $1; print $2; print ""; 
    p=-1;
}

# Progress counter
NR % 100000 == 0 { print "Line",NR } 
d > MAX_DIFFS { exit(1);}

Note that the code above was a single line before it was posted.
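The tolerance rule described in the script's header comments can be tried in isolation with the sketch below. It is a simplified stand-in, not the script's exact logic: it handles plain decimals only (no exponents), assumes "." as the decimal separator, and the name fuzzy_equal is hypothetical.

```shell
# Hypothetical helper: fuzzy_equal A B
# Succeeds if A and B differ by no more than 1 / 10^d, where d is the
# number of digits after the decimal point in the shorter of the two.
fuzzy_equal() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        s = (length(a) < length(b)) ? a : b    # shorter textual form
        digits = length(s) - index(s, ".")     # digits after the point
        maxdiff = 1 / (10 ^ digits)
        d = a - b
        if (d < 0) d = -d
        exit (d > maxdiff ? 1 : 0)
    }'
}
```

With the header's example, fuzzy_equal 4.2 4.19998 succeeds: the shorter value 4.2 has one digit after the point, so the tolerance is 0.1. By contrast, fuzzy_equal 1.5 2.5 fails.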
