Comparing two files using Unix and Awk

Answer 1

You can use awk. Put the following into a script named script.awk:

FNR == NR {
  f1[$1,$2,$4] = $0
  f1_c14[$1,$2,$4] = 1
  f1_c5[$1,$2,$4] = $5
  next
}  

f1_c14[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print f1[$1,$2,$4];
}

f1[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print $0;
}

Now run it like this:

$ awk -f script.awk file1  file2
sc2/80         20      .        A       T         86       PASS     N=2     F=5;U=4
sc2/80         20      .        A        C        80      PASS    N=2       F=5;U=4
sc2/60         55      .        G       T         76       PASS     N=2     F=5;U=4 
sc2/60         55      .        G        C        72      PASS    N=2       F=5;U=4

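The script works as follows. The following block creates 3 arrays: f1, f1_c14, and f1_c5. f1 holds every line of file1, indexed by the contents of columns 1, 2, and 4 of that line. f1_c14 is another array with the same index (the contents of columns 1, 2, and 4) and a value of 1. The third array, f1_c5, uses the same index as the first two, and its value is column 5 of the file1 line.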

FNR == NR {
  f1[$1,$2,$4] = $0
  f1_c14[$1,$2,$4] = 1
  f1_c5[$1,$2,$4] = $5
  next
}
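How a subscript such as [$1,$2,$4] works may not be obvious: awk joins the comma-separated expressions into one string key, using the built-in SUBSEP variable as the separator. The sketch below illustrates this with a single hypothetical input line (the sample line is an assumption modeled on the output shown in this answer, not taken from the real files):

```shell
# Hypothetical one-line input; the fields mimic the layout used in this answer.
line='sc2/80 20 . A T 86 PASS N=2 F=5;U=4'

result=$(printf '%s\n' "$line" | awk '
{
  f1_c5[$1,$2,$4] = $5           # stored under the single string key $1 SUBSEP $2 SUBSEP $4
}
END {
  for (key in f1_c5) {
    n = split(key, part, SUBSEP) # recover the three columns from the joined key
    print n, part[1], part[2], part[3], f1_c5[key]
  }
}')
echo "$result"   # prints: 3 sc2/80 20 A T
```

Because the key is just a joined string, two lines share an array element exactly when their columns 1, 2, and 4 are all identical.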

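The next block is responsible for printing lines from the first file, file1, on the condition that columns 1, 2, and 4 match those of a line in file2. It only prints the line if column 5 of file1 and file2 do not match.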

f1_c14[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print f1[$1,$2,$4];
}

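The third block is responsible for printing the associated line from file2 when its columns 1, 2, and 4 have a corresponding entry in the array f1. Again, it only prints when column 5 does not match.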

f1[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print $0;
}

Example

Run the above script like this:

$ awk -f script.awk file1  file2
sc2/80         20      .        A       T         86       PASS     N=2     F=5;U=4
sc2/80         20      .        A        C        80      PASS    N=2       F=5;U=4
sc2/60         55      .        G       T         76       PASS     N=2     F=5;U=4 
sc2/60         55      .        G        C        72      PASS    N=2       F=5;U=4

You can use the column command to clean up the output a bit:

$ awk -f script.awk file1  file2 | column -t
sc2/80  20  .  A  T  86  PASS  N=2  F=5;U=4
sc2/80  20  .  A  C  80  PASS  N=2  F=5;U=4
sc2/60  55  .  G  T  76  PASS  N=2  F=5;U=4
sc2/60  55  .  G  C  72  PASS  N=2  F=5;U=4
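The input files themselves are not shown in this answer. To try the script end to end, here is a minimal sketch that writes hypothetical file1 and file2 contents to a temporary directory. The sample rows are assumptions, chosen so that two keys differ in column 5, one key matches in column 5, and one key appears only in file2:

```shell
# Hypothetical inputs: the real file1 and file2 are not shown in this answer.
workdir=$(mktemp -d)
cat > "$workdir/file1" <<'EOF'
sc2/80 20 . A T 86 PASS N=2 F=5;U=4
sc2/60 55 . G T 76 PASS N=2 F=5;U=4
sc2/10 30 . C G 90 PASS N=2 F=5;U=4
EOF
cat > "$workdir/file2" <<'EOF'
sc2/80 20 . A C 80 PASS N=2 F=5;U=4
sc2/60 55 . G C 72 PASS N=2 F=5;U=4
sc2/10 30 . C G 91 PASS N=2 F=5;U=4
sc2/99 10 . T A 50 PASS N=2 F=5;U=4
EOF
# Same script as in this answer, written out so the sketch is self-contained.
cat > "$workdir/script.awk" <<'EOF'
FNR == NR {
  f1[$1,$2,$4] = $0
  f1_c14[$1,$2,$4] = 1
  f1_c5[$1,$2,$4] = $5
  next
}
f1_c14[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print f1[$1,$2,$4];
}
f1[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print $0;
}
EOF
output=$(awk -f "$workdir/script.awk" "$workdir/file1" "$workdir/file2")
echo "$output"
rm -r "$workdir"
```

With these sample rows, only the sc2/80 and sc2/60 pairs are printed, each file1 line followed by its file2 counterpart. The sc2/10 pair is skipped because column 5 is G in both files, and sc2/99 is skipped because it has no entry from file1.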

How does it work?

FNR == NR

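This takes advantage of how awk loops through its input files. NR counts every line read so far across all files, while FNR restarts at 1 for each new file, so the two are equal only while awk is reading the first file. Whenever we are on a line from the first file, file1, we want to run a particular block of code on that line.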

This example shows what FNR == NR is doing when we give it 2 simulated files. One of them has 4 lines, while the other has 5:

$ awk 'BEGIN {print "NR\tFNR\tline"} {print NR"\t"FNR"\t"$0}' \
     <(seq 1 4) <(seq 1 5)
NR  FNR line
1   1   1
2   2   2
3   3   3
4   4   4
5   1   1
6   2   2
7   3   3
8   4   4
9   5   5

Other blocks

The other blocks, f1_c14[$1,$2,$4] and f1[$1,$2,$4], only run when those array elements hold a value.
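One side effect of using an array element as a pattern is worth knowing: in awk, merely referencing an element that does not exist creates it with an empty value. The patterns still work, because an empty value is false, but the arrays grow by one entry for every unmatched file2 key. The sketch below illustrates the difference between a reference and a membership test with the in operator; the array name seen and the helper function count are made up for the illustration:

```shell
result=$(awk '
# Count the elements of an array by iterating over its keys.
function count(arr,   k, n) { n = 0; for (k in arr) n++; return n }
BEGIN {
  seen["a"] = 1
  if (seen["b"]) print "unreachable"    # false, but the reference creates seen["b"]
  print "after reference:", count(seen)
  if ("c" in seen) print "unreachable"  # membership test, creates nothing
  print "after in-test:", count(seen)
}')
echo "$result"
# prints:
# after reference: 2
# after in-test: 2
```

For the multi-column keys used in this answer, the equivalent membership test would be written as ($1,$2,$4) in f1.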

Answer 2

Here is a solution in Perl. You should save the following code in a file and run it as a script (see below):

#!/usr/bin/perl
use strict;
use warnings;

my $file1 = '/path/to/file1';
my $file2 = '/path/to/file2';
open my $f1, '<', $file1 or die "Cannot open $file1: $!";
open my $f2, '<', $file2 or die "Cannot open $file2: $!";
my %lines_dictionary;
while (<$f1>) {
    # split ' ' splits on runs of whitespace, so columns padded with several spaces still line up
    my ($c1, $c2, $c4, $c5) = (split ' ')[0,1,3,4]; # get relevant columns in file 1
    $lines_dictionary{"$c1 $c2 $c4"} = "$c5---$_";  # create a hash entry keyed by the relevant columns
}
while (<$f2>) {
    my ($c1, $c2, $c4, $c5) = (split ' ')[0,1,3,4]; # get relevant columns in file 2
    if (exists $lines_dictionary{"$c1 $c2 $c4"}) {  # if a line with the same key columns was seen in file 1
        # parse the hash entry for this line in file 1; the limit of 2 splits only at the first "---"
        my ($file1_c5, $file1_line) = split /---/, $lines_dictionary{"$c1 $c2 $c4"}, 2;
        if ($file1_c5 ne $c5) { # if column 5 of file 2 doesn't match column 5 of file 1 (ne is Perl's string inequality)
            print "${file1_line}$_\n"; # only one extra newline is needed, as the lines read from the files keep their trailing ones
        }
    }
}
close $f1;
close $f2;

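Use any text editor to paste this script into a file, modify the $file1 and $file2 variables to reflect the real locations of your files, and then make the script executable: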

$ chmod +x /path/to/script

Finally, invoke the script:

$ /path/to/script

Disclaimers

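This code is untested.
This code assumes that the pattern "---" is unlikely to appear in column 5.
This code assumes that the lines in file 1 are unique (i.e. each line has a different combination of column 1, column 2, and column 4). If several lines (not necessarily consecutive) contain the same data in the relevant columns, the script will use the last of those lines (the bottom-most one in the file).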
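The uniqueness assumption in the last disclaimer can be checked before running the comparison. A minimal sketch in awk, run here on three hypothetical sample lines (made up for the illustration); it prints each combination of columns 1, 2, and 4 the first time that combination repeats:

```shell
# Hypothetical sample lines; the first and third share columns 1, 2, and 4.
dupes=$(printf '%s\n' \
  'sc2/80 20 . A T 86' \
  'sc2/60 55 . G T 76' \
  'sc2/80 20 . A C 80' |
  awk 'count[$1,$2,$4]++ == 1 { print $1, $2, $4 }')
echo "$dupes"   # prints: sc2/80 20 A
```

To check a real file, replace the printf pipeline with the path to file 1 as an argument to awk. Empty output means every key is unique.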
