How to use awk to compare two columns of two files and print the number of matches

I have a data file A.tsv (field separator = \t):

id  clade   mutation
243 40A titi,toto,lala
254     
267 40B lala,jiji,jojo

and a template file B.tsv (field separator = \t):

40A lala,toto,xixi,xaxa
40B xaxa,jojo,huhu
40C sasa,sisi,lala

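Based on their common column (the clade), I want to compare the mutations in A.tsv against the template B.tsv and report the number of matches found in a new column of a new file (C.tsv), like this: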

id  clade   mutation    number
243 40A titi,toto,lala  2
254     
267 40B lala,jiji,jojo  1

I know how to compare two files, like this:

awk -F"," -vOFS="," '    
    NR==FNR {
     a[$2]=$3;
     next
    }
    
    { print $0,a[$2] }
' B.tsv A.tsv > C.tsv

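But I don't know how to count the matches. Do you have any ideas?

Second question:

I would also like to know how to create a new column that contains only the number of mutations present in B.tsv for that clade, like the total_mut column in this example C.tsv: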

id  clade   mutation    number  total_mut
243 40A titi,toto,lala  2   4
254     
267 40B lala,jiji,jojo  1   3

Answer 1

Using GNU awk and its word-boundary anchors \< and \>:

awk 'BEGIN{ FS=OFS="\t" }
NR==FNR{ mutations[$1] =$2; next }

{
    split($3, muts, "," );
    for(x in muts) { tmp=mutations[$2]; c+=sub( "\\<"muts[x]"\\>", "", tmp) }
}

FNR==1 { c="number" }
{ print $0, (c?c:""); c=0 }' fileB  fileA

Output:

id      clade   mutation        number
243     40A     titi,toto,lala  2
254
267     40B     lala,jiji,jojo  1

Updated answer for the second requirement:

awk 'BEGIN{ FS=OFS="\t" }
NR==FNR{ mutations[$1] =$2; next }

{
    split($3, muts, "," );
    for(x in muts) { tmp=mutations[$2]; c+=sub( "\\<"muts[x]"\\>", "", tmp) }
    m=1+gsub(",", "", tmp) 
}

FNR==1 { c="number"; m="total_mut" }
{ print $0, (c?c:""), (m>1?m:""); c=m=0 }' fileB  fileA

Output:

id      clade   mutation        number  total_mut
243     40A     titi,toto,lala  2       4
254
267     40B     lala,jiji,jojo  1       3
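
The \< and \> anchors are GNU extensions, and sub() treats each mutation name as a regular expression. As an alternative sketch (not part of the original answer), the same count can be obtained in POSIX awk by wrapping both strings in commas and using index(), so that only whole items match. The temporary directory and the variable name workdir are assumptions made for illustration; unlike the script above, this version prints 0 rather than an empty field when a known clade has no matching mutation.

```shell
# Recreate the sample files from the question in a scratch directory.
workdir="$(mktemp -d)"
printf 'id\tclade\tmutation\n243\t40A\ttiti,toto,lala\n254\t\t\n267\t40B\tlala,jiji,jojo\n' > "$workdir/A.tsv"
printf '40A\tlala,toto,xixi,xaxa\n40B\txaxa,jojo,huhu\n40C\tsasa,sisi,lala\n' > "$workdir/B.tsv"

awk 'BEGIN { FS = OFS = "\t" }
NR == FNR { mutations[$1] = $2; next }     # first file: template list per clade
FNR == 1  { print $0, "number"; next }     # header of the data file
!($2 in mutations) { print; next }         # no known clade: print the row unchanged
{
    c = 0
    n = split($3, muts, ",")
    for (i = 1; i <= n; i++)
        # wrap both strings in commas so that only whole items can match
        if (index("," mutations[$2] ",", "," muts[i] ","))
            c++
    print $0, c
}' "$workdir/B.tsv" "$workdir/A.tsv" > "$workdir/C.tsv"

cat "$workdir/C.tsv"
```

Because index() does a plain substring search, characters that are special in regular expressions need no escaping in the mutation names.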

Answer 2

awk 'BEGIN{ OFS=FS="\t" }
  NR==FNR{ clade[$1]=$2; next }         # save clade, mutation of B.tsv in array
  FNR==1{ print $0, "number"; next }    # print header
  !($2 in clade){ print; next }         # no match -> print record
  {                                     # else...
     split($3 "," clade[$2], tmp, ",")  # split mutations into tmp array
     for (i in tmp)                     # for all mutations
       if (++num[tmp[i]] > 1)           # if same mutation occurs more than once
         ++count                        # increment counter

     print $0, count                    # print record and count
     delete num                         # reset temporary array
     count=0                            # reset counter
  }
' B.tsv A.tsv > C.tsv

For the second question:

Replace the third line of the script (the header rule) with:

FNR==1{ print $0, "number", "total_mut"; next }

Replace the last print with:

print $0, count, split(clade[$2], tmp, ",")
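
For convenience, here is a sketch of the whole script with both replacements applied, run against the sample files from the question. It is assembled from the pieces above; the temporary directory and the name workdir are assumptions for illustration only.

```shell
# Recreate the sample files from the question in a scratch directory.
workdir="$(mktemp -d)"
printf 'id\tclade\tmutation\n243\t40A\ttiti,toto,lala\n254\t\t\n267\t40B\tlala,jiji,jojo\n' > "$workdir/A.tsv"
printf '40A\tlala,toto,xixi,xaxa\n40B\txaxa,jojo,huhu\n40C\tsasa,sisi,lala\n' > "$workdir/B.tsv"

awk 'BEGIN{ OFS=FS="\t" }
  NR==FNR{ clade[$1]=$2; next }                       # save clade, mutations of B.tsv
  FNR==1{ print $0, "number", "total_mut"; next }     # print extended header
  !($2 in clade){ print; next }                       # no match -> print record as is
  {
     split($3 "," clade[$2], tmp, ",")                # both mutation lists in one array
     for (i in tmp)
       if (++num[tmp[i]] > 1)                         # mutation seen more than once
         ++count
     # the second split() returns the number of template mutations for this clade
     print $0, count, split(clade[$2], tmp, ",")
     delete num                                       # reset temporary array
     count=0                                          # reset counter
  }
' "$workdir/B.tsv" "$workdir/A.tsv" > "$workdir/C.tsv"

cat "$workdir/C.tsv"
```

Note that this counting scheme also counts a mutation that is repeated within the A.tsv list itself, so it assumes each list contains no duplicates.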

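Answer 3

The approach is to build an array from file B, indexed by clade and mutation, and then iterate over the mutations in file A.

Handling tab-separated files is a little tricky, especially keeping the column count intact for rows that have no clade.

The column numbers needed from file A are defined as cClade and cMut and should be changed to match the full data format; for the sample A.tsv above they would be 2 and 3.

For the follow-up question, we save nMut (the number of mutations), which split() already returns, and add it to the output (header and detail rows). This version has been tested as well.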

#! /bin/bash

Match () {  #:: (data, template)

    Awk='
BEGIN { FS = "\t"; Sep = ","; cClade = 20; cMut = 41; }
F == "B" {
    nMut[$1] = split ($2, V, Sep);
    for (j in V) Mut[$1 Sep V[j]];
    next;
}
! $cClade { printf ("%s%s%s\n", $0, FS, FS); next; }
FNR == 1 { printf ("%s%s%s%s%s\n", $0, FS, "number", FS, "total_mut"); next; }
{
    n = 0;
    split ($cMut, V, Sep);
    for (j in V) if (($cClade Sep V[j]) in Mut) ++n;
    printf ("%s%s%s%s%s\n", $0, FS, n, FS, nMut[$cClade]);
}
'
    awk -f <( printf '%s' "${Awk}" ) F="B" "${2}" F="A" "${1}"
}

    Match useTemplate.A.tsv useTemplate.B.tsv > useTemplate.C.tsv
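
The remark above about keeping the column count for rows without a clade can be seen in isolation. In this minimal sketch (the sample row and the variable names are made up for illustration), the default field separator collapses runs of blanks and tabs, so an empty column disappears, whereas a tab separator keeps every empty column as its own field.

```shell
line="$(printf '254\t\tlala')"   # hypothetical row: id, empty clade, one mutation

# Default FS: blanks and tabs are collapsed, so the empty column vanishes.
default_nf="$(printf '%s\n' "$line" | awk '{ print NF }')"

# Tab as FS: every tab delimits a field, so the empty clade remains $2.
tab_nf="$(printf '%s\n' "$line" | awk -F '\t' '{ print NF }')"

echo "default FS: $default_nf fields, tab FS: $tab_nf fields"
# prints: default FS: 2 fields, tab FS: 3 fields
```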

Answer 4

This uses GNU awk for its multidimensional arrays (arrays of arrays):

gawk '
    BEGIN {
        FS = "[\t,]"
        OFS = "\t"
    }
    FILENAME == ARGV[1] {
        for (i = 2; i <= NF; i++)
            B[$1][$i] = 1
        next
    }
    FNR == 1 {
        print $0, "number", "total_mut"
        next
    }
    !($2 in B) {
        print
        next
    }
    {
        count = 0
        for (i = 3; i <= NF; i++)
            if ($i in B[$2])
                count++
        print $0, count, length(B[$2])
    }
' {B,A}.tsv

Output:

id      clade   mutation        number  total_mut
243     40A     titi,toto,lala  2       4
254
267     40B     lala,jiji,jojo  1       3
