I have a data file A.tsv (field separator = \t):
id clade mutation
243 40A titi,toto,lala
254
267 40B lala,jiji,jojo
and a template file B.tsv (field separator = \t):
40A lala,toto,xixi,xaxa
40B xaxa,jojo,huhu
40C sasa,sisi,lala
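Based on their common column (clade), I want to compare the mutations in A.tsv against those of the template B.tsv and report the number of matches in a new column of a new file (C.tsv), like this: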
id clade mutation number
243 40A titi,toto,lala 2
254
267 40B lala,jiji,jojo 1
I know how to compare two files like this:
awk -F"," -vOFS="," '
NR==FNR {
a[$2]=$3;
next
}
{ print $0,a[$2] }
' B.tsv A.tsv > C.tsv
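But I don't know how to count the matches. Do you have any ideas?
Second question:
I would also like to know how to create a new column, total_mut, that only reports how many mutations are listed for the clade in B.tsv.
Example of the total_mut column in C.tsv: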
id clade mutation number total_mut
243 40A titi,toto,lala 2 4
254
267 40B lala,jiji,jojo 1 3
Answer 1
Using GNU awk and its \< and \> word-boundary anchors:
awk 'BEGIN{ FS=OFS="\t" }
NR==FNR{ mutations[$1] =$2; next }
{
split($3, muts, "," );
for(x in muts) { tmp=mutations[$2]; c+=sub( "\\<"muts[x]"\\>", "", tmp) }
}
FNR==1 { c="number" }
{ print $0, (c?c:""); c=0 }' fileB fileA
Output:
id clade mutation number
243 40A titi,toto,lala 2
254
267 40B lala,jiji,jojo 1
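The \< and \> anchors are a GNU extension, so the command above may behave differently in other awk implementations. As a rough sketch of a portable alternative, a whole-item match can be approximated by wrapping the template list and each mutation in commas and using index(). The array and variable names (tmpl, muts, hit, n) and the temporary directory are illustrative assumptions, not part of the original answer.

```shell
#!/bin/sh
# Sketch only: count whole-item matches without the GNU-specific \< and \>.
# The sample files from the question are recreated so the example is self-contained.
dir=$(mktemp -d)
printf 'id\tclade\tmutation\n243\t40A\ttiti,toto,lala\n254\t\t\n267\t40B\tlala,jiji,jojo\n' > "$dir/A.tsv"
printf '40A\tlala,toto,xixi,xaxa\n40B\txaxa,jojo,huhu\n40C\tsasa,sisi,lala\n' > "$dir/B.tsv"

out=$(awk 'BEGIN { FS = OFS = "\t" }
NR == FNR { tmpl[$1] = "," $2 ","; next }      # template list, wrapped in commas
FNR == 1  { print $0, "number"; next }         # header line
{
    hit = 0
    n = split($3, muts, ",")
    for (i = 1; i <= n; i++)
        if (index(tmpl[$2], "," muts[i] ","))  # matches complete items only
            hit++
    print $0, (n ? hit : "")                   # empty cell when the row has no mutations
}' "$dir/B.tsv" "$dir/A.tsv")

printf '%s\n' "$out"
rm -rf "$dir"
```

Unlike the original command, this variant prints 0 rather than an empty cell when a row has mutations but none of them match.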
Updated answer for the second requirement:
awk 'BEGIN{ FS=OFS="\t" }
NR==FNR{ mutations[$1] =$2; next }
{
split($3, muts, "," );
for(x in muts) { tmp=mutations[$2]; c+=sub( "\\<"muts[x]"\\>", "", tmp) }
m=1+gsub(",", "", tmp)
}
FNR==1 { c="number"; m="total_mut" }
{ print $0, (c?c:""), (m>1?m:""); c=m=0 }' fileB fileA
Output:
id clade mutation number total_mut
243 40A titi,toto,lala 2 4
254
267 40B lala,jiji,jojo 1 3
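The m=1+gsub(",", "", tmp) step counts the commas and adds one. Because the m>1 test is what blanks the rows without a clade, a clade whose template lists exactly one mutation would also come out blank. A small sketch comparing that approach with the return value of split(), which is the number of elements, may make the difference clearer. The helper name count_items and the variables by_commas and by_split are hypothetical.

```shell
#!/bin/sh
# Sketch: two ways to count the items of a comma-separated list in awk.
count_items() {
    printf '%s\n' "$1" | awk '{
        s = $0
        by_commas = 1 + gsub(",", "", s)   # commas + 1; yields 1 even for an empty list
        by_split  = split($0, parts, ",")  # number of elements; 0 for an empty list
        print by_commas, by_split
    }'
}

r1=$(count_items 'lala,toto,xixi,xaxa')
r2=$(count_items 'xaxa')
r3=$(count_items '')
printf '%s\n%s\n%s\n' "$r1" "$r2" "$r3"
```

The two counts agree for non-empty lists and differ only for the empty one, so with split() the blank-row test could be "greater than zero" instead of "greater than one".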
Answer 2
awk 'BEGIN{ OFS=FS="\t" }
NR==FNR{ clade[$1]=$2; next } # save clade, mutation of B.tsv in array
FNR==1{ print $0, "number"; next } # print header
!($2 in clade){ print; next } # no match -> print record
{ # else...
split($3 "," clade[$2], tmp, ",") # split mutations into tmp array
for (i in tmp) # for all mutations
if (++num[tmp[i]] > 1) # if same mutation occurs more than once
++count # increment counter
print $0, count # print record and count
delete num # reset temporary array
count=0 # reset counter
}
' B.tsv A.tsv > C.tsv
For the second question:
Replace the third line with:
FNR==1{ print $0, "number", "total_mut"; next }
Replace the last print with:
print $0, count, split(clade[$2], tmp, ",")
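For convenience, here is a sketch of how the script might look with both replacements applied, run against the sample data from the question. The temporary directory is an assumption made for the example, and the "+ 0" on the counter is a small addition so that an unset counter prints as 0.

```shell
#!/bin/sh
# Sketch: the script from this answer with both replacements applied.
dir=$(mktemp -d)
printf 'id\tclade\tmutation\n243\t40A\ttiti,toto,lala\n254\t\t\n267\t40B\tlala,jiji,jojo\n' > "$dir/A.tsv"
printf '40A\tlala,toto,xixi,xaxa\n40B\txaxa,jojo,huhu\n40C\tsasa,sisi,lala\n' > "$dir/B.tsv"

awk 'BEGIN { OFS = FS = "\t" }
NR == FNR { clade[$1] = $2; next }                     # save clade -> mutations of B.tsv
FNR == 1  { print $0, "number", "total_mut"; next }   # extended header
!($2 in clade) { print; next }                         # no template for this clade
{
    split($3 "," clade[$2], tmp, ",")                  # both mutation lists in one array
    for (i in tmp)
        if (++num[tmp[i]] > 1)                         # seen more than once -> a match
            ++count
    print $0, count + 0, split(clade[$2], tmp, ",")    # split() returns the template size
    delete num                                         # reset temporary array
    count = 0                                          # reset counter
}' "$dir/B.tsv" "$dir/A.tsv" > "$dir/C.tsv"

out=$(cat "$dir/C.tsv")
printf '%s\n' "$out"
rm -rf "$dir"
```

Note that counting repeats this way assumes no mutation is listed twice within a single row of either file.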
Answer 3
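The approach is to build an array from the B file that is indexed by clade and mutation, and then iterate over the mutations in the A file.
Handling tab-separated files is a little tricky, especially preserving the column count for rows that have no clade.
We define the column numbers needed from the A file as cClade and cMut; change them to match your full data format.
For the follow-up question, we save nMut (the number of mutations), which split() already returns, and add it to the printed output (header and detail lines). This version has been tested as well.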
#! /bin/bash
Match () { #:: (data, template)
Awk='
BEGIN { FS = "\t"; Sep = ","; cClade = 20; cMut = 41; }
F == "B" {
nMut[$1] = split ($2, V, Sep);
for (j in V) Mut[$1 Sep V[j]];
next;
}
! $2 { printf ("%s%s%s\n", $0, FS, FS); next; }
FNR == 1 { printf ("%s%s%s%s%s\n", $0, FS, "number", FS, "total_mut"); next; }
{
n = 0;
split ($cMut, V, Sep);
for (j in V) if (($cClade Sep V[j]) in Mut) ++n;
printf ("%s%s%s%s%s\n", $0, FS, n, FS, nMut[$cClade]);
}
'
awk -f <( printf '%s' "${Awk}" ) F="B" "${2}" F="A" "${1}"
}
Match useTemplate.A.tsv useTemplate.B.tsv > useTemplate.C.tsv
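To try the idea on the three-column sample data from the question, the column numbers would be 2 and 3. The following is a condensed sketch of the same lookup technique rather than the script above verbatim: passing cClade and cMut with -v, using awk's built-in composite subscripts, and the temporary directory are all assumptions made for this example.

```shell
#!/bin/sh
# Sketch: clade+mutation lookup table, applied to the sample data.
dir=$(mktemp -d)
printf 'id\tclade\tmutation\n243\t40A\ttiti,toto,lala\n254\t\t\n267\t40B\tlala,jiji,jojo\n' > "$dir/A.tsv"
printf '40A\tlala,toto,xixi,xaxa\n40B\txaxa,jojo,huhu\n40C\tsasa,sisi,lala\n' > "$dir/B.tsv"

out=$(awk -v cClade=2 -v cMut=3 'BEGIN { FS = OFS = "\t" }
NR == FNR {
    nMut[$1] = split($2, V, ",")                  # template size per clade
    for (j = 1; j <= nMut[$1]; j++)
        Mut[$1, V[j]] = 1                         # composite key: clade + mutation
    next
}
FNR == 1      { print $0, "number", "total_mut"; next }
$cClade == "" { print $0, "", ""; next }          # keep the column count for rows without a clade
{
    n = 0
    k = split($cMut, V, ",")
    for (j = 1; j <= k; j++)
        if (($cClade, V[j]) in Mut) n++
    print $0, n, nMut[$cClade] + 0
}' "$dir/B.tsv" "$dir/A.tsv")

printf '%s\n' "$out"
rm -rf "$dir"
```

The row without a clade gets two empty trailing cells, so every output line ends up with the same number of tab-separated columns.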
Answer 4
This uses GNU awk for its multidimensional arrays:
gawk '
BEGIN {
FS = "[\t,]"
OFS = "\t"
}
FILENAME == ARGV[1] {
for (i = 2; i <= NF; i++)
B[$1][$i] = 1
next
}
FNR == 1 {
print $0, "number", "total_mut"
next
}
!($2 in B) {
print
next
}
{
count = 0
for (i = 3; i <= NF; i++)
if ($i in B[$2])
count++
print $0, count, length(B[$2])
}
' {B,A}.tsv
Output:
id clade mutation number total_mut
243 40A titi,toto,lala 2 4
254
267 40B lala,jiji,jojo 1 3
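The key to this answer is the field separator: with FS set to "[\t,]", tabs and commas both split fields, so every mutation becomes its own field while $0 itself is left untouched for printing. A tiny sketch of what awk sees for one template line (the variable names line and fields are only for illustration):

```shell
#!/bin/sh
# Sketch: how FS = "[\t,]" splits a single line of B.tsv into fields.
line=$(printf '40A\tlala,toto,xixi,xaxa')
fields=$(printf '%s\n' "$line" |
    awk 'BEGIN { FS = "[\t,]" } { print NF ": " $1 " | " $2 " | " $NF }')
echo "$fields"
```

The clade is field 1 and the mutations are fields 2 to NF, which is why the loop over B.tsv starts at 2 and the loop over A.tsv, which has an extra id column, starts at 3.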