I have a file containing a list of patterns like this:
K00001
K00003
K00005
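I want to grep for these patterns in a tab-separated table and print the matches. The table looks like this (the original has no blank lines):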
K00001 ko00010_Glycolysis__Gluconeogenesis
K00003 ko00010_Glycolysis__Gluconeogenesis
K00005 ko00010_Glycolysis__Gluconeogenesis
K00001 ko00020_Citrate_cycle_(TCA_cycle)
K00003 ko00020_Citrate_cycle_(TCA_cycle)
K00005 ko00020_Citrate_cycle_(TCA_cycle)
The result I want has one line per pattern from my pattern file, with all of that pattern's matches joined together:
K00001_ko00010_Glycolysis__Gluconeogenesis;K00001_ko00020_Citrate_cycle_(TCA_cycle)
K00003_ko00010_Glycolysis__Gluconeogenesis;K00003_ko00020_Citrate_cycle_(TCA_cycle)
K00005_ko00010_Glycolysis__Gluconeogenesis;K00005_ko00020_Citrate_cycle_(TCA_cycle)
Answer 1
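This solution uses awk. We pass two file names as arguments and use the if (FNR == NR) idiom to do different things depending on whether we are reading the first file or the second. Two associative arrays store the keys and the output lines.
Here is the file a.awk: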
# usage: awk -f a.awk keyfile1 datafile2
BEGIN {
    FS = "\t"                              # set field separator to TAB
}
{
    if (FNR == NR) {                       # if looking at first (key) file
        k[$1]=$1                           # just save each key
    } else {                               # if looking at second file
        if ($1 in k) {                     # if first col is one that we want
            output=$1 "_" $2               # prepare output line
            if (out[$1]=="")               # if first time we've seen this key
                out[$1]=output             # store output as is
            else                           # and when we find more matches for this key
                out[$1]=out[$1] ";" output # we append ";" and the output
        }
    }
}
END {                                      # at the end
    for (i in out)                         # print all the output lines
        print out[i]
}
Here is how to use it:
$ awk -f a.awk file1 file2
K00001_ko00010_Glycolysis__Gluconeogenesis;K00001_ko00020_Citrate_cycle_(TCA_cycle)
K00003_ko00010_Glycolysis__Gluconeogenesis;K00003_ko00020_Citrate_cycle_(TCA_cycle)
K00005_ko00010_Glycolysis__Gluconeogenesis;K00005_ko00020_Citrate_cycle_(TCA_cycle)
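One caveat: awk does not specify the order in which for (i in out) visits the array, so the lines above may come out in a different order on another awk implementation. If the output should follow the order of the pattern file, one possible variant is sketched below. It is only a sketch under assumptions: the array names want and order, the counter n, and the temporary file1/file2 (recreated here from the sample data in the question) are illustrative and not part of the original answer.

```shell
#!/bin/sh
# Recreate the question's sample data in a temporary directory (illustrative only).
tmp=$(mktemp -d)
printf 'K00001\nK00003\nK00005\n' > "$tmp/file1"
printf 'K00001\tko00010_Glycolysis__Gluconeogenesis\n'  > "$tmp/file2"
printf 'K00003\tko00010_Glycolysis__Gluconeogenesis\n' >> "$tmp/file2"
printf 'K00005\tko00010_Glycolysis__Gluconeogenesis\n' >> "$tmp/file2"
printf 'K00001\tko00020_Citrate_cycle_(TCA_cycle)\n'   >> "$tmp/file2"
printf 'K00003\tko00020_Citrate_cycle_(TCA_cycle)\n'   >> "$tmp/file2"
printf 'K00005\tko00020_Citrate_cycle_(TCA_cycle)\n'   >> "$tmp/file2"

result=$(awk -F'\t' '
    # First file: remember each key once, along with the position it appeared in.
    FNR == NR {
        if (!($1 in want)) { want[$1]; order[++n] = $1 }
        next
    }
    # Second file: append "key_value" to the line for that key, separated by ";".
    $1 in want {
        out[$1] = (($1 in out) ? out[$1] ";" : "") $1 "_" $2
    }
    # Print in key-file order, skipping keys that never matched.
    END {
        for (i = 1; i <= n; i++)
            if (order[i] in out)
                print out[order[i]]
    }
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Like the original script, this relies on FNR == NR being true only while the first file is read, so it assumes the pattern file is not empty.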