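I have a tab-delimited file in which one column holds multiple comma-separated values, and I want to replace those values using a lookup table.
Lookup file: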
ID Name
g_00001 g_00001
g_00002 cytA
g_00003 g_00003
g_00004 mntB
g_00005 recF
g_00006 gyaN
g_00007 traR
g_00008 g_00008
g_00009 g_00009
g_00010 hypE
Input file:
Name Start Stop Strand Number of Genes Genes
op00001 1544 5454 + 2 cytA, g_00001
op00002 7026 12012 + 2 recF, mntB
op00003 15215 16854 - 3 g_00010,cytA, g_00009
op00004 19856 25454 - 2 hypE, g_00020
op00005 20791 23568 + 2 gyaN, g_00005
Output file:
Name Start Stop Strand Number of Genes Genes
op00001 1544 5454 + 2 g_00002, g_00001
op00002 7026 12012 + 2 g_00005, g_00004
op00003 15215 16854 - 3 g_00010, g_00002, g_00009
op00004 19856 25454 - 2 g_00010, g_00020
op00005 20791 23568 + 2 g_00006, g_00005
Based on some examples here, I tried the following code:
awk -F';' 'NR==FNR{a[$2]=$1;next}{$6=a[$1]}1' lookup input
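It doesn't change anything.
Another approach I thought of is to go one by one with sed -i 's/cytA/g_00002/', generating a sed file with one line per replacement and running it in a loop, but I wanted to check whether there is a better way to do this.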
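Answer 1
Those "multiple comma-separated values" are separated by a comma and (in most cases, but not all) a space, which doesn't make handling them any easier. Try adjusting the field separator so that each gene becomes a field of its own: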
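awk -F"[, \t]+" '
    # First file (lookup): map each gene name to its ID
    NR==FNR { a[$2] = $1
              next
            }
    # Second file: fields 6 onward are the individual genes
            { for (i=6; i<=NF; i++) if ($i in a) sub($i, a[$i])
            }
    1
' OFS="\t" Lookup_file input_file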
Name Start Stop Strand Number of Genes Genes
op00001 1544 5454 + 2 g_00002, g_00001
op00002 7026 12012 + 2 g_00005, g_00004
op00003 15215 16854 - 3 g_00010,g_00002, g_00009
op00004 19856 25454 - 2 g_00010, g_00020
op00005 20791 23568 + 2 g_00006, g_00005
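As for why the attempt in the question appeared to change nothing: neither file contains a semicolon, so with -F';' awk treats each whole line as $1, and a[$1] never finds a match. A quick check, assuming one sample line built with printf (the variable names here are illustrative only):

```shell
# One tab-separated sample line from the question; no semicolons anywhere.
line=$(printf 'op00001\t1544\t5454\t+\t2\tcytA, g_00001')

# With -F';' the whole line is a single field.
nf_semicolon=$(printf '%s\n' "$line" | awk -F';' '{ print NF }')

# With a tab separator the six columns are seen as intended.
nf_tab=$(printf '%s\n' "$line" | awk -F'\t' '{ print NF }')

echo "fields with ';': $nf_semicolon"
echo "fields with tab: $nf_tab"
```

The first count is 1 and the second is 6, which is why the field separator has to match the real delimiters before any lookup can work.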
Answer 2
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR == FNR {
    map[$2] = $1
    next
}
{
    n = split($NF,g,/[[:space:]]*,[[:space:]]*/)
    out = ""
    for ( i=1; i<=n; i++ ) {
        out = (i>1 ? out ", " : "") (g[i] in map ? map[g[i]] : g[i])
    }
    $NF = out
    print
}
$ awk -f tst.awk lookup_file input_file
Name Start Stop Strand Number of Genes Genes
op00001 1544 5454 + 2 g_00002, g_00001
op00002 7026 12012 + 2 g_00005, g_00004
op00003 15215 16854 - 3 g_00010, g_00002, g_00009
op00004 19856 25454 - 2 g_00010, g_00020
op00005 20791 23568 + 2 g_00006, g_00005
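To try this without the real files, here is a minimal sketch that writes a few of the question's rows into a scratch directory and runs the same logic inline. The scratch paths and the `tmp` and `result` variables are assumptions for illustration, and the split pattern uses plain spaces instead of [[:space:]] in case the awk in use lacks POSIX character classes:

```shell
# Scratch directory so the sketch does not touch real files.
tmp=$(mktemp -d)

# Two lookup rows and two input rows from the question, with real tabs.
printf 'ID\tName\ng_00002\tcytA\ng_00010\thypE\n' > "$tmp/lookup"
printf 'op00003\t15215\t16854\t-\t3\tg_00010,cytA, g_00009\n' >  "$tmp/input"
printf 'op00004\t19856\t25454\t-\t2\thypE, g_00020\n'         >> "$tmp/input"

# Same idea as tst.awk: map Name -> ID, then rewrite only the last column.
result=$(awk '
BEGIN { FS = OFS = "\t" }
NR == FNR { map[$2] = $1; next }
{
    # Split the gene list on commas with optional surrounding spaces.
    n = split($NF, g, / *, */)
    out = ""
    for (i = 1; i <= n; i++)
        out = (i > 1 ? out ", " : "") (g[i] in map ? map[g[i]] : g[i])
    $NF = out
    print
}' "$tmp/lookup" "$tmp/input")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Names that are missing from the lookup, such as g_00020, pass through unchanged, and the separator in the last column is normalized to a comma followed by a space.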