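I have a text manipulation problem that I can't solve. Say I have a text file (text.txt) like the one below. In some cases a line with /locus_tag is followed by a line with /gene, and in other cases it is not. I want to find every /locus_tag line that is not followed by a /gene line, then use a table like the one below (table.txt) to match the /locus_tag to a /gene and add that /gene line to my text file right after the /locus_tag line.
Any ideas on how to do this would be great.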
/locus_tag="LOCUS_23770"
/note="ABC"
/locus_tag="LOCUS_23780"
/note="DEF"
/locus_tag="LOCUS_23980"
/note="GHI"
/locus_tag="LOCUS_24780"
/gene="BT_4758"
/note="ONP"
/locus_tag="LOCUS_25780"
/gene="BT_4768"
/note="WZX"
Table
/locus_tag /gene
LOCUS_00010 BT_4578
LOCUS_00020 BT_4577
LOCUS_00030 BT_2429
Answer 1
Using your linked file, this works:
awk 'BEGIN{FS="[ =]+"; OFS="="}
BEGINFILE{fno++}
fno==1{locus["\""$1"\""]="\""$2"\""; }
fno>1{if (old ~ /LOCUS/ && $0 !~ /gene/) print "/gene", locus[old]; old=$3; print}
' table file1
Before
/locus_tag="LOCUS_00030"
/note="WP_011108293.1 hypothetical protein (Bacteroides
After
/locus_tag="LOCUS_00030"
/gene="BT_2429"
/note="WP_011108293.1 hypothetical protein (Bacteroides
Since you are not familiar with awk, here is a walkthrough:
awk 'BEGIN{FS="[ =]+"; OFS="="}
# set up the input field separator as any group of spaces and/or =
# and set the output field separator as =
BEGINFILE{fno++}
# Whenever a file is opened, increment the file counter fno
# (BEGINFILE is a gawk extension)
fno==1{locus["\""$1"\""]="\""$2"\""; }
# if this is the first file (i.e. table) load the array `locus[]`
# but wrap the fields in "..." so that they are exactly like the data file entries
fno>1{if (old ~ /LOCUS/ && $0 !~ /gene/) print "/gene", locus[old]; old=$3; print}
# if this is a data file
# if the current value of old (i.e. the previous line) is a LOCUS
# and this line ($0) is not a gene
# add a gene by indexing into the locus array based upon the value of old
# because old contains the last LOCUS we found
# in all cases
# set old to the 3rd field on the current line,
# which on any LOCUS line is the string "LOCUS_?????" and
# print the current line
# See note below re $2 vs $3 and FS
' table file1
# your input files, table must be first, you can have more data files if you want
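BEGINFILE is specific to GNU awk. Where only a POSIX awk is available, the same file counter can probably be kept by testing FNR==1 instead. The following is a minimal sketch rather than a tested replacement: the sample files, the temporary directory and the add_genes function name are hypothetical, and the data lines are indented on the assumption that the linked file is indented the same way.

```shell
# Hypothetical sample files; the data lines are indented like the linked file.
tmp=$(mktemp -d)
printf '%s\n' '/locus_tag /gene' 'LOCUS_00020 BT_4577' 'LOCUS_00030 BT_2429' > "$tmp/table"
printf '%s\n' \
  '     /locus_tag="LOCUS_00030"' \
  '     /note="hypothetical protein"' \
  '     /locus_tag="LOCUS_00020"' \
  '     /gene="BT_4577"' \
  '     /note="another protein"' > "$tmp/file1"

# add_genes TABLE DATAFILE: same logic as the walkthrough, without BEGINFILE.
add_genes() {
  awk 'BEGIN { FS = "[ =]+"; OFS = "=" }
  # FNR restarts at 1 for every input file, so this counts files
  FNR == 1 { fno++ }
  # first file (the table): load the lookup array, quoted like the data values
  fno == 1 { locus["\"" $1 "\""] = "\"" $2 "\""; next }
  # data file: add the gene line when the previous line was a LOCUS
  # and the current line is not already a gene
  { if (old ~ /LOCUS/ && $0 !~ /gene/) print "/gene", locus[old]
    old = $3
    print }' "$1" "$2"
}

result=$(add_genes "$tmp/table" "$tmp/file1")
printf '%s\n' "$result"
rm -rf "$tmp"
```

One caveat with this approach: an empty table file never produces a record with FNR==1, so the counter would be off by one in that case.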
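Alternatively, if you are not using the multi-character FS, keep old=$2, because a single-character FS does not split on the blank space before the text in the data file, whereas the multi-character FS does.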
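The $2 versus $3 difference can be checked on a single line. In this small sketch the sample lines and variable names are made up for illustration: with the regex separator, the leading blanks count as a separator and leave an empty first field, which shifts the value to $3.

```shell
# Hypothetical sample lines: one indented like the linked file, one not.
indented='     /locus_tag="LOCUS_00030"'
flush='/locus_tag="LOCUS_23770"'

# Regex FS: the leading blanks produce an empty $1, so the value is in $3.
regex_indented=$(printf '%s\n' "$indented" | awk 'BEGIN { FS = "[ =]+" } { print NF ": " $3 }')
# Regex FS on an unindented line: there is no empty first field, so the value is in $2.
regex_flush=$(printf '%s\n' "$flush" | awk 'BEGIN { FS = "[ =]+" } { print NF ": " $2 }')
# Single-character FS: the blanks stay inside $1, so the value is in $2.
plain_indented=$(printf '%s\n' "$indented" | awk 'BEGIN { FS = "=" } { print NF ": " $2 }')

printf '%s\n' "$regex_indented" "$regex_flush" "$plain_indented"
```

Each printed line shows the field count followed by the quoted locus value, which is the string used to index the locus array.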
The version below sets the field separator according to the file being read, FS=(fno==1)?" ":"=", that is, a space for the table and = for the data:
awk 'BEGIN{OFS="="}
BEGINFILE{fno++;FS=(fno==1)?" ":"="}
fno==1{locus["\""$1"\""]="\""$2"\""; }
fno>1{if (old ~ /LOCUS/ && $0 !~ /gene/) print "/gene", locus[old]; old=$2; print}
' table file1
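This works provided the table file is not so large that it hogs memory.
To test for a locus that has no entry in the table and insert a message there, in case an empty /gene= is not appropriate: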
fno>1{if (old ~ /LOCUS/ && $0 !~ /gene/) print "/gene", (old in locus)?locus[old]:"\"GENE_MISSING_AT_LOCUS\""; old=$3; print}
Change the field reference for old to match the version of FS you are using.
/locus_tag="LOCUS_00020"
/gene="GENE_MISSING_AT_LOCUS"
/note="WP_008765457.1 hypothetical protein (Bacteroides
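If a plain list of the loci that would receive the placeholder is more useful than the edited file, the same test can be turned into a report. This is only a sketch under the same assumptions as above: the sample files and variable names are hypothetical, the data lines are assumed to be indented, and FNR==1 stands in for BEGINFILE so that it does not depend on gawk.

```shell
# Hypothetical sample files: LOCUS_00020 has no entry in the table.
tmp2=$(mktemp -d)
printf '%s\n' 'LOCUS_00030 BT_2429' > "$tmp2/table"
printf '%s\n' \
  '     /locus_tag="LOCUS_00020"' \
  '     /note="first protein"' \
  '     /locus_tag="LOCUS_00030"' \
  '     /note="second protein"' > "$tmp2/file1"

missing=$(awk 'BEGIN { FS = "[ =]+" }
  FNR == 1 { fno++ }
  # table: remember which loci are known, quoted like the data values
  fno == 1 { locus["\"" $1 "\""] = 1; next }
  # data: report a LOCUS that has no gene line and no table entry
  { if (old ~ /LOCUS/ && $0 !~ /gene/ && !(old in locus)) {
      tag = old
      gsub(/"/, "", tag)   # strip the surrounding quotes for the report
      print tag
    }
    old = $3 }' "$tmp2/table" "$tmp2/file1")

printf '%s\n' "$missing"
rm -rf "$tmp2"
```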
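Edit
Looking at the sample file you linked to, the only problem was a difference in format between the example above and the actual data, which threw off the field numbering. old=$2 just needed to change to old=$3. Corrected above.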