Find consecutive lines containing a specific string and modify the file according to a table

Question

Using your linked files, this works (BEGINFILE requires GNU awk):

awk 'BEGIN{FS="[ =]+"; OFS="="}
     BEGINFILE{fno++}
     fno==1{locus["\""$1"\""]="\""$2"\""; }
     fno>1{if (old ~ /LOCUS/ && $0 !~ /gene/) print "/gene", locus[old]; old=$3; print}
    ' table file1

Before

                     /locus_tag="LOCUS_00030"
                     /note="WP_011108293.1 hypothetical protein (Bacteroides

After

                     /locus_tag="LOCUS_00030"
/gene="BT_2429"
                     /note="WP_011108293.1 hypothetical protein (Bacteroides

Since you are not familiar with awk, here is a walkthrough:

awk 'BEGIN{FS="[ =]+"; OFS="="}
# set up the input field separator as any group of spaces and/or =
# and set the output field separator as =

     BEGINFILE{fno++}
     # Whenever you open a file, increment the file counter fno

     fno==1{locus["\""$1"\""]="\""$2"\""; }
     # if this is the first file (i.e. table) load the array `locus[]`
     # but wrap the fields in "..." so that they are exactly like the data file entries

     fno>1{if (old ~ /LOCUS/ && $0 !~ /gene/) print "/gene", locus[old]; old=$3; print}
     # if this is a data file
     # if the current value of old (i.e. the previous line) is a LOCUS
     # and && this line ($0) isn't a gene
     # add a gene by indexing into the locus array based upon the value of old
     # because old contains the last LOCUS we found
     # in all cases
     #    set old to the 3rd field on the current line,
     #       which on any LOCUS line is the string "LOCUS_?????" and
     #    print the current line
     # See note below re $2 vs $3 and FS

    ' table file1
    # your input files, table must be first, you can have more data files if you want

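Alternatively, without the multi-character FS, keep old=$2. A single-character FS of = does not split on the whitespace before the text in the data file, whereas the multi-character FS does, and that shifts the field numbers by one.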
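To see why the field number changes, here is a small sketch that splits one data line both ways. The sample line is copied from the output shown earlier; the variable names are only for illustration.

```shell
line='                     /locus_tag="LOCUS_00030"'

# Regex FS: the leading blanks count as a separator, so $1 is empty,
# $2 is /locus_tag and the quoted locus lands in $3.
regex_split=$(printf '%s\n' "$line" | awk -F'[ =]+' '{ print NF, $3 }')

# Single-character FS: only "=" splits, so the quoted locus is $2.
equals_split=$(printf '%s\n' "$line" | awk -F'=' '{ print NF, $2 }')

echo "$regex_split"    # 3 "LOCUS_00030"
echo "$equals_split"   # 2 "LOCUS_00030"
```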

The following sets the field separator according to the file being read, FS=(fno==1)?" ":"=": a space for the table and = for the data.

awk 'BEGIN{OFS="="}
     BEGINFILE{fno++;FS=(fno==1)?" ":"="}
     fno==1{locus["\""$1"\""]="\""$2"\""; }
     fno>1{if (old ~ /LOCUS/ && $0 !~ /gene/) print "/gene", locus[old]; old=$2; print}
    ' table file1

This assumes the table file is not so large that the array exhausts memory.
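For awk implementations without gawk's BEGINFILE, the per-file separator idea can probably be sketched with command-line variable assignments, which awk applies when it reaches them in the argument list. This is not the method used above: it swaps the fno counter for the NR==FNR test (which assumes the table is not empty), and the file contents are made-up samples, with a space-separated table layout assumed.

```shell
# Hypothetical sample inputs; the table layout (locus, space, gene) is assumed.
dir=$(mktemp -d)
printf '%s\n' 'LOCUS_00030 BT_2429' > "$dir/table"
printf '%s\n' \
  '                     /locus_tag="LOCUS_00030"' \
  '                     /note="WP_011108293.1 hypothetical protein (Bacteroides' \
  > "$dir/file1"

# Each FS=... operand takes effect just before the file that follows it is read.
result=$(awk 'BEGIN { OFS = "=" }
     # first file (the table): load the lookup array, then skip to the next record
     NR == FNR { locus["\"" $1 "\""] = "\"" $2 "\""; next }
     # data file: add a /gene line after a locus line that has none
     { if (old ~ /LOCUS/ && $0 !~ /gene/) print "/gene", locus[old]
       old = $2; print }
    ' FS=' ' "$dir/table" FS='=' "$dir/file1")
printf '%s\n' "$result"
rm -rf "$dir"
```

With these samples the /gene="BT_2429" line is printed between the locus_tag line and the note line, as in the Before/After example.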

To test for a missing gene and insert a message there, in case an empty /gene= does not suit you:

fno>1{if (old ~ /LOCUS/ && $0 !~ /gene/) print "/gene", (old in locus)?locus[old]:"\"GENE_MISSING_AT_LOCUS\""; old=$3; print}

Change the field reference assigned to old ($2 or $3) to match the version of FS you are using.

                     /locus_tag="LOCUS_00020"
/gene="GENE_MISSING_AT_LOCUS"
                     /note="WP_008765457.1 hypothetical protein (Bacteroides
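As a complete run of that test, here is a minimal sketch with made-up inputs: a table that has an entry for LOCUS_00030 but none for LOCUS_00020. It uses the multi-character FS with old=$3, and it substitutes the portable NR==FNR test for the BEGINFILE counter so it does not depend on gawk. The table layout and file names are assumptions.

```shell
dir=$(mktemp -d)
# Hypothetical table: LOCUS_00020 is deliberately absent.
printf '%s\n' 'LOCUS_00030 BT_2429' > "$dir/table"
printf '%s\n' \
  '                     /locus_tag="LOCUS_00020"' \
  '                     /note="WP_008765457.1 hypothetical protein (Bacteroides' \
  '                     /locus_tag="LOCUS_00030"' \
  '                     /note="WP_011108293.1 hypothetical protein (Bacteroides' \
  > "$dir/file1"

result=$(awk 'BEGIN { FS = "[ =]+"; OFS = "=" }
     # first file (the table): load the lookup array
     NR == FNR { locus["\"" $1 "\""] = "\"" $2 "\""; next }
     # data file: look the locus up, or fall back to a marker if it is not in the table
     { if (old ~ /LOCUS/ && $0 !~ /gene/)
           print "/gene", ((old in locus) ? locus[old] : "\"GENE_MISSING_AT_LOCUS\"")
       old = $3; print }
    ' "$dir/table" "$dir/file1")
printf '%s\n' "$result"
rm -rf "$dir"
```

Using (old in locus) rather than testing locus[old] directly avoids creating an empty array entry for every locus that is missing from the table.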

Edit

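Looking at the sample files you linked to, the only problem was a formatting difference between the example above and the actual data, which threw off the field numbering. old=$2 simply needed to change to old=$3. This has been corrected above.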

Answer 1