首先,感谢您的帮助。我使用 AWK 条件过滤 2 个文件时遇到问题。我想过滤的两个文件是:Fasta.fa
>SiiA lcl|NC_003197.2_prot_NP_463122.1_4111 100.000 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTYKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
>SiiA lcl|NC_010102.1_prot_WP_000389232.1_4169 99.048 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
>SiiA lcl|CP052796.1_prot_QJV25805.1_4154 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIESKTKSTAQNSGANDNSNANEIINKEVNTQDMSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
>SiiA lcl|NZ_CP009559.1_prot_WP_000389229.1_1106 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNNGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
>SiiA lcl|NZ_CP029897.1_prot_WP_000389235.1_4284 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKIDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
>SiiA lcl|NZ_CP053416.1_prot_WP_079774927.1_2027 77.619 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMLIMYDNSIKVYKTNIEKHANSKDEKSGDNKKENTNEKVENETISKDSSAESTEMSGKEIGIYDIADDQRIDITSEEKELVITYRGRLRSFSKEDLNKITVWLEDKANSNLLIEMIIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSTASSSTSKAIITTTNKKVPE
species_id(文件较大,包含不同物种的名称)
**Salmonella_enterica_subsp_enterica_Infantis** >lcl|CP052796.1_prot_QJV21904.1_1
**Salmonella_enterica_subsp_enterica_Infantis** >lcl|CP052796.1_prot_QJV21905.1_2
**Salmonella_enterica_subsp_enterica_Infantis** >lcl|CP052796.1_prot_QJV21906.1_3
**Salmonella_enterica_subsp_enterica_Infantis** >lcl|CP052796.1_prot_QJV21907.1_4
**Salmonella_enterica_subsp_enterica_Infantis** >lcl|CP052796.1_prot_QJV21908.1_5
**Salmonella_enterica_subsp_enterica_Infantis** >lcl|CP052796.1_prot_QJV26199.1_6
**Salmonella_enterica_subsp_enterica_Infantis** >lcl|CP052796.1_prot_QJV21909.1_7
我想使用 awk,这样如果两个文件中的 $2 相同,它就会在 fasta.fa 中输入物种名称,因此新文件中的输出将类似于此:
SiiA **Salmonella_enterica_subsp_enterica_Typhimurium_LT2** lcl|NC_003197.2_prot_NP_463122.1_4111 100.000 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTYKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA **Salmonella_enterica_subsp_enterica_Paratyphi_B** lcl|NC_010102.1_prot_WP_000389232.1_4169 99.048 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA **Salmonella_enterica_subsp_enterica_Infantis** lcl|CP052796.1_prot_QJV25805.1_4154 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIESKTKSTAQNSGANDNSNANEIINKEVNTQDMSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA **Salmonella_enterica_subsp_enterica_Paratyphi_A** lcl|NZ_CP009559.1_prot_WP_000389229.1_1106 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNNGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA **Salmonella_enterica_subsp_enterica_Typhi** lcl|NZ_CP029897.1_prot_WP_000389235.1_4284 97.143 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKIDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA **Salmonella_bongori** lcl|NZ_CP053416.1_prot_WP_079774927.1_2027 77.619 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMLIMYDNSIKVYKTNIEKHANSKDEKSGDNKKENTNEKVENETISKDSSAESTEMSGKEIGIYDIADDQRIDITSEEKELVITYRGRLRSFSKEDLNKITVWLEDKANSNLLIEMIIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSTASSSTSKAIITTTNKKVPE
“**” 不在文件中,我只是想把它们放进去向大家展示我在做什么。我试过这两个代码,但没有一个能给出我期望的结果
awk 'FNR==NR{a[NR]=$0;next}{$2=a[FNR]}1' species_id fasta.fa >> final
awk 'NR==FNR {a[$2]=$1; next} $1 in a {$3=$4;$2=$3;$2=a[$1];$4=$5;$5=$6}1' species_id fasta.fa >> final
答案1
如果您的数据正确,此代码应该可以工作。文件 1 和文件 2 中的第二列应该匹配,但目前发布的样本并不匹配。
awk 'FNR==NR{a[$2]=$0; next} ($2 in a){print a[$2]" "$0}' file1 file2