Error when using AWK to filter 2 files based on a given condition


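First of all, thank you for your help. I am having trouble filtering 2 files with an AWK condition. The first of the two files I want to filter is Fasta.fa: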

>SiiA   lcl|NC_003197.2_prot_NP_463122.1_4111   100.000 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTYKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
>SiiA   lcl|NC_010102.1_prot_WP_000389232.1_4169    99.048  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
>SiiA   lcl|CP052796.1_prot_QJV25805.1_4154 97.143  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIESKTKSTAQNSGANDNSNANEIINKEVNTQDMSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
>SiiA   lcl|NZ_CP009559.1_prot_WP_000389229.1_1106  97.143  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNNGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
>SiiA   lcl|NZ_CP029897.1_prot_WP_000389235.1_4284  97.143  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKIDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
>SiiA   lcl|NZ_CP053416.1_prot_WP_079774927.1_2027  77.619  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMLIMYDNSIKVYKTNIEKHANSKDEKSGDNKKENTNEKVENETISKDSSAESTEMSGKEIGIYDIADDQRIDITSEEKELVITYRGRLRSFSKEDLNKITVWLEDKANSNLLIEMIIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSTASSSTSKAIITTTNKKVPE

species_id (a large file containing the names of the different species):

**Salmonella_enterica_subsp_enterica_Infantis** >lcl|CP052796.1_prot_QJV21904.1_1
**Salmonella_enterica_subsp_enterica_Infantis** >lcl|CP052796.1_prot_QJV21905.1_2
**Salmonella_enterica_subsp_enterica_Infantis** >lcl|CP052796.1_prot_QJV21906.1_3
**Salmonella_enterica_subsp_enterica_Infantis** >lcl|CP052796.1_prot_QJV21907.1_4
**Salmonella_enterica_subsp_enterica_Infantis** >lcl|CP052796.1_prot_QJV21908.1_5
**Salmonella_enterica_subsp_enterica_Infantis** >lcl|CP052796.1_prot_QJV26199.1_6
**Salmonella_enterica_subsp_enterica_Infantis** >lcl|CP052796.1_prot_QJV21909.1_7

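I want to use awk so that, whenever $2 is the same in both files, the species name is inserted into the matching line of fasta.fa. The output in the new file should look like this: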

SiiA    **Salmonella_enterica_subsp_enterica_Typhimurium_LT2**  lcl|NC_003197.2_prot_NP_463122.1_4111   100.000 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTYKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA    **Salmonella_enterica_subsp_enterica_Paratyphi_B**  lcl|NC_010102.1_prot_WP_000389232.1_4169    99.048  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA    **Salmonella_enterica_subsp_enterica_Infantis** lcl|CP052796.1_prot_QJV25805.1_4154 97.143  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIESKTKSTAQNSGANDNSNANEIINKEVNTQDMSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA    **Salmonella_enterica_subsp_enterica_Paratyphi_A**  lcl|NZ_CP009559.1_prot_WP_000389229.1_1106  97.143  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNNGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA    **Salmonella_enterica_subsp_enterica_Typhi**    lcl|NZ_CP029897.1_prot_WP_000389235.1_4284  97.143  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDNSNANEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKIDITSTKNELVITYHGRLRSFSEEDTHKIEAWLEDKTNSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiA    **Salmonella_bongori**  lcl|NZ_CP053416.1_prot_WP_079774927.1_2027  77.619  100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMLIMYDNSIKVYKTNIEKHANSKDEKSGDNKKENTNEKVENETISKDSSAESTEMSGKEIGIYDIADDQRIDITSEEKELVITYRGRLRSFSKEDLNKITVWLEDKANSNLLIEMIIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSTASSSTSKAIITTTNKKVPE

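The "**" markers are not in the files; I only added them to highlight what I am trying to do. I have tried these two commands, but neither gives the result I expect: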

awk 'FNR==NR{a[NR]=$0;next}{$2=a[FNR]}1' species_id fasta.fa >> final
awk 'NR==FNR {a[$2]=$1; next} $1 in a {$3=$4;$2=$3;$2=a[$1];$4=$5;$5=$6}1' species_id fasta.fa >> final

Answer 1

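If your data is correct, this code should work. The second column of file 1 and file 2 has to match, but in the samples posted so far it does not: the IDs in species_id carry a leading ">" that the IDs in Fasta.fa lack, and none of the accession numbers shown appear in both files.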

awk 'FNR==NR{a[$2]=$0; next} ($2 in a){print a[$2]" "$0}' file1 file2
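
As a minimal sketch of how the layout requested in the question could be produced, assuming the only systematic difference between the two ID columns is the leading ">" in species_id: the script below strips that ">" before building the lookup table, then inserts the species name after the first column. The `.sample` file names and the shortened records (sequence truncated to MEDES) are made up for illustration and are not the real data.

```shell
# Hypothetical stand-in for species_id: species name, then ">"-prefixed ID.
printf '%s\t%s\n' \
  'Salmonella_enterica_subsp_enterica_Infantis' '>lcl|CP052796.1_prot_QJV25805.1_4154' \
  'Salmonella_bongori' '>lcl|NZ_CP053416.1_prot_WP_079774927.1_2027' > species_id.sample

# Hypothetical stand-in for Fasta.fa: sequences shortened to "MEDES".
printf '%s\t%s\t%s\t%s\t%s\n' \
  '>SiiA' 'lcl|CP052796.1_prot_QJV25805.1_4154' '97.143' '100' 'MEDES' \
  '>SiiA' 'lcl|NZ_CP053416.1_prot_WP_079774927.1_2027' '77.619' '100' 'MEDES' > fasta.sample

awk '
  BEGIN { OFS = "\t" }
  # First file (species_id): map the ID, minus its leading ">", to the species name.
  FNR == NR { id = $2; sub(/^>/, "", id); species[id] = $1; next }
  # Second file: when the ID is known, drop the ">" from column 1 and
  # insert the species name between columns 1 and 2.
  ($2 in species) {
      sub(/^>/, "", $1)
      $1 = $1 OFS species[$2]
      print
  }
' species_id.sample fasta.sample > final.sample

cat final.sample
```

Records whose ID is not found in species_id are skipped, just as in the one-liner above. The output is tab-separated; change OFS if a different separator is wanted.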
