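I want to delete everything that is not inside the square brackets, and the brackets themselves, but only on lines that start with ">". Is there an alternative to sed for this? I would also like to sort the records alphabetically, where a record is a line starting with ">" together with the line that follows it.
Example input: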
>ID:000:FLKLNFIA_00192 |[Ignicoccus_hospitalis_KIN4-I.gbfspecies]|strain|Ignicoccus_hospitalis_KIN4-I.gbf|LSU ribosomal protei..|447|FLKLNFIA_1(1297538):162644-163090:1 ^^ Archaeagenomesparanahui Ignicoccus_hospitalis_KIN4-I.gbfspecies strain strain.|neighbours:ID:000:FLKLNFIA_00191(1),ID:000:FLKLNFIA_00193(1)|neighbour_genes:LSU ribosomal protei..,SSU ribosomal protei..|
ATGAGTGTGACTA---TTT---GCAATCAGCTAGCTACTACGTACTGATCGTAGCTGACG
>ID:000:MGCDKLCO_01184 |[Archaeoglobus_fulgidus_DSM_4304.gbfspecies]|strain|Archaeoglobus_fulgidus_DSM_4304.gbf|50S ribosomal protei..|471|MGCDKLCO_1(2178400):1005279-1005749:1 ^^ Archaeagenomesparanahui Archaeoglobus_fulgidus_DSM_4304.gbfspecies strain strain.|neighbours:ID:000:MGCDKLCO_01183(1),ID:000:MGCDKLCO_01185(1)|neighbour_genes:LSU ribosomal protei..,SSU ribosomal protei..|
ATGCGCGCGATAGCTAGCTAGCTAGCTTTAGGGGGATTAGCTA----ACTCTGATTCGGA
Expected output:
>Archaeoglobus_fulgidus_DSM_4304.gbfspecies
ATGCGCGCGATAGCTAGCTAGCTAGCTTTAGGGGGATTAGCTA----ACTCTGATTCGGA
>Ignicoccus_hospitalis_KIN4-I.gbfspecies
ATGAGTGTGACTA---TTT---GCAATCAGCTAGCTACTACGTACTGATCGTAGCTGACG
Thanks
Answer 1
With perl:
perl -ne 'push @l, ">" . join("", /\[(.*?)\]/g) . "\n" . <>;
END{print for sort @l}' your-file
With sed:
<your-file sed 's/^[^[]*\[/>/
s/\][^[]*\[\{0,1\}//g
N;s/\n/\[/' |
sort |
tr '[' '\n'
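The sed pipeline above joins each header to its sequence line, sorts the joined records, and then splits them again. Since the question asks for a sed alternative, here is a minimal awk sketch of the same join, sort, split idea. The function name sort_fasta_by_bracket is made up for illustration. It assumes every header is followed by exactly one sequence line, that no line contains a tab, and it keeps only the first [...] pair of each header (the perl and sed versions concatenate all pairs, which is the same thing for the sample input).

```shell
# Hypothetical helper, not from the answers above.
# Assumes one sequence line per header and no tab characters in the data.
sort_fasta_by_bracket() {
  awk '
    /^>/ {
      header = $0
      # Keep only the text inside the first [...] pair, if there is one.
      if (match($0, /\[[^]]*\]/))
        header = ">" substr($0, RSTART + 1, RLENGTH - 2)
      next
    }
    # Sequence line: emit "header<TAB>sequence" as one sortable record.
    { print header "\t" $0 }
  ' "$@" | sort | tr '\t' '\n'
}
```

Call it as sort_fasta_by_bracket your-file, or pipe the data into it with no arguments.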
Answer 2
My (complicated) suggestion:
cat file | grep -Po "^[CGTA-]*$|^>.*$" | grep -Po "(?<=\[).*(?=])|^[ACGT-]*$" | awk '{printf (NR%2==0) ? $0 "\n" : ">"$0"#"}' | sort | sed 's/#/\n/'
Grep only the lines made up of the characters CGTA- and the lines starting with >
grep -Po "^[CGTA-]*$|^>.*$"
Grep only the content inside the brackets (excluding the brackets themselves) and the lines matching the pattern ACGT-
| grep -Po "(?<=\[).*(?=])|^[ACGT-]*$"
Join every two lines, adding the separator # between them and a leading >, then sort
| awk '{printf (NR%2==0) ? $0 "\n" : ">"$0"#"}' | sort
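To see what the join step produces before the separator is turned back into a newline, here is a small sketch on made-up input. The helper name join_pairs and the four input lines are only for illustration. The awk body is the same as in the step above, except for an explicit "%s" format so that a stray % in the data cannot be read as a format specifier.

```shell
# Hypothetical helper name; odd lines become ">name#", even lines end the record.
join_pairs() {
  awk '{ printf "%s", (NR % 2 == 0) ? $0 "\n" : ">" $0 "#" }'
}

# Made-up input: a name line followed by its sequence line, twice.
printf '%s\n' 'Ignicoccus' 'ATGA' 'Archaeoglobus' 'ATGC' | join_pairs | sort
```

This prints >Archaeoglobus#ATGC followed by >Ignicoccus#ATGA, so each record travels through sort as a single line.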
Finally, replace the separator # with a newline
| sed 's/#/\n/'
Output:
>Archaeoglobus_fulgidus_DSM_4304.gbfspecies
ATGCGCGCGATAGCTAGCTAGCTAGCTTTAGGGGGATTAGCTA----ACTCTGATTCGGA
>Ignicoccus_hospitalis_KIN4-I.gbfspecies
ATGAGTGTGACTA---TTT---GCAATCAGCTAGCTACTACGTACTGATCGTAGCTGACG