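I have a very large protein sequence file in which each sequence name is marked by a leading > and is followed by the corresponding sequence on the lines below it.
Example (ignore the quotes):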
>YAL003W EFB1 SGDID:S000000003, Chr I from 142174-142253,142620-143160, Genome Release 64-1-1, Verified ORF, "Translation elongation factor 1 beta; stimulates nucleotide exchange to regenerate EF-1 alpha-GTP for the next elongation cycle; part of the EF-1 complex, which facilitates binding of aminoacyl-tRNA to the ribosomal A site"
MASTDFSKIETLKQLNASLADKSYIEGTAVSQADVTVFKAFQSAYPEFSRWFNHIASKAD
EFDSFPAASAAAAEEEEDDDVDLFGSDDEEADAEAEKLKAERIAAYNAKKAAKPAKPAAK
SIVTLDVKPWDDETNLEEMVANVKAIEMEGLTWGAHQFIPIGFGIKKLQINCVVEDDKVS
LDDLQQSIEEDEDHVQSTDIAAMQKL*
I want to remove most of the name text so that it looks like this (ignore the quotes):
>YAL003W EFB1
MASTDFSKIETLKQLNASLADKSYIEGTAVSQADVTVFKAFQSAYPEFSRWFNHIASKAD
EFDSFPAASAAAAEEEEDDDVDLFGSDDEEADAEAEKLKAERIAAYNAKKAAKPAKPAAK
SIVTLDVKPWDDETNLEEMVANVKAIEMEGLTWGAHQFIPIGFGIKKLQINCVVEDDKVS
LDDLQQSIEEDEDHVQSTDIAAMQKL*
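The name takes up a single line while the sequence spans several lines, which is where I get stuck. How can I solve this?
Answer 1
An awk solution: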
$ awk '/>/ { print $1, $2; next } { print }' aa
>YAL003W EFB1
MASTDFSKIETLKQLNASLADKSYIEGTAVSQADVTVFKAFQSAYPEFSRWFNHIASKAD
EFDSFPAASAAAAEEEEDDDVDLFGSDDEEADAEAEKLKAERIAAYNAKKAAKPAKPAAK
SIVTLDVKPWDDETNLEEMVANVKAIEMEGLTWGAHQFIPIGFGIKKLQINCVVEDDKVS
LDDLQQSIEEDEDHVQSTDIAAMQKL*
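- />/ matches any line that contains >
- next stops processing the current line, so no further patterns in the awk script are tried for it, and awk moves on to the next input line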
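The same idea can be made slightly more defensive. The following is only a sketch under a few assumptions: the sample records are made up and shortened, the temporary file stands in for the real input, and the NF test is an addition for headers that contain only one field (so no trailing space is printed for them).

```shell
# Build a small throwaway FASTA file (made-up, shortened records).
sample=$(mktemp)
printf '%s\n' \
  '>YAL003W EFB1 SGDID:S000000003, Chr I, Verified ORF' \
  'MASTDFSKIETLKQLNASLADKSYIEGTAVSQ' \
  'LDDLQQSIEEDEDHVQSTDIAAMQKL*' \
  '>YAL001C' \
  'MVLTIYPDELVQIVSDKIASNK*' > "$sample"

# ^> anchors the match to the start of the line; the NF test avoids a
# trailing space when a header has only one field.
trimmed=$(awk '/^>/ { print (NF > 1 ? $1 " " $2 : $1); next } { print }' "$sample")
printf '%s\n' "$trimmed"
rm -f "$sample"
```

For a real file, the awk command would be run on that file directly and its output redirected to a new file.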
Answer 2
Here are some solutions:
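grep. The pattern matches either a line starting with > followed by two runs of non-space characters ([^ ]+ [^ ]+), or any run of characters (.+). The -o option makes grep print only the matched part of each line, and -P (GNU grep) enables Perl-compatible regular expressions:
$ grep -oP '^(>[^ ]+ [^ ]+|.+)' file.fa
>YAL003W EFB1
MASTDFSKIETLKQLNASLADKSYIEGTAVSQADVTVFKAFQSAYPEFSRWFNHIASKAD
EFDSFPAASAAAAEEEEDDDVDLFGSDDEEADAEAEKLKAERIAAYNAKKAAKPAKPAAK
SIVTLDVKPWDDETNLEEMVANVKAIEMEGLTWGAHQFIPIGFGIKKLQINCVVEDDKVS
LDDLQQSIEEDEDHVQSTDIAAMQKL*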
awk:
$ awk '{if(/>/){print $1,$2}else{print}}' file.fa
>YAL003W EFB1
MASTDFSKIETLKQLNASLADKSYIEGTAVSQADVTVFKAFQSAYPEFSRWFNHIASKAD
EFDSFPAASAAAAEEEEDDDVDLFGSDDEEADAEAEKLKAERIAAYNAKKAAKPAKPAAK
SIVTLDVKPWDDETNLEEMVANVKAIEMEGLTWGAHQFIPIGFGIKKLQINCVVEDDKVS
LDDLQQSIEEDEDHVQSTDIAAMQKL*
GNU sed:
$ sed -r 's/(>[^ ]+ [^ ]+).*/\1/' file.fa
>YAL003W EFB1
MASTDFSKIETLKQLNASLADKSYIEGTAVSQADVTVFKAFQSAYPEFSRWFNHIASKAD
EFDSFPAASAAAAEEEEDDDVDLFGSDDEEADAEAEKLKAERIAAYNAKKAAKPAKPAAK
SIVTLDVKPWDDETNLEEMVANVKAIEMEGLTWGAHQFIPIGFGIKKLQINCVVEDDKVS
LDDLQQSIEEDEDHVQSTDIAAMQKL*
Any sed:
$ sed 's/\(>[^ ]* [^ ]*\).*/\1/' file.fa
>YAL003W EFB1
MASTDFSKIETLKQLNASLADKSYIEGTAVSQADVTVFKAFQSAYPEFSRWFNHIASKAD
EFDSFPAASAAAAEEEEDDDVDLFGSDDEEADAEAEKLKAERIAAYNAKKAAKPAKPAAK
SIVTLDVKPWDDETNLEEMVANVKAIEMEGLTWGAHQFIPIGFGIKKLQINCVVEDDKVS
LDDLQQSIEEDEDHVQSTDIAAMQKL*
cut:
$ cut -d ' ' -f 1,2 file.fa
>YAL003W EFB1
MASTDFSKIETLKQLNASLADKSYIEGTAVSQADVTVFKAFQSAYPEFSRWFNHIASKAD
EFDSFPAASAAAAEEEEDDDVDLFGSDDEEADAEAEKLKAERIAAYNAKKAAKPAKPAAK
SIVTLDVKPWDDETNLEEMVANVKAIEMEGLTWGAHQFIPIGFGIKKLQINCVVEDDKVS
LDDLQQSIEEDEDHVQSTDIAAMQKL*
Perl:
$ perl -lane 'print "@F[0..1]"' file.fa
>YAL003W EFB1
MASTDFSKIETLKQLNASLADKSYIEGTAVSQADVTVFKAFQSAYPEFSRWFNHIASKAD
EFDSFPAASAAAAEEEEDDDVDLFGSDDEEADAEAEKLKAERIAAYNAKKAAKPAKPAAK
SIVTLDVKPWDDETNLEEMVANVKAIEMEGLTWGAHQFIPIGFGIKKLQINCVVEDDKVS
LDDLQQSIEEDEDHVQSTDIAAMQKL*
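The options are:
- l: removes the trailing newline from each input line and adds a newline to each print call.
- a: splits each input line on whitespace into the @F array.
- n: reads the input file line by line.
- e: runs the given script on each line.
The script itself just prints the first and second fields. For sequence lines, it prints the only field available, the first one, which is the whole line.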