Replace line names with a new set of names without affecting the rest of the file


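I have a large protein sequence file in which each sequence name is marked by a leading >, with the corresponding sequence on the lines that follow.

Example (ignore the quotes):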

>YAL003W EFB1 SGDID:S000000003, Chr I from 142174-142253,142620-143160, Genome Release 64-1-1, Verified ORF, "Translation elongation factor 1 beta; stimulates nucleotide exchange to regenerate EF-1 alpha-GTP for the next elongation cycle; part of the EF-1 complex, which facilitates binding of aminoacyl-tRNA to the ribosomal A site"
MASTDFSKIETLKQLNASLADKSYIEGTAVSQADVTVFKAFQSAYPEFSRWFNHIASKAD
EFDSFPAASAAAAEEEEDDDVDLFGSDDEEADAEAEKLKAERIAAYNAKKAAKPAKPAAK
SIVTLDVKPWDDETNLEEMVANVKAIEMEGLTWGAHQFIPIGFGIKKLQINCVVEDDKVS
LDDLQQSIEEDEDHVQSTDIAAMQKL*

I want to remove most of the name text so that it looks like this (ignore the quotes):

>YAL003W EFB1
MASTDFSKIETLKQLNASLADKSYIEGTAVSQADVTVFKAFQSAYPEFSRWFNHIASKAD
EFDSFPAASAAAAEEEEDDDVDLFGSDDEEADAEAEKLKAERIAAYNAKKAAKPAKPAAK
SIVTLDVKPWDDETNLEEMVANVKAIEMEGLTWGAHQFIPIGFGIKKLQINCVVEDDKVS
LDDLQQSIEEDEDHVQSTDIAAMQKL*

The name takes up only one line while the sequence spans several lines, hence my problem. How can I solve this?

Answer 1

An awk solution:

$ awk '/>/ { print $1, $2; next } { print }' aa
>YAL003W EFB1
MASTDFSKIETLKQLNASLADKSYIEGTAVSQADVTVFKAFQSAYPEFSRWFNHIASKAD
EFDSFPAASAAAAEEEEDDDVDLFGSDDEEADAEAEKLKAERIAAYNAKKAAKPAKPAAK
SIVTLDVKPWDDETNLEEMVANVKAIEMEGLTWGAHQFIPIGFGIKKLQINCVVEDDKVS
LDDLQQSIEEDEDHVQSTDIAAMQKL*
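  • />/ matches any line that contains a >
  • next stops processing the current line, so no further patterns in the awk script are applied to it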
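
The pattern />/ matches a > anywhere in the line. Protein sequence lines would not normally contain one, so that is fine here, but anchoring the pattern as /^>/ is a slightly stricter variant. The following is a minimal sketch under that assumption; the file names and the toy records are made up for illustration, and the result is written to a new file rather than over the original.

```shell
#!/bin/sh
# Sketch: keep only the first two fields of each FASTA header line.
# File names and records below are made-up placeholders.
tmpdir=$(mktemp -d)
in="$tmpdir/proteins.fa"
out="$tmpdir/proteins.short.fa"

cat > "$in" <<'EOF'
>seq1 geneA free-text description, more text
ACDEFGHIKLMNPQRSTVWY
ACDEFG*
>seq2 geneB another description
MKLV*
EOF

# /^>/ only matches lines that START with ">"; every other line is printed as-is.
awk '/^>/ { print $1, $2; next } { print }' "$in" > "$out"

cat "$out"
```

Writing to a separate file keeps the original available in case the result is not what was expected.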

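Answer 2

Here are a few solutions:

  1. grep. The pattern matches either a > followed by two space-separated runs of non-space characters (>[^ ]+ [^ ]+) at the start of a line, or any run of characters (.+). The -o option makes grep print only the matching part of each line, and -P enables Perl-compatible regular expressions (available in GNU grep):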

    $ grep -oP '^(>[^ ]+ [^ ]+|.+)' file.fa 
    >YAL003W EFB1
    MASTDFSKIETLKQLNASLADKSYIEGTAVSQADVTVFKAFQSAYPEFSRWFNHIASKAD
    EFDSFPAASAAAAEEEEDDDVDLFGSDDEEADAEAEKLKAERIAAYNAKKAAKPAKPAAK
    SIVTLDVKPWDDETNLEEMVANVKAIEMEGLTWGAHQFIPIGFGIKKLQINCVVEDDKVS
    LDDLQQSIEEDEDHVQSTDIAAMQKL*
    
  2. awk

    $ awk '{if(/>/){print $1,$2}else{print}}' file.fa 
    >YAL003W EFB1
    MASTDFSKIETLKQLNASLADKSYIEGTAVSQADVTVFKAFQSAYPEFSRWFNHIASKAD
    EFDSFPAASAAAAEEEEDDDVDLFGSDDEEADAEAEKLKAERIAAYNAKKAAKPAKPAAK
    SIVTLDVKPWDDETNLEEMVANVKAIEMEGLTWGAHQFIPIGFGIKKLQINCVVEDDKVS
    LDDLQQSIEEDEDHVQSTDIAAMQKL*
    
  3. GNU sed

    $ sed -r 's/(>[^ ]+ [^ ]+).*/\1/' file.fa 
    >YAL003W EFB1
    MASTDFSKIETLKQLNASLADKSYIEGTAVSQADVTVFKAFQSAYPEFSRWFNHIASKAD
    EFDSFPAASAAAAEEEEDDDVDLFGSDDEEADAEAEKLKAERIAAYNAKKAAKPAKPAAK
    SIVTLDVKPWDDETNLEEMVANVKAIEMEGLTWGAHQFIPIGFGIKKLQINCVVEDDKVS
    LDDLQQSIEEDEDHVQSTDIAAMQKL*
    
  4. Any sed

    $ sed 's/\(>[^ ]* [^ ]*\).*/\1/' file.fa 
    >YAL003W EFB1
    MASTDFSKIETLKQLNASLADKSYIEGTAVSQADVTVFKAFQSAYPEFSRWFNHIASKAD
    EFDSFPAASAAAAEEEEDDDVDLFGSDDEEADAEAEKLKAERIAAYNAKKAAKPAKPAAK
    SIVTLDVKPWDDETNLEEMVANVKAIEMEGLTWGAHQFIPIGFGIKKLQINCVVEDDKVS
    LDDLQQSIEEDEDHVQSTDIAAMQKL*
    
  5. cut

    $ cut -d ' ' -f 1,2 file.fa 
    >YAL003W EFB1
    MASTDFSKIETLKQLNASLADKSYIEGTAVSQADVTVFKAFQSAYPEFSRWFNHIASKAD
    EFDSFPAASAAAAEEEEDDDVDLFGSDDEEADAEAEKLKAERIAAYNAKKAAKPAKPAAK
    SIVTLDVKPWDDETNLEEMVANVKAIEMEGLTWGAHQFIPIGFGIKKLQINCVVEDDKVS
    LDDLQQSIEEDEDHVQSTDIAAMQKL*
    
  6. Perl

    $ perl -lane 'print "@F[0..1]"' file.fa 
    >YAL003W EFB1
    MASTDFSKIETLKQLNASLADKSYIEGTAVSQADVTVFKAFQSAYPEFSRWFNHIASKAD 
    EFDSFPAASAAAAEEEEDDDVDLFGSDDEEADAEAEKLKAERIAAYNAKKAAKPAKPAAK 
    SIVTLDVKPWDDETNLEEMVANVKAIEMEGLTWGAHQFIPIGFGIKKLQINCVVEDDKVS 
    LDDLQQSIEEDEDHVQSTDIAAMQKL* 
    

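    The options are:

    • l: removes the trailing newline from each input line and adds a newline to each print call.
    • a: splits each input line on whitespace into the @F array.
    • n: reads the input file line by line.
    • e: gives the script to run on each line.

    The script itself just prints the first and second fields. For sequence lines, only one field is available, the first, which is the whole line, so that is all it prints (followed by a trailing space, because the second field is empty).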
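
Whichever variant is used, one way to confirm that only the header lines changed is to compare the non-header lines before and after. The sketch below assumes made-up file names and toy records, and uses the awk variant to produce the trimmed file. Note that the Perl variant as written would be reported as different by this check, because of the trailing space it adds to sequence lines.

```shell
#!/bin/sh
# Sketch: confirm that trimming headers left every sequence line untouched.
# File names and records are made-up placeholders.
tmpdir=$(mktemp -d)
orig="$tmpdir/file.fa"
trimmed="$tmpdir/trimmed.fa"

cat > "$orig" <<'EOF'
>seq1 geneA free-text description, more text
ACDEFGHIKLMNPQRSTVWY
ACDEFG*
>seq2 geneB another description
MKLV*
EOF

# Produce the trimmed file (awk variant).
awk '/^>/ { print $1, $2; next } { print }' "$orig" > "$trimmed"

# Compare only the non-header lines of both files.
grep -v '^>' "$orig" > "$tmpdir/orig.seq"
grep -v '^>' "$trimmed" > "$tmpdir/trimmed.seq"

if cmp -s "$tmpdir/orig.seq" "$tmpdir/trimmed.seq"; then
    status="sequences unchanged"
else
    status="sequences differ"
fi
echo "$status"

# The number of records should be the same in both files.
echo "headers: $(grep -c '^>' "$orig") -> $(grep -c '^>' "$trimmed")"
```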
