grep/awk/sed for lines made up of only two letters, and for lines that start with a letter and meet a certain length

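I'm not sure how many people are familiar with DNA sequencing data, but suppose this is part of my file (lines starting with ">" are IDs, lines starting with a letter are DNA sequences):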

>NB501013:9:HJJ75BGXX:4:13609:24076:18015/2
GGGGGGGAAAAAAA
>NB501013:9:HJJ75BGXX:4:21602:19346:16945/2
CTCGTCGCATCACAAAGGGAT
>NB501013:9:HJJ75BGXX:3:11407:17650:13229/2
CCGCGGGCCGGTGCGGGGGTTTTTTTGTTTTTTTGGTTACAACGGGTGGG
>NB501013:9:HJJ75BGXX:3:13509:1817:13239/2
CAGCCC
>NB501013:9:HJJ75BGXX:4:22611:20567:13384/2
GAATA

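I want to remove this line: GGGGGGGAAAAAAA

along with its sequencing ID (I know the ID part can be handled with grep -B1). But does anyone know how to remove lines that are made up of only two distinct letters?

Also, for sequences shorter than 5 letters, I want to delete them together with their IDs. I can't simply grep for lines above a certain length, because all the IDs are long, so I need some way to restrict the length test (e.g. with grep -v) to lines that start with a letter (that is, not with ">").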

So my example output would be:

>NB501013:9:HJJ75BGXX:4:21602:19346:16945/2
CTCGTCGCATCACAAAGGGAT
>NB501013:9:HJJ75BGXX:3:11407:17650:13229/2
CCGCGGGCCGGTGCGGGGGTTTTTTTGTTTTTTTGGTTACAACGGGTGGG
>NB501013:9:HJJ75BGXX:3:13509:1817:13239/2
CAGCCC

Answer 1

Try grep with the Perl Compatible Regular Expressions (PCRE) module, i.e. pcregrep:

  • Removing sequences made up of two letters:

    pcregrep -Mv '>.*\n([ACGT])\1*([ACGT])\2*(\1|\2)*$' file
    

    Output:

    >NB501013:9:HJJ75BGXX:4:21602:19346:16945/2
    CTCGTCGCATCACAAAGGGAT
    >NB501013:9:HJJ75BGXX:3:11407:17650:13229/2
    CCGCGGGCCGGTGCGGGGGTTTTTTTGTTTTTTTGGTTACAACGGGTGGG
    >NB501013:9:HJJ75BGXX:3:13509:1817:13239/2
    CAGCCC
    >NB501013:9:HJJ75BGXX:4:22611:20567:13384/2
    GAATA
    
  • Removing sequences of 5 letters or fewer:

    pcregrep -Mv '>.*\n[ACGT]{1,5}$' file
    

    Output:

    >NB501013:9:HJJ75BGXX:4:13609:24076:18015/2
    GGGGGGGAAAAAAA
    >NB501013:9:HJJ75BGXX:4:21602:19346:16945/2
    CTCGTCGCATCACAAAGGGAT
    >NB501013:9:HJJ75BGXX:3:11407:17650:13229/2
    CCGCGGGCCGGTGCGGGGGTTTTTTTGTTTTTTTGGTTACAACGGGTGGG
    >NB501013:9:HJJ75BGXX:3:13509:1817:13239/2
    CAGCCC
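
If pcregrep is not installed, the two filters above could probably be combined into a single pass with plain awk. This is only a sketch and not part of the original answer: the function name filter_reads and its MIN_LEN argument are made up for illustration, and it assumes every ID line is followed by exactly one sequence line containing only A, C, G, and T.

```shell
# filter_reads MIN_LEN [FILE...]   (hypothetical helper name)
# Keeps an ID/sequence pair only when the sequence has at least
# MIN_LEN letters and uses more than two distinct bases.
# Assumes strict ID-line / sequence-line pairs.
filter_reads() {
  min_len=$1
  shift
  awk -v min_len="$min_len" '
    /^>/ { id = $0; next }            # remember the current ID line
    {
      distinct = 0                    # how many of A, C, G, T occur
      if (index($0, "A")) distinct++
      if (index($0, "C")) distinct++
      if (index($0, "G")) distinct++
      if (index($0, "T")) distinct++
      if (distinct > 2 && length($0) >= min_len) {
        print id
        print $0
      }
    }
  ' "$@"
}
```

The question's wording says "shorter than 5", while its expected output also drops the 5-letter GAATA, so the threshold is left as an argument: filter_reads 6 file drops GAATA, filter_reads 5 file keeps it.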
    

Answer 2

#!/usr/bin/env perl
#
# Usage: thisscriptname < someinputfile

use strict;
use warnings;

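# rash assumption: the input is always pairs of ID and sequence lines
while (defined(my $id = <STDIN>)) {
  # NOTE $id and $seq keep their trailing newline, so they may need
  # chomp() depending on what you do with them...
  my $seq = <STDIN>;
  last unless defined $seq;

  # collect the sequence letters (this skips the newline), then count
  # the distinct ones via a hash (is there also a U or something?
  # been a few decades since AP bio...)
  my @bases = $seq =~ m/[ATGC]/g;
  my %chars;
  $chars{$_}++ for @bases;

  # business logic time! keep reads with more than two distinct letters
  # and at least 5 letters; use "> 5" to also drop 5-letter reads such
  # as GAATA, as in the question's expected output
  if (keys %chars > 2 and @bases >= 5) {
    print $id;
    print $seq;
  }
}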

Answer 3

You could consider reversing the file and testing each DNA sequence; if the test matches, skip that line and the next one (its ID):

tac file |
  awk '!/^>/ && (length($1) < 5 || $1 == "GGGGGGGAAAAAAA") {getline; next} 1' |
  tac
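
The command above matches the one example sequence literally. As a rough sketch of how the same reverse-and-skip idea might test for "two or fewer distinct letters" in general, the awk function below counts distinct characters instead. The names drop_reads_reversed and ndistinct are hypothetical, tac is assumed to be available (GNU coreutils), and the length threshold is left as an argument.

```shell
# drop_reads_reversed MIN_LEN [FILE...]   (hypothetical helper name)
# Same reverse / filter / reverse pipeline, with a general
# distinct-letter test in place of the hard-coded sequence.
drop_reads_reversed() {
  min_len=$1
  shift
  tac "$@" |
    awk -v min_len="$min_len" '
      # count the distinct characters in s
      function ndistinct(s,    i, c, seen, n) {
        n = 0
        for (i = 1; i <= length(s); i++) {
          c = substr(s, i, 1)
          if (!(c in seen)) { seen[c] = 1; n++ }
        }
        return n
      }
      # in the reversed stream each sequence line comes before its ID,
      # so a rejected sequence also consumes the following ID line
      !/^>/ && (length($0) < min_len || ndistinct($0) <= 2) { getline; next }
      1
    ' |
    tac
}
```

Called as drop_reads_reversed 5 file, this keeps the original length($1) < 5 test; a value of 6 would also drop 5-letter reads such as GAATA.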
