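I'm not sure how many people here are familiar with DNA sequencing data, but suppose this is part of my file (lines starting with ">" are IDs, lines starting with a letter are DNA sequences):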
>NB501013:9:HJJ75BGXX:4:13609:24076:18015/2
GGGGGGGAAAAAAA
>NB501013:9:HJJ75BGXX:4:21602:19346:16945/2
CTCGTCGCATCACAAAGGGAT
>NB501013:9:HJJ75BGXX:3:11407:17650:13229/2
CCGCGGGCCGGTGCGGGGGTTTTTTTGTTTTTTTGGTTACAACGGGTGGG
>NB501013:9:HJJ75BGXX:3:13509:1817:13239/2
CAGCCC
>NB501013:9:HJJ75BGXX:4:22611:20567:13384/2
GAATA
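I want to delete this line: GGGGGGGAAAAAAA
along with its sequencing ID (I know grep -B1 can pull in the ID line). But does anyone know how to delete lines that are made up of only two letters?
Also, I want to delete sequences shorter than 5 letters together with their IDs. I can't simply grep for lines over a certain length, because all the IDs are long, so I somehow need to use grep -v on lines that start with a letter (so not with ">") and are over a certain length.
So my example output would be: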
>NB501013:9:HJJ75BGXX:4:21602:19346:16945/2
CTCGTCGCATCACAAAGGGAT
>NB501013:9:HJJ75BGXX:3:11407:17650:13229/2
CCGCGGGCCGGTGCGGGGGTTTTTTTGTTTTTTTGGTTACAACGGGTGGG
>NB501013:9:HJJ75BGXX:3:13509:1817:13239/2
CAGCCC
Answer 1
Try pcregrep (grep using Perl Compatible Regular Expressions):
To remove sequences made up of only two letters:
pcregrep -Mv '>.*\n([ACGT])\1*([ACGT])\2*(\1|\2)*$' file
Output:
>NB501013:9:HJJ75BGXX:4:21602:19346:16945/2
CTCGTCGCATCACAAAGGGAT
>NB501013:9:HJJ75BGXX:3:11407:17650:13229/2
CCGCGGGCCGGTGCGGGGGTTTTTTTGTTTTTTTGGTTACAACGGGTGGG
>NB501013:9:HJJ75BGXX:3:13509:1817:13239/2
CAGCCC
>NB501013:9:HJJ75BGXX:4:22611:20567:13384/2
GAATA
To remove sequences of 5 letters or fewer:
pcregrep -Mv '>.*\n[ACGT]{1,5}$' file
Output:
>NB501013:9:HJJ75BGXX:4:13609:24076:18015/2
GGGGGGGAAAAAAA
>NB501013:9:HJJ75BGXX:4:21602:19346:16945/2
CTCGTCGCATCACAAAGGGAT
>NB501013:9:HJJ75BGXX:3:11407:17650:13229/2
CCGCGGGCCGGTGCGGGGGTTTTTTTGTTTTTTTGGTTACAACGGGTGGG
>NB501013:9:HJJ75BGXX:3:13509:1817:13239/2
CAGCCC
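If pcregrep is not available, the length filter can probably be approximated with plain grep and -B1, as the question hints. This is only a sketch and not part of the original answer: keep_long_reads is a hypothetical helper name, and it assumes every record is exactly one ID line followed by one sequence line containing only A, C, G and T. grep prints a "--" separator between non-adjacent groups of context lines, so a second grep strips it.

```shell
# Hypothetical helper: keep_long_reads MIN_LEN < input
# Keeps sequence lines with at least MIN_LEN letters plus the ID line above each.
keep_long_reads() {
    # -B1 also prints the line before each match (the ID line);
    # the second grep drops the "--" group separators that -B1 emits.
    grep -B1 -E "^[ACGT]{$1,}\$" | grep -v '^--$'
}
```

For example, keep_long_reads 6 < file would drop GAATA and its ID while keeping CAGCCC.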
Answer 2
#!/usr/bin/env perl
#
# Usage: thisscriptname < someinputfile
use strict;
use warnings;

# rash assumption there are always pairs of ID and sequence lines
while (defined(my $id = <>)) {
    my $seq = <>;
    last unless defined $seq;

    # NOTE both lines still contain a newline; work on a chomped copy
    # so the newline is not counted in the sequence length
    chomp(my $bases = $seq);

    # calculate unique sequence letters via hash (is there also a U
    # or something? been a few decades since AP bio...)
    my %chars;
    $chars{$_}++ for $bases =~ m/[ATGC]/g;

    # business logic time! keep the record only if it has more than
    # two distinct letters and at least 5 letters in total
    if (keys(%chars) > 2 and length($bases) >= 5) {
        print $id;
        print $seq;
    }
}
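Answer 3
You might consider reversing the file and testing each DNA sequence; if a sequence matches the removal test, skip that line and the next one (which, in the reversed file, is its ID):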
tac file |
awk '!/^>/ && (length($1) < 5 || $1 == "GGGGGGGAAAAAAA") {getline; next} 1' |
tac
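The command above hard-codes the one offending sequence. A single-pass variant that counts distinct letters instead might look like the following. It is a sketch under assumptions, not the answerer's method: filter_reads, min_len and min_distinct are hypothetical names, and it assumes strict two-line records (one ID line, then one sequence line).

```shell
# Hypothetical helper: filter_reads MIN_LEN MIN_DISTINCT < input
# Prints only records whose sequence has at least MIN_LEN letters
# and at least MIN_DISTINCT different letters.
filter_reads() {
    awk -v min_len="$1" -v min_distinct="$2" '
        /^>/ { id = $0; next }            # remember the ID line
        {
            n = 0
            split("", seen)               # portable way to empty the array
            for (i = 1; i <= length($0); i++) {
                c = substr($0, i, 1)
                if (!(c in seen)) { seen[c] = 1; n++ }
            }
            if (length($0) >= min_len && n >= min_distinct) {
                print id
                print $0
            }
        }
    '
}
```

With the sample data, filter_reads 6 3 < file should reproduce the expected output in the question: GGGGGGGAAAAAAA has only two distinct letters, and GAATA has 5 letters, so a minimum length of 6 is what removes it.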