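I am trying to parse antiSMASH output to count the number of BGCs. The scripts I found for this are written in Python, which I know nothing about, so I am trying to solve it with a bash script instead.
A GenBank-format file with a predicted cluster looks like this: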
head -30 sca_11_chr8_3_0.region001.gbk
LOCUS sca_11_chr8_3_0 45390 bp DNA linear UNK 01-JAN-1980
DEFINITION sca_11_chr8_3_0.
ACCESSION sca_11_chr8_3_0
VERSION sca_11_chr8_3_0
KEYWORDS .
SOURCE
ORGANISM
.
COMMENT ##antiSMASH-Data-START##
Version :: 6.0.1-a859617(changed)
Run date :: 2021-10-31 18:00:02
NOTE: This is a single cluster extracted from a larger record!
Orig. start :: 169481
Orig. end :: 214871
##antiSMASH-Data-END##
FEATURES Location/Qualifiers
protocluster 1..45390
/aStool="rule-based-clusters"
/contig_edge="False"
/core_location="join{[194767:194871](-),
[194650:194652](-), [191596:194619](-),
[189481:191503](-)}"
/cutoff="20000"
/detection_rule="cds(Condensation and (AMP-binding or
A-OX))"
/neighbourhood="20000"
/product="NRPS"
/protocluster_number="1"
/tool="antismash"
proto_core complement(join(20001..22022,22116..25138,25170..25171,
tail sca_11_chr8_3_0.region001.gbk
44881 ggagcttgtg gagagaagtg agacgtatcg cacgaatgct cttcagcaga tgctgggcag
44941 ttagaggatt tgcactttag tttcatagag ttgatgtgtc gaggagataa tttgagatac
45001 cagtatatgt aatttaccta cctacctagt cgagattgga cattgtacaa gagaaataac
45061 aactaactat acgagacaag cctgatgtgt tgatagtttc attcatgtct ggtgtttgtg
45121 gcatgtttat gttggagtag ctgtacagaa gataccgcgc tattcccagt gatcatggcc
45181 cccacgcctc caactcggca cctgaccttg atcccctttg ggaagcatgt ctcagtgtct
45241 cagccgtgag ccgtagaggc tgcacagcat ggagaagctg tcctgtcaat tcaggggatt
45301 tgcccacggg ggctatcata tgatgaatct cggacaccct acacgttgtt accgcctttc
45361 ttagctcctg ctggtagccg tcccctgaac
//
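First, I concatenate the gbk files into one so that it contains all the predicted clusters, then grep for the lines that give the locus ID, the start and end of the cluster, and the cluster type.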
cat sca_*.gbk > Necha2_SMclusters.gbk
grep "DEFINITION\|Orig\|product=" Necha2_SMclusters.gbk > Necha2_SMclusters_filtered.txt
This gives me a file like this:
DEFINITION sca_32_chr11_3_0.
Orig. start :: 381231
Orig. end :: 428233
/product="T1PKS"
/product="T1PKS"
/product="T1PKS"
/product="T1PKS"
DEFINITION sca_32_chr11_3_0.
Orig. start :: 464307
Orig. end :: 486217
/product="terpene"
/product="terpene"
/product="terpene"
/product="terpene"
DEFINITION sca_33_chr6_1_0.
Orig. start :: 140267
Orig. end :: 227928
/product="NRPS-like"
/product="T1PKS"
/product="NRPS-like"
/product="T1PKS"
/product="NRPS-like"
/product="NRPS-like"
/product="NRPS-like"
/product="T1PKS"
/product="T1PKS"
/product="T1PKS"
DEFINITION sca_39_chr11_5_0.
Orig. start :: 270154
Orig. end :: 324310
/product="NRPS"
/product="NRPS"
/product="NRPS"
/product="NRPS"
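If only the number of clusters is needed, counting the `Orig. start` lines may already be enough, assuming every region contributes exactly one such line, as in the listing above. A minimal sketch (the function name `count_clusters` and the demo files are made up for illustration):

```shell
#!/usr/bin/env bash
# Count predicted regions by counting "Orig. start" lines.
# Assumption: each region .gbk file holds exactly one such line.
count_clusters() {
    awk '/Orig\. start/ { n++ } END { print n + 0 }' "$@"
}

# Demo on two made-up region files (hypothetical names and coordinates).
demo_dir=$(mktemp -d)
printf 'DEFINITION  demo_a.\nOrig. start  :: 100\nOrig. end    :: 200\n//\n' > "$demo_dir/demo_a.region001.gbk"
printf 'DEFINITION  demo_b.\nOrig. start  :: 300\nOrig. end    :: 400\n//\n' > "$demo_dir/demo_b.region001.gbk"
count_clusters "$demo_dir"/demo_*.gbk   # prints 2
rm -rf "$demo_dir"
```

Because awk reads all files named on the command line, the same function works on the individual region files, on the concatenated file, or on the filtered text file.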
From this filtered file I would like to produce a table like the following.
Locus name start end ClusterType
sca_9_chr7_10_0. 369577 421460 T1PKS,NRPS
sca_33_chr6_1_0. 140267 227928 NRPS-like, T1PKS
sca_32_chr11_3_0 381231 428233 T1PKS
For now, all I need is a single file listing all the predicted clusters.
Thanks a lot!!
Answer 1
Given this sample input:
$ cat file1.gbk
DEFINITION sca_32_chr11_3_0.
foo
Orig. start :: 381231
Orig. end :: 428233
/product="T1PKS"
/product="T1PKS"
bar
/product="T1PKS"
/product="T1PKS"
//
stuff
DEFINITION sca_32_chr11_3_0.
Orig. start :: 464307
Orig. end :: 486217
/product="terpene"
nonsense
/product="terpene"
/product="terpene"
/product="terpene"
//
DEFINITION sca_33_chr6_1_0.
Orig. start :: 140267
Orig. end :: 227928
/product="NRPS-like"
/product="T1PKS"
whatever
/product="NRPS-like"
/product="T1PKS"
/product="NRPS-like"
/product="NRPS-like"
/product="NRPS-like"
/product="T1PKS"
/product="T1PKS"
/product="T1PKS"
$ cat file2.gbk
here we go
DEFINITION sca_39_chr11_5_0.
Orig. start :: 270154
more irrelevant text
Orig. end :: 324310
/product="NRPS"
/product="NRPS"
/product="NRPS"
/product="NRPS"
this script:
$ cat tst.awk
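BEGIN { OFS="\t" }

# A DEFINITION line starts a new record: flush the previous one first.
$1 == "DEFINITION" {
    if ( ++cnt == 1 ) {
        print "Locus name", "start", "end", "ClusterType"
    }
    prt()
    locus = $2
}

/Orig\. start/ { start = $NF }
/Orig\. end/   { end = $NF }

# Strip everything up to /product= and the quotes; using the type as an
# array index de-duplicates repeated cluster types.
sub(".*/product=","") { gsub(/"/,""); types[$NF] }

END { prt() }

# Print the current record (if any), then reset the per-record state.
# ct and type are local variables.
function prt(    ct, type) {
    if ( locus != "" ) {
        for (type in types) {
            ct = (ct=="" ? "" : ct ",") type
        }
        print locus, start, end, ct
    }
    delete types
    locus = ""
}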
will produce the following output:
$ awk -f tst.awk *.gbk
Locus name start end ClusterType
sca_32_chr11_3_0. 381231 428233 T1PKS
sca_32_chr11_3_0. 464307 486217 terpene
sca_33_chr6_1_0. 140267 227928 T1PKS,NRPS-like
sca_39_chr11_5_0. 270154 324310 NRPS
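One caveat: the order in which awk's `for (type in types)` visits array elements is unspecified, which is why `sca_33_chr6_1_0.` shows `T1PKS,NRPS-like` above even though `NRPS-like` comes first in the input. Other awk implementations may order it differently. If a stable order matters, a variant that keeps the types in order of first appearance could look roughly like this. It is a sketch rather than a drop-in replacement, and the names `summarise_clusters`, `seen`, `order` and `n` are introduced here for illustration:

```shell
#!/usr/bin/env bash
# Variant of tst.awk that lists cluster types in order of first appearance.
# seen, order and n are names introduced for this sketch.
summarise_clusters() {
    awk '
    BEGIN { OFS = "\t" }
    $1 == "DEFINITION" {
        if (++cnt == 1) print "Locus name", "start", "end", "ClusterType"
        prt()
        locus = $2
    }
    /Orig\. start/ { start = $NF }
    /Orig\. end/   { end = $NF }
    sub(/.*\/product=/, "") {
        gsub(/"/, "")
        # Record each type once, remembering when it first appeared.
        if (!($NF in seen)) { seen[$NF]; order[++n] = $NF }
    }
    END { prt() }
    function prt(    ct, i) {
        if (locus != "") {
            for (i = 1; i <= n; i++) ct = (i == 1 ? "" : ct ",") order[i]
            print locus, start, end, ct
        }
        # Reset per-record state element by element.
        for (i = 1; i <= n; i++) { delete seen[order[i]]; delete order[i] }
        n = 0
        locus = ""
    }' "$@"
}

# Demo on a made-up record read from standard input.
summarise_clusters <<'EOF'
DEFINITION  demo_locus.
Orig. start  :: 100
Orig. end    :: 200
/product="NRPS-like"
/product="T1PKS"
/product="NRPS-like"
EOF
```

For the demo record this prints the header line followed by `demo_locus.`, `100`, `200` and `NRPS-like,T1PKS`, separated by tabs. On real data it would be called as `summarise_clusters *.gbk`.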