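I want to create separate files based on the first chromosome named in each header line. There are 24 chromosomes across the header lines, and each header is followed by 2 lines containing the sequences. The file layout is:
Header
Human sequence
Other genome sequence
However, all chromosome sequences are merged into one file, and I want to split it into separate per-chromosome files, each containing its headers and their sequence pairs. I wrote a Python script for this, but uploading the large merged file to the cluster takes a long time and often ends in connection errors. So I would like to do it with a Bash script instead.
The idea is to search the second column of the header line for "chrY" (or any other chr name), then write that header line and the 2 sequence lines that follow it to a separate file.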
2057524 chrY 68 170 chrX 23685 23787 - 4125
TCCAGACTACCAGACACAAGACATTACACATTGTAATGCATTAAATGCATAGTTTTAACAGTAATAATTTAAAAGAGATTTAGAATTTTATAATGTTTGGAAA
TCAAGGCCCCGGGCTACCTGACATTACCCTCATTAATGCATGAAATGCAGAATATTAACATGAGCAATTTAAGATGAACTTAAGATTCTGTAATGTTTAGAAA
Details:
2057524 chrY (human chromosome) 68 170 chrX (other genome chromosome) 23685 23787 - 4125 -> header line
TCCAGACTACCAGACACAAGACATTACACATTGTAATGCATTAAATGCATAGTTTTAACAGTAATAATTTAAAAGAGATTTAGAATTTTATAATGTTTGGAAA (human sequence)
TCAAGGCCCCGGGCTACCTGACATTACCCTCATTAATGCATGAAATGCAGAATATTAACATGAGCAATTTAAGATGAACTTAAGATTCTGTAATGTTTAGAAA (other genome sequence)
For testing:
2057521 chr10 57211219 57211230 NW_007726181v1 1018288 1018299 + 575
CTGGGCACTATG
CTGAGCGCGGTG
2057522 chr2 57211231 57214400 NW_007726181v1 1018406 1021615 + 116172
GTTTtgagcttgt----acccagcgctgcttttgccttgctctgtgaccccaggcaagctgcctcacctctctgggccagtttccccat-cgtacagtggTGCTGCACACCCTGGCCCTGGCCC-CGAGGTGGCTGGGAGGTGGCTCCTCAAACAGCCGCTTTCTCATCAGTGCCCGGTGCTGGGT-CAGGGATCGACTGAGGCTCT--GAGCTAACTAGGAAACACAGTGGCCTTG--GAGGGCTGGGGAGTGTCATGGGGGTG---GGGACAGGGAGCCACCGGTCGCATGTGACTGAACTCTT-----------------CACCCCAGTCTGTGGCTTTCCCGTTGCAGTGAGAGCCACGAGCCAAGGTGGGCACTTGATGTCGGATCTCTTCAACAAGCTGGTCATGAGGCGCAAGGGTAGGAGGCAGGGCCGCTGCCCGCCCTGGGTCGGCACCT---------------TGTAATTCTGTCCTGCCTTTTTCTTCCTGTATTTAAGTCTCCGGGGGCTGGGGGAATCAGGGTTTCCCACCAACCACCCTCACTCAGCCTTTTCCC-TCCAGGCATCTCTGGGAAAGGACCT------GGGGCTGGTGAGGGGCCCGGAGGAGCCTTTGCCCGCGTGTCAGACTCCATCCCTCCTGTGCCCCCACCGCAACAGCCACAGGCAGAGGAGGACGAGGACGACTGGGAATCCTAGGGGGCTCCATGACACCTTCCCCCCCAGACCCAGACTTGGGCTGTTGCTCTGACATGGACACAGCCAGGACAAGCTGCTCAGACCTGCTTCCCTGG-GAGGGGGTGACGGAACCAGCACTG---------TGTG-GAGACCAGCTTCAAGGAGCGGAAGGCTGGCTTGAGGCCACACAGCTGGGGCG---GGGACTT-CTGTCTGCCTGTGCTCCATGGGGGGACGGCTCCACC--------CAGCCTGCGCCACTGTGTTCTTAAGAGGCTTCCAGAGAAAACGGCA-CACCAATCAATA-----------AAGAACTGAGCAGAAACCAACAGTGTGCTTTTAATAAAGGACCTCTAGCTGTGCAGGATGCAAACGTCTCGGGGTCAGTGACTGCCTCCTGCCCCTGTTGGTCCCTAGGCAGTGGGGGCAGAAGCTCCCAGCTGACCTG------TTTCTCTGGGAGAGAAGGGCAGTCAGCAGGGGCAGCTGTTGCAGATGGGAGGAATAG--------TCTCCCACA----AAAAAGGTTTCAGTGACAGACACGGGGTCTCTAAAAATAGTCATGCTGAGAGCCCAATGGCCCTTGGCACAATTGCTGGTGTTGGGGTAGAAGATGTCTTGGAGTTTGCTCAAGTGGTTGAGAGGGAGGGAGGTGCCATCAACTT---GGAGGAACTGGCACCAAGCCAGGGAGATAGAAATCCAGGCAAGGCTGTGGGGCAGGTTAGGGAGCAAGGCTGCAGGGGTGACTCAGGAAGAAGGTGGGGGAGGTGACAAGCCCCCAGGCAGGGGCCCTGTGGCC-------------ATGGGGATCTTTTTAAATTGAGACTAGGGGGTGAATAGTCCAGGGCAGCTAACTTTAGTTATTATAGAAAG-GGCAGTAGCAGATGGGTCTG-CTCCGTCTCGCTTCTAAGAAGGTGG---------------GCAGGACAAATGGCAGCCTCCTGCAGAGGCCCAGTGAGAAGCCTGGCCC-------TCGGCCAC-----ACAGGATGGAAGACAGATTGGATTCCACAGAGGGGAGCTGCCCTGGGAAGATCTCACGGATGGCCAGGACCCACCATTTCTTCGGGGTTCCCCT-GTTTTCTCCAACGGGCACTAATGCCTGTGCCTGGGTCCTGGCAACAC----------------------TCTGGACTCCACACTCT--TCTGGGTTTCACCTTTGTA-GCAGGATCCCTGCAGATCAGGCCCATGACAAACACCGTCTCCAGCGGGC
AGAGCAAAGGAAGGGCGCAGCGCCAGGCAGTGGTGCAGCTGCCTGTCAGGAAGAGGCCTACTTCT---GGTGAAACTGGGCAGAC---AAAAGGCAGTGAGAAATGTGATCTCGGGGTGGTGGAGGCTC-TAGGGAAAGGAAAAGGCAGGAGTGAACTTCCACACAGCAGCAATGGCAGAACCAAAGGTGGCTTTGACCTCCACGAGGGCTCAGATCCAGGCCAACAGCTTGTCCAGGACAGGGTGCCGGGTGTATCACTAATCCAGGAGCACTATGCTGGCAGAATCCCTTTGGTGCCTGATGGCCCTGCCTTCGTGGGAACAGAGGCTAAGGCTTTGAGTTACAGCTGCCTCCCCAACAGTGCATCCCCTTCTCCTTCCTCAGCCTCAGGTAGGAGACAGGGCAGGCAACCCCCCTTTCCTCTTCTCCCCTTCTCCAGCCCCTGTCTGTCCACCCAGCTGGAGGCAG--CCAGGCTTGCCTATGGACTGGTTGACAGCCTTCATGCACAGGTTCTCCACCAGAGCCTTTCTTGGGGGCCCCTGGCT--GGGCTCTGAGCTGGGAGTGAAGGGGATGACCCATGCGGACTGTTTGCTGC-------------TTGTAGCTTTCCCTGGGA-AAGACTCTGCCAGGCCTTGGAGCCAGACCAGGAGGCTTTATAGGCCACTGCAAGCAGCAGGGCTCCAGATGACATCACAGGGAATATCAAGAGGGTGTGGAGGGGCATCGAAGCCTCTCCAGGAG---ACAG----GAGAC---GCCGGCCCAGTAGAGCCCTAGGGGCGACGCCACTCCCACTCACTGTCTACTCTCCTCTCACCTCTGCAACACTGGGGACACTCACAAGATTGTGATCCAAGTCGGCCGTCGTCTTCTGCAGCTCTGGAGACCTGATGCTGGGGAAGGGCATGCCTGGCATCACCACACACCTGGGAGGAGACAGGAGCCTG-GGGCCGGTGG---------------------GCCCACACATCACCAGCTGCTCCGTTCTACCATTTCTTCAGCCCTCTTGGCTGTGC-CTGCGGCTCTGCCCCTCCCCTCTCTGCACCTACCACCCAGAGAGGGCTTGTTGAGCTCAGAGATCCCACCTAGGCCAATCCACTGGGTTCTGTGGCAGCGATGGCCTGCCTGATCTTCCACCTGCTCTCCCAGGGCCAAAGCCAGACCTGCTGAGCCCCTCCC--TCCAGCCGGCTGGT-CTGAGCAGTCACAGCCCGGCTTTGGGCTCCGATGGCAGCAGATGGCAGGTAGGGGTCCAGCTGCTGG-AGCGAGGGCCGGCCACGTATCACAG-CCAAGGAGATGAGCACAAG--CACTACTTACTGGCCTAGGTTGTCAGAGAAGTTGATGCTCTCACTCATCTTTCCTCCAATC
gtcctgagtttgccaaggcccagctctgcttctgacttgtcctgtg-----agacaaagtgcctaacgtctttgggccagtttcctcatccccacagtggggctgcaca-cctgcccgtgtcttacaggatggccgtgatgt------tca-----CCATTTTCTTAT-AGTACCCACTGCCAGATACACGGACAGACCAAAGCTCCCAGAGCTCA-TGGTGAACAT-GTGGCTGTGGAGAGGGCTGGGGACTGTTGCAGGGGCAAGTGAGCCAAGAGGGCACTGG-CGACTGGGCCTGGAGCCTCCCGACTTGGCCCCCGCACCCTCCCACCTGCAGCTTTCCTCTTGCAGTGAGAGCCACAAGCCAAGGTGGAGACCTGATGTCAGATCTCTTCAACAAGCTGGTCATGAGGCGCAAAGGTAGGAGGCAGAGGGGCCGCC--TTCAGGGCAGGGGCCTCAGGGTGTCCTGCAGTGTACTTCTGTTCTGCCTCTTTCTTCCTGTTTTTAAATCTTCAGGGGCTGTGTGACCTGGGGCCTCCATACACCCTCCCTCA--CAGCCTTTTCCCTTCCAGGTATCTCCGGGAAAGGACCTGGAACAGGGGCCAGCGAGGGGCCAGGAGGAGCCTTCGCCCGAATGTCAGACTCCATCCCGCCTCTGCCTCCCCCACAGCAGCCAC---CGGGAGAGGACGAGGATGACTGGGAATCCTAGGGGTCT-CAGCACTCCTTCCTCCCCCAACCCAGACTTGGGCTGTGGCCCTGAGACAGACACAGCTGGGACA--------------GCCCCCTTGGTGAGACAGGGATGGTG-CAGGACTGCCCTACGTCTGTGCTGGGCCTTCTTCAGGGAGCGGGTAGGTTGCATGAAACCATAAGTGTGGGGTGGGAGGGGCTCGCTCTCCACCTGTGCCCCACCGTGTGCCTGCTCTACCCACCCCTTCAGCGTGTGCTCCTCTTCCCGAAAGAGACT--CGAAGAAAACAGCACCATGAATCAATAAAGGACGATGTAAGAACTGAGCATAAACCAACAGTGCACTTTTAATTAAGGAGTCAAGGCTGGGTGGCTTGCAAACATCTGAGAACCAGTGACTG--TCCTGCCCC-GTGGGTCTCCAGGCAAT-GGGGCAGAACATCTGAGTGGACCAGGGCCCCTTGCACTGGCTCGAAGGTGCAGTCAGCAGGGGCAGCTGCTGTGGATGGGAGGGAGGGAGGGAGATGTTCCCACGGGATAAAGATGTCTCAGTGACAGACATGGGGTCTCTAAAAATAGTTGTGCTGAGAGCCTAATGGCCCTTGGCATAATTGCTGATGTCAGGGTAGAAGGTGTCTTGGAGTTTGCTCAAGTGCCTGAGAGGGAAGGAGGTGCCATCAACTTGGAGGAGGAATGGGAGCCAAGCCAGAGAGA-AAAGCCCTGCGCGGAGCTGTGGAGCAGACCA--GAGCACAGCTG-----------------------------------AGGCTGGCAGG-AGGAGCCGTGTGGACAGCAGAACTAGAAATGGGGAACGTTTTGAGT------------GTGAAATGTCTAGAACAGCTCATTTTAGCTAGGATGAACAGAGGCAG-----GATGGGCCTGTTTCCATCGGACCTCTGAGAAGGTGGCTACTGAGAAAACATGCAGGACAGAAG-----CTGCAGCAGAACACCGGGCAGGAGCCTGGCGCGGCCAGTGTGGCCACACTAAACAGGGAGGAAGATGCAATGG------CAGGGAGCAGCTGCCCTGCAGTGGGCTCAAGGGCAGTCAGGACCCACTGTTTACTCAGGATCAACCTAGTTTTCTCCAACTGGCTTTTCTACCTGGGCCTGCATGCGGGCAGCCCACTGATGCTGGAAGGGGGCTGGTCTGGACCTCACACTCTACACCTGGTTTCACCTTCTTAGGCAGGATCCCTGTAGACCAGGCCCAAGACAAACACCATTCTAAGTGGGC
AGGGTAAAGGAAGAGC------CCGGGC--TGGTGCAGCCATCCATCAGGAACGGCCAAACTTCTCCCGATGAAACTGGGGAGATGGGAAAAGGCAGTGAGAGACTAGATCTCAGGGTGA-GCAGGCTCGGGGGGGAAGGAAAAGGCAGGACTGACCTTACGCATAGCAGCAACAGCATGGCCAAAGGTGGCCTTGACCTCCACACGGTCTCGGATCCAGCCTGGCAGCTTTGCCAGGATGGGTGGGCGGGCATATCGCTGGTCTAGGAGCACTATGCTGGCAAAATCCCTCTGGTGCCTGATGGCTCTGCCTGGATGGGAACAGAATTTGGGGCTCCTAGGTAAA-------------------ATCCTCTCCTGTGACTTCATTCTC-------------------CAACCACCCAT--CTGTACTCC----------CAACTATCCATCCTGACAGCCAGGAGCAGTCCCAGGCTTACCTATAGATTGGTTGACAGCCTTCATACACAGATTCTCCACCAAGGCCTTCCCTGGTGGGGGCTGGCCTGGGGTTCTGGGCTGGGAAGGGTAGAAAGGACCTATCAGAACTGTTCCTTACCTCCTGTCTAGTGTTCTAGCTCTCCCTGGGAGAAGAGCCTGCCAGGCTTTGGAGCAAGACCAGGCAGCTTCACAAGCCAGTGCCAGCAGCTGG------CACGATGTCATGGAGAAGGTCAAGAGGGGGACAGGAAACACC--AGCATGGCAAGGAAGTCACAGCTACAAGACCCTGCTATCTCAG------CCTAGGGAATACACCACACTTCCCCCCGGCC--CTCTCCTCAT-CCTCTGGAATCCTGGAGGTACTCACAAGGGTCTGATCCAAGTAGGTCATCTTCTCTTGTAGTTCTGGAGAGTTGATGTTGGGGTAGGGCATGCCCACCATCACTACACACCTAGGTGGAGATGCACGCCGATGGGCATGTGGCCTCACACTCACTGAGTCCTCACCCACATGCCACCGACTGCT--GCTCTACCTCTGCTGCCG--CTCTTGGCTATGCTCGGCAGCTCTACCCTCCGCATC-CCGTACCTACCACCTGGAAAGGATTTTTTCAGCTAAGAGACCCAGTCTAAGCCAATGAACATAGTCCTGATAAGGTTATGGTTTGCCCCATTTTCCATCTGCTCT-CAAAGGCCCAATCCAGAGTTGCTGAAACACTTCCCGCCTGGCTGCCTGATCCTGAGCAGC--CAGCCTGGGTGCAGACTCAGATGGCATCAGATTGCAGGT-GGGGCCCAGCTGCTGGAAGTGAGGAGTAGCCAGGTGTCATAGCCCCAGGAGAGAAGGAGAGGACCACTACTTACCGGCCTAGGTTGTCAGAGAAGTTAATCCCTTCACTCATCTTTCCTCCAACC
2057523 chrY 57214466 57215088 NW_007726181v1 1023265 1023919 + 29358
GGCCCATCCCACTCTAGGCATGGCTCCTCTCCACAGGAAAACTCCACTCCAGTGCTCAGCTTGCACCCTGGCACAGGCCAGCAGTTGCT---GGAAGTCAGACACCTGCAGATCAAGACCACAGCATCAAGACCCTGTGACCTCTCAAAGGCCTGGTGGAAAGGA--------------CACGG-----------GAAGTCTGGGCTAAGAGACAGCAAATACACATGAACAGAAAGAAGAGGTCAAAGAAAA--GGCTGACGGCAAGTTAACAAAAAGAAA--AATGGTGAATGATACCCGGTGCTGGCAATCTCGTTTAAACTACATGCAGGAACAGCAAAGGAAATCCGGCAAATTT-GCGcagtcattctcaacaccggccatgcagcaaaatcatcagtggaaatttaaaaaaatacacgtggccaggccccagcccaaatcact-aataagaatctccaggg-CTtcacctgttagactggcaaaaaatccaaaag--taaacactttgtggagaaacaggcactcctagacattgctggtgggatacagaacagtacaattctga------------tggtaatcagttaacaaattaaacatatttattttatacttttaaacccaggaatcccatatttaggagtctactgagaccaaacagc
GGCTCGCTCCGCCCGGGTCACA-CTCCTCACCGCAGGAGAACTCCACCAC-TCGCTCAGCCTCAGCCCCAGCGCACGCCAGCAGCTGCTCCCGGAAGTCAGACACCTGCAGA------CCACAATGGCAGGGCCCTGTGACCTCCCAGAGGCACAGGGGAGAAGAACCTCAGGCCTCGGCATGGAGGGCAAGACAGAAGTCTGGGCTGGAAGGCAGCAAGTACGTACAAACAGAAAAAAGAGCTAAAAAAAAAAAGGCTAACAACAAATTAACAATAATAAATAAATTGTTAATGATATCCAGTGTTGGCAGTTTCATTTAAGCTACTGGTAGAAACAGCAAAGGAAATCTGGCAAACTTGGCAcagtgattctcaaccctggctatgcatcaaaaccagcagtgggaatttaaaaaaatacACATGGCCCAGCCACAGTCCAAACTACTGAATAACAATCTCCAGGGttttcacctaccaaattggc---aaatccgaaagtttaaccactctgtggagaaaaaggcatttttaaacattgctggtgcaatacagaatagtacaactcttacataggggaatttgacaat-acttaacaaattaaatgga-----tttttactttttaactcaggaatctcatatctgggactccacccagaatacacagc
2057524 chrX 68 170 NW_007727164v1 23685 23787 - 4125
TCCAGACTACCAGACACAAGACATTACACATTGTAATGCATTAAATGCATAGTTTTAACAGTAATAATTTAAAAGAGATTTAGAATTTTATAATGTTTGGAAA
TCAAGGCCCCGGGCTACCTGACATTACCCTCATTAATGCATGAAATGCAGAATATTAACATGAGCAATTTAAGATGAACTTAAGATTCTGTAATGTTTAGAAA
Answer 1
I suggest using grep with the -A switch, which tells it to also print the lines that follow each match. Something like this:
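#!/bin/bash
file=$1
# 22 autosomes plus X and Y give the 24 chromosomes
for i in $(seq 1 22) X Y; do
    # Match chr$i in the second column only; -A2 adds the 2 sequence lines.
    # --no-group-separator (GNU grep) drops the "--" lines between matches.
    grep -E -A2 --no-group-separator "^[^[:space:]]+[[:space:]]+chr${i}[[:space:]]" "$file" > "seq_$i"
done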
Then run:
./extract.sh myfile
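One thing to watch with a plain pattern search: the chromosome name can also appear in the fifth column, as in the chrY/chrX example header in the question. The sketch below compares an unanchored grep with one restricted to the second whitespace-separated column. The file name demo_cols.txt and its contents are made up for illustration only.

```shell
#!/bin/bash
# Illustrative data: the second record mentions chrX only in column 5.
printf '%s\n' \
  '1 chrX 68 170 NW_1 23685 23787 - 4125' 'AAAA' 'CCCC' \
  '2 chrY 68 170 chrX 23685 23787 - 4125' 'GGGG' 'TTTT' > demo_cols.txt

# Unanchored: counts both headers, because "chrX " also occurs in column 5.
loose=$(grep -c 'chrX ' demo_cols.txt)

# Anchored to the second field: counts only the header whose column 2 is chrX.
strict=$(grep -cE '^[^[:space:]]+[[:space:]]+chrX[[:space:]]' demo_cols.txt)

echo "unanchored=$loose anchored=$strict"
```

With the unanchored pattern, the chrY record would end up in the chrX output file as well, so anchoring (or comparing the second field in awk) is the safer choice when the other genome uses the same chromosome names.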
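Answer 2
$2 refers to the second column.
"chr19" means we only look for records whose second column is "chr19".
{c=3}c-->0 prints the matching header line plus the next 2 lines: a match sets the counter c to 3, and each line is printed while c is still positive, with c decremented after every check.
Command:
awk '$2=="chr19"{c=3}c-->0' file > chr19_file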
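If all chromosomes need to be split, the same idea can probably be done in a single pass instead of one run per chromosome. The sketch below is only an assumption-laden variant, not part of the answer above: it assumes header lines have several whitespace-separated fields while sequence lines have exactly one, and the names demo.txt and the seq_ prefix are made up for illustration.

```shell
#!/bin/bash
# Small illustrative input: two chr10 records and one chrY record.
cat > demo.txt <<'EOF'
1 chr10 5 16 NW_1 10 21 + 575
CTGGGCACTATG
CTGAGCGCGGTG
2 chrY 68 170 chrX 23685 23787 - 4125
TCCAGACTACC
TCAAGGCCCCG
3 chr10 20 25 NW_1 30 35 + 12
ACGTA
ACGTT
EOF

awk 'NF > 1 { out = "seq_" $2 }   # header line: pick the output file from column 2
     out    { print > out }       # write the header and its sequence lines there
' demo.txt
```

After this runs, seq_chr10 should hold both chr10 records and seq_chrY the single chrY record. Because the output file only changes when a new header is seen, this does not depend on every header being followed by exactly 2 lines.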