Split a file into separate files based on a keyword in a specific column of each header line


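I want to create separate files based on the chromosome named first in each header line. The header lines mention 24 chromosomes, and the 2 lines after each header hold the corresponding sequences. The file layout is:
Header
Human sequence
Other genome sequence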

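However, all chromosome sequences are merged into a single file, and I want to split it into separate per-chromosome files, each with its own sequence pairs. I wrote a Python script for this, but uploading the large merged file to the cluster takes a long time and often ends in connection errors. So I would like to do it with a Bash script instead.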

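The idea is to search the second column of the header line for "chrY" (or any other chr name) and then write that header line, together with the 2 sequence lines that follow it, to a separate file.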

2057524 chrY 68 170 chrX 23685 23787 - 4125
TCCAGACTACCAGACACAAGACATTACACATTGTAATGCATTAAATGCATAGTTTTAACAGTAATAATTTAAAAGAGATTTAGAATTTTATAATGTTTGGAAA
TCAAGGCCCCGGGCTACCTGACATTACCCTCATTAATGCATGAAATGCAGAATATTAACATGAGCAATTTAAGATGAACTTAAGATTCTGTAATGTTTAGAAA

Details:

2057524 chrY (human chromosome) 68 170 chrX (other genome chromosome) 23685 23787 - 4125 -> header line
TCCAGACTACCAGACACAAGACATTACACATTGTAATGCATTAAATGCATAGTTTTAACAGTAATAATTTAAAAGAGATTTAGAATTTTATAATGTTTGGAAA (human sequence)
TCAAGGCCCCGGGCTACCTGACATTACCCTCATTAATGCATGAAATGCAGAATATTAACATGAGCAATTTAAGATGAACTTAAGATTCTGTAATGTTTAGAAA (other genome sequence)

For testing:

2057521 chr10 57211219 57211230 NW_007726181v1 1018288 1018299 + 575
CTGGGCACTATG
CTGAGCGCGGTG

2057522 chr2 57211231 57214400 NW_007726181v1 1018406 1021615 + 116172
GTTTtgagcttgt----acccagcgctgcttttgccttgctctgtgaccccaggcaagctgcctcacctctctgggccagtttccccat-cgtacagtggTGCTGCACACCCTGGCCCTGGCCC-CGAGGTGGCTGGGAGGTGGCTCCTCAAACAGCCGCTTTCTCATCAGTGCCCGGTGCTGGGT-CAGGGATCGACTGAGGCTCT--GAGCTAACTAGGAAACACAGTGGCCTTG--GAGGGCTGGGGAGTGTCATGGGGGTG---GGGACAGGGAGCCACCGGTCGCATGTGACTGAACTCTT-----------------CACCCCAGTCTGTGGCTTTCCCGTTGCAGTGAGAGCCACGAGCCAAGGTGGGCACTTGATGTCGGATCTCTTCAACAAGCTGGTCATGAGGCGCAAGGGTAGGAGGCAGGGCCGCTGCCCGCCCTGGGTCGGCACCT---------------TGTAATTCTGTCCTGCCTTTTTCTTCCTGTATTTAAGTCTCCGGGGGCTGGGGGAATCAGGGTTTCCCACCAACCACCCTCACTCAGCCTTTTCCC-TCCAGGCATCTCTGGGAAAGGACCT------GGGGCTGGTGAGGGGCCCGGAGGAGCCTTTGCCCGCGTGTCAGACTCCATCCCTCCTGTGCCCCCACCGCAACAGCCACAGGCAGAGGAGGACGAGGACGACTGGGAATCCTAGGGGGCTCCATGACACCTTCCCCCCCAGACCCAGACTTGGGCTGTTGCTCTGACATGGACACAGCCAGGACAAGCTGCTCAGACCTGCTTCCCTGG-GAGGGGGTGACGGAACCAGCACTG---------TGTG-GAGACCAGCTTCAAGGAGCGGAAGGCTGGCTTGAGGCCACACAGCTGGGGCG---GGGACTT-CTGTCTGCCTGTGCTCCATGGGGGGACGGCTCCACC--------CAGCCTGCGCCACTGTGTTCTTAAGAGGCTTCCAGAGAAAACGGCA-CACCAATCAATA-----------AAGAACTGAGCAGAAACCAACAGTGTGCTTTTAATAAAGGACCTCTAGCTGTGCAGGATGCAAACGTCTCGGGGTCAGTGACTGCCTCCTGCCCCTGTTGGTCCCTAGGCAGTGGGGGCAGAAGCTCCCAGCTGACCTG------TTTCTCTGGGAGAGAAGGGCAGTCAGCAGGGGCAGCTGTTGCAGATGGGAGGAATAG--------TCTCCCACA----AAAAAGGTTTCAGTGACAGACACGGGGTCTCTAAAAATAGTCATGCTGAGAGCCCAATGGCCCTTGGCACAATTGCTGGTGTTGGGGTAGAAGATGTCTTGGAGTTTGCTCAAGTGGTTGAGAGGGAGGGAGGTGCCATCAACTT---GGAGGAACTGGCACCAAGCCAGGGAGATAGAAATCCAGGCAAGGCTGTGGGGCAGGTTAGGGAGCAAGGCTGCAGGGGTGACTCAGGAAGAAGGTGGGGGAGGTGACAAGCCCCCAGGCAGGGGCCCTGTGGCC-------------ATGGGGATCTTTTTAAATTGAGACTAGGGGGTGAATAGTCCAGGGCAGCTAACTTTAGTTATTATAGAAAG-GGCAGTAGCAGATGGGTCTG-CTCCGTCTCGCTTCTAAGAAGGTGG---------------GCAGGACAAATGGCAGCCTCCTGCAGAGGCCCAGTGAGAAGCCTGGCCC-------TCGGCCAC-----ACAGGATGGAAGACAGATTGGATTCCACAGAGGGGAGCTGCCCTGGGAAGATCTCACGGATGGCCAGGACCCACCATTTCTTCGGGGTTCCCCT-GTTTTCTCCAACGGGCACTAATGCCTGTGCCTGGGTCCTGGCAACAC----------------------TCTGGACTCCACACTCT--TCTGGGTTTCACCTTTGTA-GCAGGATCCCTGCAGATCAGGCCCATGACAAACACCGTCTCCAGCGGGC
AGAGCAAAGGAAGGGCGCAGCGCCAGGCAGTGGTGCAGCTGCCTGTCAGGAAGAGGCCTACTTCT---GGTGAAACTGGGCAGAC---AAAAGGCAGTGAGAAATGTGATCTCGGGGTGGTGGAGGCTC-TAGGGAAAGGAAAAGGCAGGAGTGAACTTCCACACAGCAGCAATGGCAGAACCAAAGGTGGCTTTGACCTCCACGAGGGCTCAGATCCAGGCCAACAGCTTGTCCAGGACAGGGTGCCGGGTGTATCACTAATCCAGGAGCACTATGCTGGCAGAATCCCTTTGGTGCCTGATGGCCCTGCCTTCGTGGGAACAGAGGCTAAGGCTTTGAGTTACAGCTGCCTCCCCAACAGTGCATCCCCTTCTCCTTCCTCAGCCTCAGGTAGGAGACAGGGCAGGCAACCCCCCTTTCCTCTTCTCCCCTTCTCCAGCCCCTGTCTGTCCACCCAGCTGGAGGCAG--CCAGGCTTGCCTATGGACTGGTTGACAGCCTTCATGCACAGGTTCTCCACCAGAGCCTTTCTTGGGGGCCCCTGGCT--GGGCTCTGAGCTGGGAGTGAAGGGGATGACCCATGCGGACTGTTTGCTGC-------------TTGTAGCTTTCCCTGGGA-AAGACTCTGCCAGGCCTTGGAGCCAGACCAGGAGGCTTTATAGGCCACTGCAAGCAGCAGGGCTCCAGATGACATCACAGGGAATATCAAGAGGGTGTGGAGGGGCATCGAAGCCTCTCCAGGAG---ACAG----GAGAC---GCCGGCCCAGTAGAGCCCTAGGGGCGACGCCACTCCCACTCACTGTCTACTCTCCTCTCACCTCTGCAACACTGGGGACACTCACAAGATTGTGATCCAAGTCGGCCGTCGTCTTCTGCAGCTCTGGAGACCTGATGCTGGGGAAGGGCATGCCTGGCATCACCACACACCTGGGAGGAGACAGGAGCCTG-GGGCCGGTGG---------------------GCCCACACATCACCAGCTGCTCCGTTCTACCATTTCTTCAGCCCTCTTGGCTGTGC-CTGCGGCTCTGCCCCTCCCCTCTCTGCACCTACCACCCAGAGAGGGCTTGTTGAGCTCAGAGATCCCACCTAGGCCAATCCACTGGGTTCTGTGGCAGCGATGGCCTGCCTGATCTTCCACCTGCTCTCCCAGGGCCAAAGCCAGACCTGCTGAGCCCCTCCC--TCCAGCCGGCTGGT-CTGAGCAGTCACAGCCCGGCTTTGGGCTCCGATGGCAGCAGATGGCAGGTAGGGGTCCAGCTGCTGG-AGCGAGGGCCGGCCACGTATCACAG-CCAAGGAGATGAGCACAAG--CACTACTTACTGGCCTAGGTTGTCAGAGAAGTTGATGCTCTCACTCATCTTTCCTCCAATC
gtcctgagtttgccaaggcccagctctgcttctgacttgtcctgtg-----agacaaagtgcctaacgtctttgggccagtttcctcatccccacagtggggctgcaca-cctgcccgtgtcttacaggatggccgtgatgt------tca-----CCATTTTCTTAT-AGTACCCACTGCCAGATACACGGACAGACCAAAGCTCCCAGAGCTCA-TGGTGAACAT-GTGGCTGTGGAGAGGGCTGGGGACTGTTGCAGGGGCAAGTGAGCCAAGAGGGCACTGG-CGACTGGGCCTGGAGCCTCCCGACTTGGCCCCCGCACCCTCCCACCTGCAGCTTTCCTCTTGCAGTGAGAGCCACAAGCCAAGGTGGAGACCTGATGTCAGATCTCTTCAACAAGCTGGTCATGAGGCGCAAAGGTAGGAGGCAGAGGGGCCGCC--TTCAGGGCAGGGGCCTCAGGGTGTCCTGCAGTGTACTTCTGTTCTGCCTCTTTCTTCCTGTTTTTAAATCTTCAGGGGCTGTGTGACCTGGGGCCTCCATACACCCTCCCTCA--CAGCCTTTTCCCTTCCAGGTATCTCCGGGAAAGGACCTGGAACAGGGGCCAGCGAGGGGCCAGGAGGAGCCTTCGCCCGAATGTCAGACTCCATCCCGCCTCTGCCTCCCCCACAGCAGCCAC---CGGGAGAGGACGAGGATGACTGGGAATCCTAGGGGTCT-CAGCACTCCTTCCTCCCCCAACCCAGACTTGGGCTGTGGCCCTGAGACAGACACAGCTGGGACA--------------GCCCCCTTGGTGAGACAGGGATGGTG-CAGGACTGCCCTACGTCTGTGCTGGGCCTTCTTCAGGGAGCGGGTAGGTTGCATGAAACCATAAGTGTGGGGTGGGAGGGGCTCGCTCTCCACCTGTGCCCCACCGTGTGCCTGCTCTACCCACCCCTTCAGCGTGTGCTCCTCTTCCCGAAAGAGACT--CGAAGAAAACAGCACCATGAATCAATAAAGGACGATGTAAGAACTGAGCATAAACCAACAGTGCACTTTTAATTAAGGAGTCAAGGCTGGGTGGCTTGCAAACATCTGAGAACCAGTGACTG--TCCTGCCCC-GTGGGTCTCCAGGCAAT-GGGGCAGAACATCTGAGTGGACCAGGGCCCCTTGCACTGGCTCGAAGGTGCAGTCAGCAGGGGCAGCTGCTGTGGATGGGAGGGAGGGAGGGAGATGTTCCCACGGGATAAAGATGTCTCAGTGACAGACATGGGGTCTCTAAAAATAGTTGTGCTGAGAGCCTAATGGCCCTTGGCATAATTGCTGATGTCAGGGTAGAAGGTGTCTTGGAGTTTGCTCAAGTGCCTGAGAGGGAAGGAGGTGCCATCAACTTGGAGGAGGAATGGGAGCCAAGCCAGAGAGA-AAAGCCCTGCGCGGAGCTGTGGAGCAGACCA--GAGCACAGCTG-----------------------------------AGGCTGGCAGG-AGGAGCCGTGTGGACAGCAGAACTAGAAATGGGGAACGTTTTGAGT------------GTGAAATGTCTAGAACAGCTCATTTTAGCTAGGATGAACAGAGGCAG-----GATGGGCCTGTTTCCATCGGACCTCTGAGAAGGTGGCTACTGAGAAAACATGCAGGACAGAAG-----CTGCAGCAGAACACCGGGCAGGAGCCTGGCGCGGCCAGTGTGGCCACACTAAACAGGGAGGAAGATGCAATGG------CAGGGAGCAGCTGCCCTGCAGTGGGCTCAAGGGCAGTCAGGACCCACTGTTTACTCAGGATCAACCTAGTTTTCTCCAACTGGCTTTTCTACCTGGGCCTGCATGCGGGCAGCCCACTGATGCTGGAAGGGGGCTGGTCTGGACCTCACACTCTACACCTGGTTTCACCTTCTTAGGCAGGATCCCTGTAGACCAGGCCCAAGACAAACACCATTCTAAGTGGGC
AGGGTAAAGGAAGAGC------CCGGGC--TGGTGCAGCCATCCATCAGGAACGGCCAAACTTCTCCCGATGAAACTGGGGAGATGGGAAAAGGCAGTGAGAGACTAGATCTCAGGGTGA-GCAGGCTCGGGGGGGAAGGAAAAGGCAGGACTGACCTTACGCATAGCAGCAACAGCATGGCCAAAGGTGGCCTTGACCTCCACACGGTCTCGGATCCAGCCTGGCAGCTTTGCCAGGATGGGTGGGCGGGCATATCGCTGGTCTAGGAGCACTATGCTGGCAAAATCCCTCTGGTGCCTGATGGCTCTGCCTGGATGGGAACAGAATTTGGGGCTCCTAGGTAAA-------------------ATCCTCTCCTGTGACTTCATTCTC-------------------CAACCACCCAT--CTGTACTCC----------CAACTATCCATCCTGACAGCCAGGAGCAGTCCCAGGCTTACCTATAGATTGGTTGACAGCCTTCATACACAGATTCTCCACCAAGGCCTTCCCTGGTGGGGGCTGGCCTGGGGTTCTGGGCTGGGAAGGGTAGAAAGGACCTATCAGAACTGTTCCTTACCTCCTGTCTAGTGTTCTAGCTCTCCCTGGGAGAAGAGCCTGCCAGGCTTTGGAGCAAGACCAGGCAGCTTCACAAGCCAGTGCCAGCAGCTGG------CACGATGTCATGGAGAAGGTCAAGAGGGGGACAGGAAACACC--AGCATGGCAAGGAAGTCACAGCTACAAGACCCTGCTATCTCAG------CCTAGGGAATACACCACACTTCCCCCCGGCC--CTCTCCTCAT-CCTCTGGAATCCTGGAGGTACTCACAAGGGTCTGATCCAAGTAGGTCATCTTCTCTTGTAGTTCTGGAGAGTTGATGTTGGGGTAGGGCATGCCCACCATCACTACACACCTAGGTGGAGATGCACGCCGATGGGCATGTGGCCTCACACTCACTGAGTCCTCACCCACATGCCACCGACTGCT--GCTCTACCTCTGCTGCCG--CTCTTGGCTATGCTCGGCAGCTCTACCCTCCGCATC-CCGTACCTACCACCTGGAAAGGATTTTTTCAGCTAAGAGACCCAGTCTAAGCCAATGAACATAGTCCTGATAAGGTTATGGTTTGCCCCATTTTCCATCTGCTCT-CAAAGGCCCAATCCAGAGTTGCTGAAACACTTCCCGCCTGGCTGCCTGATCCTGAGCAGC--CAGCCTGGGTGCAGACTCAGATGGCATCAGATTGCAGGT-GGGGCCCAGCTGCTGGAAGTGAGGAGTAGCCAGGTGTCATAGCCCCAGGAGAGAAGGAGAGGACCACTACTTACCGGCCTAGGTTGTCAGAGAAGTTAATCCCTTCACTCATCTTTCCTCCAACC

2057523 chrY 57214466 57215088 NW_007726181v1 1023265 1023919 + 29358
GGCCCATCCCACTCTAGGCATGGCTCCTCTCCACAGGAAAACTCCACTCCAGTGCTCAGCTTGCACCCTGGCACAGGCCAGCAGTTGCT---GGAAGTCAGACACCTGCAGATCAAGACCACAGCATCAAGACCCTGTGACCTCTCAAAGGCCTGGTGGAAAGGA--------------CACGG-----------GAAGTCTGGGCTAAGAGACAGCAAATACACATGAACAGAAAGAAGAGGTCAAAGAAAA--GGCTGACGGCAAGTTAACAAAAAGAAA--AATGGTGAATGATACCCGGTGCTGGCAATCTCGTTTAAACTACATGCAGGAACAGCAAAGGAAATCCGGCAAATTT-GCGcagtcattctcaacaccggccatgcagcaaaatcatcagtggaaatttaaaaaaatacacgtggccaggccccagcccaaatcact-aataagaatctccaggg-CTtcacctgttagactggcaaaaaatccaaaag--taaacactttgtggagaaacaggcactcctagacattgctggtgggatacagaacagtacaattctga------------tggtaatcagttaacaaattaaacatatttattttatacttttaaacccaggaatcccatatttaggagtctactgagaccaaacagc
GGCTCGCTCCGCCCGGGTCACA-CTCCTCACCGCAGGAGAACTCCACCAC-TCGCTCAGCCTCAGCCCCAGCGCACGCCAGCAGCTGCTCCCGGAAGTCAGACACCTGCAGA------CCACAATGGCAGGGCCCTGTGACCTCCCAGAGGCACAGGGGAGAAGAACCTCAGGCCTCGGCATGGAGGGCAAGACAGAAGTCTGGGCTGGAAGGCAGCAAGTACGTACAAACAGAAAAAAGAGCTAAAAAAAAAAAGGCTAACAACAAATTAACAATAATAAATAAATTGTTAATGATATCCAGTGTTGGCAGTTTCATTTAAGCTACTGGTAGAAACAGCAAAGGAAATCTGGCAAACTTGGCAcagtgattctcaaccctggctatgcatcaaaaccagcagtgggaatttaaaaaaatacACATGGCCCAGCCACAGTCCAAACTACTGAATAACAATCTCCAGGGttttcacctaccaaattggc---aaatccgaaagtttaaccactctgtggagaaaaaggcatttttaaacattgctggtgcaatacagaatagtacaactcttacataggggaatttgacaat-acttaacaaattaaatgga-----tttttactttttaactcaggaatctcatatctgggactccacccagaatacacagc

2057524 chrX 68 170 NW_007727164v1 23685 23787 - 4125
TCCAGACTACCAGACACAAGACATTACACATTGTAATGCATTAAATGCATAGTTTTAACAGTAATAATTTAAAAGAGATTTAGAATTTTATAATGTTTGGAAA
TCAAGGCCCCGGGCTACCTGACATTACCCTCATTAATGCATGAAATGCAGAATATTAACATGAGCAATTTAAGATGAACTTAAGATTCTGTAATGTTTAGAAA

Answer 1

I suggest using grep with the -A switch, which tells it to also print the lines that follow a match. Something like this:

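#!/bin/bash
file=$1

# 22 autosomes; together with X and Y this covers the 24 chromosomes in the question
for i in $(seq 1 22); do
  grep -A2 "chr$i " "$file" > "seq_$i"
done

grep -A2 "chrX " "$file" > seq_X
# write chrY to its own file instead of overwriting seq_X
grep -A2 "chrY " "$file" > seq_Y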

Then run:

./extract.sh myfile
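
One caveat with a plain pattern such as "chrX ": it matches wherever the name occurs on the line, so a header like the first example in the question (chrY in column 2, chrX in column 5) would also end up in the chrX file. In addition, GNU grep prints a "--" line between non-adjacent groups of context. The following is only a sketch of one way around both, assuming GNU grep and space-separated columns; the sample file grep_sample.txt and the variable chrom are made up for illustration:

```shell
#!/bin/bash
# Small sample in the question's layout: header, human sequence, other sequence.
cat > grep_sample.txt <<'EOF'
2057524 chrY 68 170 chrX 23685 23787 - 4125
TCCAGACT
TCAAGGCC

2057525 chrX 68 170 NW_007727164v1 23685 23787 - 4125
GGGGAAAA
GGGGTTTT

2057526 chrX 200 207 NW_007727164v1 300 307 + 10
CCCCAAAA
CCCCTTTT
EOF

chrom=chrX   # hypothetical variable: the chromosome to extract

# "^[^ ]+ " consumes column 1, so the name has to sit in column 2.
# --no-group-separator (a GNU grep option) drops the "--" lines between groups.
grep -A2 --no-group-separator -E "^[^ ]+ ${chrom} " grep_sample.txt > "seq_${chrom}"
```

Here seq_chrX receives only the two blocks whose second column is chrX; the chrY header that merely mentions chrX in column 5 is skipped.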

Answer 2

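  • $2 refers to the second column
  • "chr19" means only records whose second column is exactly "chr19" are selected
  • {c=3}c-->0 prints the matching header line and the next 2 lines: c is set to 3 on a match, and every line is printed while the post-decremented counter is still above 0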
awk '$2=="chr19"{c=3}c-->0' file > chr19_file
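
The same counter idiom can be extended to split every chromosome in a single pass instead of running the command once per name. This is a minimal sketch, assuming headers are whitespace-separated and that only header lines have a second field starting with "chr"; the sample file sample.txt and the split_ output prefix are invented for the example:

```shell
#!/bin/sh
# Small sample in the question's layout: header, human sequence, other sequence.
cat > sample.txt <<'EOF'
1 chr10 1 12 NW_1 1 12 + 575
CTGGGCACTATG
CTGAGCGCGGTG

2 chrY 68 170 chrX 23685 23787 - 4125
TCCAGACT
TCAAGGCC

3 chr10 20 27 NW_1 20 27 + 10
AAAACCCC
AAAAGGGG
EOF

awk '
  # Header line: choose the output file from column 2 and arm the counter.
  $2 ~ /^chr/ { out = "split_" $2 ".txt"; c = 3 }
  # Print the header and the 2 lines after it to the chosen file.
  c-- > 0     { print > out }
' sample.txt
```

Within one awk run, "print > out" opens each file once and keeps appending to it, so both chr10 records land in split_chr10.txt. Since only column 2 is tested, the chrX in column 5 of the chrY header does not create a chrX file.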
