Removing reads from a fastq file


I want to delete four-line records from a fastq file. Normally the file looks like this (each read takes up four lines):

@M04241:303:000000000-BR896:1:1102:21438:12389 1:N:0:TATGGCAC
TGTCAGCCGCCGCGGTAATACGGAGGGTCCGAGCGTTATCCGGAATTATTGGGTTTAAAGGGTCCGCAGGCGGGCTTATAAGTCAGGGGTGGAATGGTGCGGCTCAACCGTAGCACTGCCCTTGATACTGTTAGTCTTGAGTTATGGTGGAGTGGCCGGAATATGTAGTGTAGCGGTGAAATGCATAGATATTACATAGAACACCGATCGCGAAGGCAGGTCACTAACCATTTGACTGACGCTGATGGACGAAAGCGTGGGGAGCGAACAGGATTAGATACCCTGGTAGTCCACGCCGGAAACGATGGATACTAGCTGTCGGGCACTTGTGCTCGGCGGCCAAGCGAAAGTGATAAGTATCCCACCTGGGGAGTACGTGCGCAAGAATGAAACTCAAATGAATTGACGG
+
EGGGGGGGGGGGGGGGGGGGGGGGDE@FFGEEEGGGGDGFEFGGGGGGGGGGGGGGGGGGGGGGGDGEFFGGGCGGFDF<DGGFGGGGGGGG7FFG?FDF:FGGGFCGGGGFEGGGF:>GGGG>F>DE@GG6@GGG@G9<EGGGG9FGGGGGG7FGGDDEFGGGGGGGGGGGGGGGGCEFGGGGFG?EFFCFGGGGGGFGG?GGGGGGGG=EGEGGGGGGGGGGGFGCGGFGGGGCFFF6CD7DDFFFFFED9:BFCBEE@DEF:@EGCFCF@FFFD?=A:CFEF0<C<A>FB>@6+C,@GFFGFDGGF<AFEFB+FEECGFF9FDFAC6@+:@FC:GFC,CFC,EFGE,9FFCGFF<@;6:,FD,D:FGGFFGF7@8+7,,CF<<6CF<CC-CA@<GEGFE@6@A,CB
@M04241:303:000000000-BR896:1:1103:11464:7575 1:N:0:TATGGCAC
GTCAATTTCTTTGCGTTTCAATCTTGCGATCGTACTCCCCAGGTGGGATACTTATCACTTTCGCTTAGTCACTGAGATAAATCCCAACAACTAGTGTCCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTCGCTCCCCACGCTTTCGTCCATCAGCGTCAGTATATGGTTAGTGACCTGCCTTCGCGATCGGTGTTCTATGTAATATCTATGCATTTCACCGCTACACTACATATTCCGGCCACTCCACCATAACTCAAGACTAACAGTATCAAAGGCAGTGCTACGGTTGAGCCGCACCATTTCACCCCTGACTTATCAGCCCGCCTGCGGACCCTTTAAACCCAATAATTCCGGATAACGCTCGGACCCTCCGTATTACCGCGGCTGCTGGC
+
CCCCCGGGGGGGG-FCFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGFFGGDFFGFGFGGGGGGGGGGGGGGGGGGGGGGGGGEGGEGGGGDGGG4FFGGGGGGGGGGGGGGGGGGGGGEGGGGGGFGGGFFGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFFFGFFFGFGGGGGGGGGGGGGGGGGGGFGGFFGGGGGGGGGGGGGGGGGGGCDGGGGGGGGFCFGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGFGGGGGCGEFFGGEGGGGGGGGGGGGGGGGGDGGGGFFCGGGGGGGGGGGGFGGGDGGGGGGGGGGGGFGGGGGGGGGGGGGGGGG
@M04241:303:000000000-BR896:1:1103:23291:21403 1:N:0:TATGGCAC
CTGCGGCACCGCAGGGCAAGCCCCCCGACGCCTAGCCCACATCGTTTAGGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCGCCTCAGCGTCAGTGCCGGACCAGAGAGCCGCTTTCGCCACCGGTGTTCCACCCAATATCTACGAATTTCACCTCTACACTGGGTATTCCACCCTCCTCTTCCGGACTCGAGCACCGCAGTCTCGGCTGCACCTCCGGGGTTGAGCCCCGGGCTTTCACAGCCGACTTGCGACGCCGCCTACGCGCCCTTTACGCCCAGTGATTCCGAACAACGCTAGCACCCTCCGTCTTACCGCGGCGGCTGAC
+
CCCCCGGGGGG>FGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG@@FGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

But I found a read where two of its four lines are empty, like this:

@M04241:303:000000000-BR896:1:1103:11464:7575 1:N:0:TATGGCAC

+

@M04241:303:000000000-BR896:1:1103:23291:21403 1:N:0:TATGGCAC
CTGCGGCACCGCAGGGCAAGCCCCCCGACGCCTAGCCCACATCGTTTAGGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCGCCTCAGCGTCAGTGCCGGACCAGAGAGCCGCTTTCGCCACCGGTGTTCCACCCAATATCTACGAATTTCACCTCTACACTGGGTATTCCACCCTCCTCTTCCGGACTCGAGCACCGCAGTCTCGGCTGCACCTCCGGGGTTGAGCCCCGGGCTTTCACAGCCGACTTGCGACGCCGCCTACGCGCCCTTTACGCCCAGTGATTCCGAACAACGCTAGCACCCTCCGTCTTACCGCGGCGGCTGAC
+
CCCCCGGGGGG>FGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG@@FGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@M04241:303:000000000-BR896:1:1103:26180:21941 1:N:0:TATGGCAC
CCGCCAATTTCTTTGAGTTTCAGCCTTGCGACCATACTCCCCAGGCGGGGTACTTAACACTTTTGATTCGGCAGTGCACCCATGTTAGTCCACTACCTAGTACCCATCGTTTAGGGCTAGGACTACCGGGGTATCTAATCCCGTTCGCTACCCTAGCTTTCGCGCCTCAGCGTCAGAAGAGGTCCAGCACGTCGCTTTCGCCACCGGCGTTCCTTCCGATCTCTACGCATTTCACCGCTCCACCGGAAGTTCCACATGCCCCTACCTCCCTCGAGATTGGCAGTTTCGAAGGCAGTTCTACAGTTGAGCTGCAGGATTTCACCTCCGACTGACCTATCCGCCTACGCGCCCTTTAAGCCCAGTGATTCCGAACAACGTTCGC
+
CCCCCGEGGGGGGGGGGEGGGGGGGGGGDFGGGGGGGGGGGGGEGGGGGGEFGGGFFFFGGGGGG,CEFGGGGGGGGGG?GGGGGG9FFGGGGGGGCGGGGGGGGGCFGGGG@GGGGGFGGGGGGGGGCGGFGGGGGGGGGGGGGGGGGGGGGGGFFGGGGGGGGGDEGGGGGGGDGGGGFGFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGDGEFGGEEGGGGFGGGGGGGGGGGGGGGGGGGGGEF?GGGEGGEEFEFFDFFGFGGFGGGGGGFFFGFGGGGGGGGGFGGGGFCGGGGGGGGGFFGGGGGGGGGGGGGGGFF@7GGGGGGGGGGGGGGGFDFCGGGGFEFGGFGGGGGGGGFGFEGGGG
@M04241:303:000000000-BR896:1:1102:21438:12389 1:N:0:TATGGCAC
TGTCAGCCGCCGCGGTAATACGGAGGGTCCGAGCGTTATCCGGAATTATTGGGTTTAAAGGGTCCGCAGGCGGGCTTATAAGTCAGGGGTGGAATGGTGCGGCTCAACCGTAGCACTGCCCTTGATACTGTTAGTCTTGAGTTATGGTGGAGTGGCCGGAATATGTAGTGTAGCGGTGAAATGCATAGATATTACATAGAACACCGATCGCGAAGGCAGGTCACTAACCATTTGACTGACGCTGATGGACGAAAGCGTGGGGAGCGAACAGGATTAGATACCCTGGTAGTCCACGCCGGAAACGATGGATACTAGCTGTCGGGCACTTGTGCTCGGCGGCCAAGCGAAAGTGATAAGTATCCCACCTGGGGAGTACGTGCGCAAGAATGAAACTCAAATGAATTGACGG
+
EGGGGGGGGGGGGGGGGGGGGGGGDE@FFGEEEGGGGDGFEFGGGGGGGGGGGGGGGGGGGGGGGDGEFFGGGCGGFDF<DGGFGGGGGGGG7FFG?FDF:FGGGFCGGGGFEGGGF:>GGGG>F>DE@GG6@GGG@G9<EGGGG9FGGGGGG7FGGDDEFGGGGGGGGGGGGGGGGCEFGGGGFG?EFFCFGGGGGGFGG?GGGGGGGG=EGEGGGGGGGGGGGFGCGGFGGGGCFFF6CD7DDFFFFFED9:BFCBEE@DEF:@EGCFCF@FFFD?=A:CFEF0<C<A>FB>@6+C,@GFFGFDGGF<AFEFB+FEECGFF9FDFAC6@+:@FC:GFC,CFC,EFGE,9FFCGFF<@;6:,FD,D:FGGFFGF7@8+7,,CF<<6CF<CC-CA@<GEGFE@6@A,CB

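How can I detect these empty lines and remove the record from the fastq file? I know the line numbers, but the file is huge and I cannot open it normally, so I need a command that checks whether those two lines are empty and deletes all four lines belonging to that read.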

Thanks!!

Answer 1

sed 'N;N;N;/\n\n/d' file.fastq >new-file.fastq

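This reads in the four lines of a FastQ record and then checks whether the record contains two consecutive newlines, which is what an empty sequence line produces. If it does, the whole record is discarded; if not, it is printed. This is repeated for every record in the file. All printed records go into a new file (here new-file.fastq).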

The sed script, with comments:

         # (implicit: read a line)
N;       # read a second line, append it to the pattern space with embedded \n in-between
N;       # read a third line
N;       # read a fourth line
/\n\n/d  # if there are two consecutive newlines, delete and continue from top
         # (implicit: print)
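To try the command before running it on the real data, a small test along these lines may help. It is only a sketch: the read names (@read1 and so on) and the temporary directory are made up for illustration, and the sed command itself is the one from the answer.

```sh
#!/bin/sh
# Build a tiny FASTQ file (hypothetical read names) containing one empty record.
tmpdir=$(mktemp -d)
cat > "$tmpdir/file.fastq" <<'EOF'
@read1
ACGT
+
FFFF
@read2

+

@read3
TTGA
+
GGGG
EOF

# Same command as in the answer: join four lines, drop the record if it
# contains two consecutive newlines (i.e. an empty line inside the record).
sed 'N;N;N;/\n\n/d' "$tmpdir/file.fastq" > "$tmpdir/new-file.fastq"

# Count records before and after (each record is four lines).
before=$(( $(awk 'END { print NR }' "$tmpdir/file.fastq") / 4 ))
after=$(( $(awk 'END { print NR }' "$tmpdir/new-file.fastq") / 4 ))
echo "records before: $before, after: $after"
```

Comparing the two counts on the real file gives a quick check that only the expected number of records was dropped, without having to open the file.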

A comment from a colleague:

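Fastq records are often paired, and software tends to throw a fit when a mate is not found unless it has been told explicitly that pairs are missing. Several tools have a minimum-length option, Trimmomatic for example, that keeps the pairing intact and separates out orphaned records.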

This means that if the reads in the file are paired and one read of a pair is empty, deleting only the empty record will break the pairing.
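Whether two paired files are still in sync can be checked roughly as follows. This sketch assumes that both mates share the same read ID before the first space in the header, as in the Illumina-style headers shown in the question; the file names R1.fastq and R2.fastq and the read names are hypothetical.

```sh
#!/bin/sh
# Hypothetical paired files; R2 has lost the mate of read2.
tmpdir=$(mktemp -d)
printf '@read1 1:N:0\nACGT\n+\nFFFF\n@read2 1:N:0\nGGCC\n+\nFFFF\n' > "$tmpdir/R1.fastq"
printf '@read1 2:N:0\nTTAA\n+\nFFFF\n' > "$tmpdir/R2.fastq"

# Extract the read ID: first field of every header line (lines 1, 5, 9, ...).
awk 'NR % 4 == 1 { print $1 }' "$tmpdir/R1.fastq" > "$tmpdir/R1.ids"
awk 'NR % 4 == 1 { print $1 }' "$tmpdir/R2.fastq" > "$tmpdir/R2.ids"

# The files are in sync only if both ID lists are identical, line for line.
if cmp -s "$tmpdir/R1.ids" "$tmpdir/R2.ids"; then
    status="in sync"
else
    status="out of sync"
fi
echo "pairing: $status"
```

Headers in the older style, where mates end in /1 and /2, would need that suffix stripped before comparing.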

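Removing the mate of an empty read as well is considerably more complicated unless you use an existing bioinformatics tool. With the standard Unix toolbox, you would probably have to save the empty reads to a separate file, then use their FastQ headers to scan for and delete the corresponding mates.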
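One way the standard-toolbox approach could look is sketched below with awk. It is not a replacement for a dedicated tool: it assumes mates share the read ID before the first space of the header, it treats a record as empty when its sequence line (line 2 of the record) is empty, and all file and read names are made up for the example.

```sh
#!/bin/sh
tmpdir=$(mktemp -d)
# Hypothetical paired input: read2 is empty in R1 but present in R2.
printf '@read1 1:N:0\nACGT\n+\nFFFF\n@read2 1:N:0\n\n+\n\n@read3 1:N:0\nTTGA\n+\nGGGG\n' > "$tmpdir/R1.fastq"
printf '@read1 2:N:0\nTTAA\n+\nFFFF\n@read2 2:N:0\nCCGG\n+\nFFFF\n@read3 2:N:0\nGATC\n+\nGGGG\n' > "$tmpdir/R2.fastq"

# Step 1: save the ID of every record whose sequence line is empty, in either file.
awk 'FNR % 4 == 1 { id = $1 }
     FNR % 4 == 2 && $0 == "" { print id }' \
    "$tmpdir/R1.fastq" "$tmpdir/R2.fastq" | sort -u > "$tmpdir/empty.ids"

# Step 2: drop every record whose ID is on that list, from both files.
# FILENAME == ARGV[1] (rather than NR == FNR) stays correct if the list is empty.
for f in R1 R2; do
    awk 'FILENAME == ARGV[1] { drop[$1]; next }
         FNR % 4 == 1 { keep = !($1 in drop) }
         keep' "$tmpdir/empty.ids" "$tmpdir/$f.fastq" > "$tmpdir/$f.clean.fastq"
done

echo "R1 records kept: $(( $(awk 'END { print NR }' "$tmpdir/R1.clean.fastq") / 4 ))"
echo "R2 records kept: $(( $(awk 'END { print NR }' "$tmpdir/R2.clean.fastq") / 4 ))"
```

The ID list is held in memory by awk, which should be fine when only a handful of reads are empty; the fastq files themselves are streamed line by line.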

The data shown in the question appear to be unpaired reads.
