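I have a text file whose contents look like the output below. I only need the mapping filepath name and the total number of sequences written from this file. Please suggest how to get these two values into a separate text file.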
Input file paths
**Mapping filepath: map_leaf_M_BAN.AC.txt** (md5: a746e6e6227fafebc545d7a7e107d55c)
Sequence read filepath: leaf-45_S51_L001.m150-p1.join.fq (md5: 8753a0afe8b89d7768e911142a1536fe)
Quality filter results
Total number of input sequences: 32992
Barcode not in mapping file: 0
Read too short after quality truncation: 682
Count of N characters exceeds limit: 0
Illumina quality digit = 0: 0
Barcode errors exceed max: 0
Result summary (after quality filtering)
Median sequence length: 273.00
LMBANAC 32310
**Total number seqs written 32310**
Kind regards
Answer 1
Simple pipes and text tools can do the job:
walt@bat:~(0)$ grep -E -o 'Mapping filepath: [^*]+' Data.file | cut "-d " -f3
map_leaf_M_BAN.AC.txt
# Note the following regexp is fixed below - user's file had a TAB
walt@bat:~(0)$ grep -E -o 'Total number seqs written +[0-9]+' Data.file | awk '{print $5}'
32310
Since the file contains a TAB character (as the comments revealed):
$ grep "Total number seqs written" split_library_log.txt | cat -t
Total number seqs written^I32992
Total number seqs written^I38519
the second grep command should be:
grep -E -o 'Total number seqs written[[:space:]]+[0-9]+' Data.file | awk '{print $5}'
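To see why the TAB matters, here is a small self-contained check. It builds a sample line with a real TAB (the sample value is taken from the log above) and runs both patterns against it. The variable names are just for illustration.

```shell
# Sample line with a real TAB between the label and the number.
line=$(printf 'Total number seqs written\t32310')

# ' +' matches spaces only, so a TAB-separated line yields nothing.
# "|| true" only keeps the script going under "set -e -o pipefail",
# because grep exits with status 1 when it finds no match.
space_only=$(printf '%s\n' "$line" |
    grep -E -o 'Total number seqs written +[0-9]+' | awk '{print $5}' || true)

# [[:space:]]+ matches spaces and TABs alike.
any_space=$(printf '%s\n' "$line" |
    grep -E -o 'Total number seqs written[[:space:]]+[0-9]+' | awk '{print $5}')

printf 'space only:     "%s"\n' "$space_only"
printf 'any whitespace: "%s"\n' "$any_space"
```

Note that awk's default field splitting already treats TABs and spaces the same way, so `$5` works in both cases. Only the grep pattern needed fixing.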
And of course, read man grep; man cut; man awk; man 7 regex.
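Since the question asks for both values in a separate text file, here is a minimal sketch that pulls them out in one awk pass and prints them TAB-separated, one line per "Total number seqs written" entry. The function name `extract_summary` and the output name `summary.txt` are made up for this example. It assumes the mapping file name contains no spaces and that the count is the last field on its line.

```shell
# Print "<mapping file><TAB><seqs written>" for each result block in the log(s).
extract_summary() {
    awk '
        /Mapping filepath:/ {
            sub(/.*Mapping filepath:[ \t]*/, "")   # drop everything up to the name
            sub(/[* \t(].*/, "")                   # drop "**", "(md5: ...)" and the like
            path = $0
        }
        /Total number seqs written/ {
            gsub(/\*/, "")                         # strip bold markers if present
            print path "\t" $NF                    # last field is the count
        }
    ' "$@"
}

# Example call (file names are placeholders):
#   extract_summary split_library_log.txt > summary.txt
```

Because awk remembers the most recent mapping file name, this should also cope with a log that holds several result blocks, pairing each count with the mapping file that precedes it.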