Extracting data from a text file


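I have a text file containing result output like that shown below. I need only the mapping file path name and the total number of sequences written. Please suggest how to get these two values into a separate text file.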

Input file paths

**Mapping filepath: map_leaf_M_BAN.AC.txt** (md5: a746e6e6227fafebc545d7a7e107d55c)

Sequence read filepath: leaf-45_S51_L001.m150-p1.join.fq (md5: 8753a0afe8b89d7768e911142a1536fe)

Quality filter results

Total number of input sequences: 32992

Barcode not in mapping file: 0

Read too short after quality truncation: 682

Count of N characters exceeds limit: 0

Illumina quality digit = 0: 0

Barcode errors exceed max: 0

Result summary (after quality filtering)

Median sequence length: 273.00

LMBANAC 32310


**Total number seqs written       32310**

Kind regards

Answer 1

Simple pipelines and text tools can do the job:

walt@bat:~(0)$ grep -E -o 'Mapping filepath: [^*]+' Data.file | cut "-d " -f3
map_leaf_M_BAN.AC.txt
             # Note: the following regexp is corrected further down, because the user's file had a TAB
walt@bat:~(0)$ grep -E -o 'Total number seqs written +[0-9]+' Data.file | awk '{print $5}'
32310

Because the file contains a TAB character (as the comments revealed),

$ grep "Total number seqs written" split_library_log.txt | cat -t 
Total number seqs written^I32992 
Total number seqs written^I38519 

the second grep command should be:

 grep -E -o 'Total number seqs written[[:space:]]+[0-9]+' Data.file | awk '{print $5}'
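
To get both values into a separate file, as the question asks, the two steps can be combined in a single awk pass. The following is only a sketch: the function name `extract_summary` and the tab-separated output format are assumptions made for illustration, and it assumes that the mapping file name contains no spaces and that the "Mapping filepath" line comes before the "Total number seqs written" line.

```shell
# Hypothetical helper: print "mapping-file<TAB>count" for each log file given.
# Tolerates spaces or tabs before the count, and the "**" emphasis markers.
extract_summary() {
    awk '
        /Mapping filepath:/ {
            map = $0
            sub(/.*Mapping filepath:[ \t]*/, "", map)  # strip everything up to the label
            sub(/[* \t].*/, "", map)                   # strip "**", " (md5: ...)" and the rest
        }
        /Total number seqs written/ {
            n = $0
            sub(/.*written[ \t]+/, "", n)              # strip the label and spaces or tabs
            sub(/[^0-9].*/, "", n)                     # strip anything after the digits
            print map "\t" n
        }
    ' "$@"
}
```

It could then be called as `extract_summary split_library_log.txt > summary.txt`, where `summary.txt` is a hypothetical output file name.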

And of course, read man grep; man cut; man awk; man 7 regex.
