How can I use awk to extract the required columns and create a new file?

I have gtf files in more than 100 directories. Their layout is shown below.

SampleA
   |___________ SampleA.GRCh38.gtf
SampleB
   |___________ SampleB.GRCh38.gtf

Only two gtf files are shown here as examples.

SampleA.GRCh38.gtf looks like this:

# stringtie -e -B -p 8 -G /path/stringtie_output/stringtie_merged.gtf -o /path/SampleA.GRCh38.gtf /path/SampleA.sorted.bam
# StringTie version 1.3.3
chr1    StringTie       transcript      11594   191502  .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
chr1    StringTie       exon    11594   14829   .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "1"; cov "0.0";
chr1    StringTie       exon    14970   15038   .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "2"; cov "0.0";
chr1    StringTie       exon    15796   16765   .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "3"; cov "0.0";
chr1    StringTie       exon    16858   17055   .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "4"; cov "0.0";
chr1    StringTie       exon    17233   17742   .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "5"; cov "0.0";
chr1    StringTie       exon    17915   18061   .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "6"; cov "0.0";
chr1    StringTie       exon    18268   19364   .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "7"; cov "0.0";
chr1    StringTie       exon    189836  191502  .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "8"; cov "0.0";
chr1    StringTie       transcript      11594   195411  .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
chr1    StringTie       exon    11594   14829   .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "1"; cov "0.0";
chr1    StringTie       exon    14970   15236   .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "2"; cov "0.0";
chr1    StringTie       exon    185758  187287  .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "3"; cov "0.0";
chr1    StringTie       exon    187376  187577  .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "4"; cov "0.0";
chr1    StringTie       exon    187755  187890  .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "5"; cov "0.0";
chr1    StringTie       exon    188130  188266  .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "6"; cov "0.0";
chr1    StringTie       exon    188439  188584  .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "7"; cov "0.0";
chr1    StringTie       exon    188791  188902  .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "8"; cov "0.0";
chr1    StringTie       exon    195263  195411  .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "9"; cov "0.0";
chr1    StringTie       transcript      11594   197912  .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.5"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";

SampleB.GRCh38.gtf looks like this:

# stringtie -e -B -p 8 -G /path/stringtie_output/stringtie_merged.gtf -o /path/SampleB.GRCh38.gtf /path/SampleB.sorted.bam
# StringTie version 1.3.3
chr1    StringTie       transcript      11594   191502  .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; cov "0.0"; FPKM "0.000000"; TPM "1.000000";
chr1    StringTie       exon    11594   14829   .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "1"; cov "0.0";
chr1    StringTie       exon    14970   15038   .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "2"; cov "0.0";
chr1    StringTie       exon    15796   16765   .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "3"; cov "0.0";
chr1    StringTie       exon    16858   17055   .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "4"; cov "0.0";
chr1    StringTie       exon    17233   17742   .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "5"; cov "0.0";
chr1    StringTie       exon    17915   18061   .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "6"; cov "0.0";
chr1    StringTie       exon    18268   19364   .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "7"; cov "0.0";
chr1    StringTie       exon    189836  191502  .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "8"; cov "0.0";
chr1    StringTie       transcript      11594   195411  .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; cov "0.0"; FPKM "0.000000"; TPM "3.000000";
chr1    StringTie       exon    11594   14829   .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "1"; cov "0.0";
chr1    StringTie       exon    14970   15236   .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "2"; cov "0.0";
chr1    StringTie       exon    185758  187287  .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "3"; cov "0.0";
chr1    StringTie       exon    187376  187577  .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "4"; cov "0.0";
chr1    StringTie       exon    187755  187890  .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "5"; cov "0.0";
chr1    StringTie       exon    188130  188266  .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "6"; cov "0.0";
chr1    StringTie       exon    188439  188584  .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "7"; cov "0.0";
chr1    StringTie       exon    188791  188902  .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "8"; cov "0.0";
chr1    StringTie       exon    195263  195411  .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "9"; cov "0.0";
chr1    StringTie       transcript      11594   197912  .       -       .       gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.5"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";

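I want to keep only the rows whose third column is transcript and, from those rows, extract transcript_id, which is the tenth column, and TPM, which is the last column. The TPM column should be headed by the sample name.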

I would like the output to look like this:

Type        transcript_id      SampleA      SampleB
transcript   MSTRG.7542.2      0.000000     1.000000
transcript   MSTRG.7542.6      0.000000     3.000000
transcript   MSTRG.7542.5      0.000000     1.000000

Answer 1

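You will need to extract the relevant records from each file and write the results to two new temporary files (probably using awk), possibly sorting them at the same time (using sort). The sample files appear to be sorted, but perhaps not on the correct key. Here is an example that processes one of the files: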

awk '$3 == "transcript" {printf("%s %s %s\n", $3, $12, $18);}' SampleA.GRCh38.gtf | sort -k 2,2 > tf1

You can then use join to merge the two temporary/intermediate files produced by awk, so that each record carries the final column from both files.

Here is an example of the join command you might use:

join -o 1.1,1.2,1.3,2.3 -1 2 -2 2 tf1 tf2

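You may want to print a header line before running join (e.g. with printf), and you may want to replace the spaces in join's output with tabs (e.g. with sed), or use another awk script to format the output.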

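From these examples you should be able to write a script that processes both files and produces the desired output (and cleans up the temporary files, etc.).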
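One way those pieces might fit together is sketched below. It is not the answerer's script: the function name merge_two_tpm, the temporary file names, and the rule that the sample name is the file name up to the first dot are all assumptions made for illustration. It also strips the quotes and semicolons from the values, which the one-liner above leaves in place.

```shell
# Sketch: join the transcript rows of two GTF files on transcript_id and
# print a tab-separated table. merge_two_tpm is a hypothetical name.
merge_two_tpm() (
    export LC_ALL=C                      # same collation for sort and join
    tmp=$(mktemp -d) || exit 1

    # Print "transcript <id> <TPM>" for transcript rows, with quotes and
    # semicolons stripped, sorted on the join key (field 2).
    extract() {
        awk '$3 == "transcript" {
                 id = $12; tpm = $18
                 gsub(/[";]/, "", id); gsub(/[";]/, "", tpm)
                 print $3, id, tpm
             }' "$1" | sort -k 2,2
    }
    extract "$1" > "$tmp/tf1"
    extract "$2" > "$tmp/tf2"

    # Header: sample name assumed to be the file name up to the first dot.
    printf 'Type\ttranscript_id\t%s\t%s\n' \
        "$(basename "$1" | cut -d. -f1)" "$(basename "$2" | cut -d. -f1)"

    # Join on field 2 of both files, then turn the separators into tabs.
    join -1 2 -2 2 -o 1.1,1.2,1.3,2.3 "$tmp/tf1" "$tmp/tf2" | tr ' ' '\t'
    rm -rf "$tmp"
)
```

It could be called as merge_two_tpm SampleA/SampleA.GRCh38.gtf SampleB/SampleB.GRCh38.gtf. Because of the sort, rows come out ordered by transcript_id rather than in file order, and join drops any transcript that is missing from either file.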

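Note that, depending on the size of the data files, you could even do all of this in a single awk (or python, perl, etc.) program, provided the selected data from both files fits comfortably in memory at once.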
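A minimal sketch of that single-program approach follows, assuming the selected rows fit in memory. The function name tpm_table, the "NA" placeholder for a transcript missing from a sample, and the rule that the sample name is the file name up to the first dot are illustrative assumptions, not part of the answer.

```shell
# Sketch: build the whole table in one awk pass over any number of GTF files.
# tpm_table is a hypothetical name.
tpm_table() {
    awk '
        FNR == 1 {                        # first line of each input file
            n = split(FILENAME, parts, "/")
            sample = parts[n]
            sub(/\..*/, "", sample)       # SampleA.GRCh38.gtf -> SampleA
            samples[++ns] = sample
        }
        $3 == "transcript" {
            id = $12; tpm = $NF
            gsub(/[";]/, "", id); gsub(/[";]/, "", tpm)
            if (!(id in seen)) { seen[id] = 1; ids[++ni] = id }
            val[id, sample] = tpm         # keyed by transcript and sample
        }
        END {
            printf "Type\ttranscript_id"
            for (s = 1; s <= ns; s++) printf "\t%s", samples[s]
            printf "\n"
            for (i = 1; i <= ni; i++) {   # rows in order of first appearance
                printf "transcript\t%s", ids[i]
                for (s = 1; s <= ns; s++) {
                    key = ids[i] SUBSEP samples[s]
                    # "NA" is an assumed placeholder for a missing value
                    printf "\t%s", ((key in val) ? val[key] : "NA")
                }
                printf "\n"
            }
        }
    ' "$@"
}
```

With the directory layout from the question it might be run as tpm_table Sample*/*.GRCh38.gtf > tpm_table.tsv, which gives one column per file and keeps transcripts in the order they first appear.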

Answer 2

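You could simply join the files and then use awk to keep only the lines with NF==4, since only the lines you are interested in have an 18th field. All other lines will have only 2 fields.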

The script below also makes certain assumptions about how the path to SampleB is derived, but you can adapt that to suit...

while IFS= read -r -d '' f; do                             #read the list of SampleA
        g=$(echo "$f" | sed "s/pleA/pleB/g")               #calculate path to SampleB
        if [[ -f "$g" ]]; then                             #check SampleB exists
                echo "$f" | sed "s/.*pleA\.//g"            #print sample No
                echo "Type transcript_id SampleA SampleB"  #print header
                                                           #do the join
                join -j 12 -o 1.3 -o 1.12 -o 1.18 -o2.18 <(sort -k 12 "$f") <(sort -k 12 "$g") | awk 'NF==4'
        fi   | sed 's/[;"]//g'| column -t                  #make it pretty
done < <(find . -type f -iname "*SampleA*" -print0)        #NULL separated list of SampleA

Answer 3

Try the following commands.

Step 1

awk '$3 ~ /transcript/{print $0}' file1|awk '{print $3,substr($12,2,12),substr($NF,2,8)}' > out1

Step 2

awk '$3 == "transcript" {print substr($NF,2,8)}' file2  > out2

Step 3

paste out1 out2 | awk 'BEGIN{print "Type        transcript_id      SampleA      SampleB"}{print $0}'



Output

Type       transcript_id SampleA    SampleB
transcript MSTRG.7542.2 0.000000    1.000000
transcript MSTRG.7542.6 0.000000    3.000000
transcript MSTRG.7542.5 0.000000    0.000000
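Because paste pairs rows purely by position, the result above is only correct when both files list the same transcripts in the same order. A small check along the following lines could be run first. The helper name same_transcript_order is hypothetical and the check is an addition, not part of the answer.

```shell
# Sketch: succeed only if both GTF files list the same transcript_ids in the
# same order, which a positional paste relies on.
# same_transcript_order is a hypothetical helper name.
same_transcript_order() (
    tmp=$(mktemp -d) || exit 1
    # Field 12 is the quoted transcript_id on transcript rows.
    awk '$3 == "transcript" { print $12 }' "$1" > "$tmp/a"
    awk '$3 == "transcript" { print $12 }' "$2" > "$tmp/b"
    cmp -s "$tmp/a" "$tmp/b"              # silent byte-for-byte comparison
    status=$?
    rm -rf "$tmp"
    exit "$status"
)
```

For example, same_transcript_order file1 file2 && paste out1 out2 runs the paste only when the check passes.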
