How do I join files on the required columns in Linux?

I have many files, for example the following ones in the directory "results":

58052 results/TB1.genes.results
198003 results/TB1.isoforms.results
58052 results/TB2.genes.results
198003 results/TB2.isoforms.results
58052 results/TB3.genes.results
198003 results/TB3.isoforms.results
58052 results/TB4.genes.results
198003 results/TB4.isoforms.results

For example, the file TB1.genes.results looks like this:

gene_id transcript_id(s)        length  effective_length        expected_count  TPM     FPKM
ENSG00000000003 ENST00000373020,ENST00000494424,ENST00000496771,ENST00000612152,ENST00000614008 2206.00 1997.20 1.00    0.00    0.01
ENSG00000000005 ENST00000373031,ENST00000485971 940.50  731.73  0.00    0.00    0.00
ENSG00000000419 ENST00000371582,ENST00000371584,ENST00000371588,ENST00000413082,ENST00000466152,ENST00000494752 977.15  768.35  1865.00 14.27   37.82
ENSG00000000457 ENST00000367770,ENST00000367771,ENST00000367772,ENST00000423670,ENST00000470238 3779.11 3570.31 1521.00 2.50    6.64
ENSG00000000460 ENST00000286031,ENST00000359326,ENST00000413811,ENST00000459772,ENST00000466580,ENST00000472795,ENST00000481744,ENST00000496973,ENST00000498289 1936.74 1727.94 1860.00 6.33    16.77
ENSG00000000938 ENST00000374003,ENST00000374004,ENST00000374005,ENST00000399173,ENST00000457296,ENST00000468038,ENST00000475472 2020.10 1811.30 6846.00 22.22   58.90
ENSG00000000971 ENST00000359637,ENST00000367429,ENST00000466229,ENST00000470918,ENST00000496761,ENST00000630130 2587.83 2379.04 0.00    0.00    0.00
ENSG00000001036 ENST00000002165,ENST00000367585,ENST00000451668 1912.64 1703.85 1358.00 4.69    12.42
ENSG00000001084 ENST00000229416,ENST00000504353,ENST00000504525,ENST00000505197,ENST00000505294,ENST00000509541,ENST00000510837,ENST00000513939,ENST00000514004,ENST00000514373,ENST00000514933,ENST00000515580,ENST00000616923      2333.50 2124.73 1178.00 3.26    8.64

The other files have the same columns. To join the "gene_id" and "expected_count" columns from all the "genes.results" files into one text file, I used the following command:

paste results/*.genes.results | tail -n+2 | cut -f1,5,12,19,26 > final.genes.rsem.txt

[-f1 (gene_id), 5 (expected_count column from TB1.genes.results), 12 (expected_count column from TB2.genes.results), 
19 (expected_count column from TB3.genes.results), 26 (expected_count column from TB4.genes.results)]

"final.genes.rsem.txt" now contains the gene_id column and the expected_count column from each file:

ENSG00000000003 1.00    0.00    3.00    2.00
ENSG00000000005 0.00    0.00    0.00    0.00
ENSG00000000419 1865.00 1951.00 5909.00 8163.00
ENSG00000000457 1521.00 1488.00 849.00  1400.00
ENSG00000000460 1860.00 1616.00 2577.00 2715.00
ENSG00000000938 6846.00 5298.00 1.00    2.00
ENSG00000000971 0.00    0.00    6159.00 7069.00
ENSG00000001036 1358.00 1186.00 6196.00 7009.00
ENSG00000001084 1178.00 1186.00 631.00  1293.00

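My question: because I only have a few samples, I could type the column numbers into the command by hand (as in cut -f1,5,12,19,26). What should I do if I have more than 100 samples? How can I join them on the required columns?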

Answer 1

Use GNU awk. I put the command in a bash script, which is more convenient.

Usage: ./join_files.sh or, for pretty-printed output: ./join_files.sh | column -t

#!/bin/bash

gawk '
NR == 1 {
    PROCINFO["sorted_in"] = "@ind_str_asc";
    header = $1;
}

FNR == 1 {
    file = gensub(/.*\/([^.]*)\..*/, "\\1", "g", FILENAME); 
    header = header OFS file;   
}

FNR > 1 {
    arr[$1] = arr[$1] OFS $5;
}

END {
    print header;

    for(i in arr) {
        print i arr[i];
    }
}' results/*.genes.results

Output (I created three files with identical content for testing):

$ ./join_files.sh | column -t
gene_id          TB1      TB2      TB3
ENSG00000000003  1.00     1.00     1.00
ENSG00000000005  0.00     0.00     0.00
ENSG00000000419  1865.00  1865.00  1865.00
ENSG00000000457  1521.00  1521.00  1521.00
ENSG00000000460  1860.00  1860.00  1860.00
ENSG00000000938  6846.00  6846.00  6846.00
ENSG00000000971  0.00     0.00     0.00
ENSG00000001036  1358.00  1358.00  1358.00
ENSG00000001084  1178.00  1178.00  1178.00

Explanation: the same code with comments added. See also man gawk.

gawk '
# NR - the total number of input records read so far, across all files.
# NR == 1 is true only for the first line of the first file.

NR == 1 {
    # If PROCINFO["sorted_in"] is set, its value controls the order in which
    # array elements are traversed by a "for (i in arr)" loop; otherwise the
    # order is undefined. "@ind_str_asc" sorts by index, compared as strings,
    # which suits non-numeric keys such as gene IDs.

    PROCINFO["sorted_in"] = "@ind_str_asc";

    # Each field in the input record may be referenced by its position: $1, $2, and so on.
    # $1 - is the first field or the first column. 
    # The first field in the first line is the "gene_id" word;
    # Assign it to the header variable.

    header = $1;
}

# FNR - the input record number in the current input file.
# NR is the total lines counter, FNR is the current file lines counter.
# FNR == 1 - if it is the first line of the current file.

FNR == 1 {
    # Strip everything but the sample name from the file name with gensub():
    # before - results/TB1.genes.results
    # after  - TB1

    file = gensub(/.*\/([^.]*)\..*/, "\\1", "g", FILENAME); 

    # and add it to the header variable, concatenating it with the 
    # previous content of the header, using OFS as delimiter.
    # OFS - the output field separator, a space by default.

    header = header OFS file;   
}

# The accumulation happens here.
# $1 - the first column value - "gene_id"
# $5 - the fifth column value - "expected_count"
FNR > 1 {
    # Build an array indexed by gene_id: arr["ENSG00000000003"], arr["ENSG00000000419"], and so on.
    # Every time a line with a given gene_id is read, its $5 value is appended
    # to that array element, preceded by OFS.

    # Example:
    # arr["ENSG00000000003"] = 1.00
    # arr["ENSG00000000003"] = 1.00 2.00
    # arr["ENSG00000000003"] = 1.00 2.00 3.00

    arr[$1] = arr[$1] OFS $5;
}

END {
    print header;

    for(i in arr) {
        print i arr[i];
    }
}' results/*.genes.results
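
A portable variant of the same idea, as a sketch only: it assumes tab-separated input and swaps the gawk-specific gensub() and PROCINFO["sorted_in"] for plain sub() calls and an explicit order array, so genes come out in first-seen order rather than sorted. The demo directory, the two tiny input files, and the names demo_dir, merged, counts and order are made up for illustration.

```shell
#!/bin/bash
# Hypothetical demo data: two tiny tab-separated files in a temporary directory.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/results"
for sample in TB1 TB2; do
    {
        printf 'gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM\n'
        printf 'ENSG00000000003\tENST00000373020\t2206.00\t1997.20\t1.00\t0.00\t0.01\n'
        printf 'ENSG00000000005\tENST00000373031\t940.50\t731.73\t0.00\t0.00\t0.00\n'
    } > "$demo_dir/results/$sample.genes.results"
done

merged=$(
    cd "$demo_dir" &&
    awk -F'\t' -v OFS='\t' '
        FNR == 1 {
            name = FILENAME
            sub(/.*\//, "", name)     # drop the directory part
            sub(/\..*/, "", name)     # drop everything from the first dot
            header = (NR == 1 ? $1 : header) OFS name
            next                      # header lines carry no counts
        }
        !($1 in counts) { order[++n] = $1 }   # remember first-seen gene order
        { counts[$1] = counts[$1] OFS $5 }    # append this file expected_count
        END {
            print header
            for (i = 1; i <= n; i++) print order[i] counts[order[i]]
        }' results/*.genes.results
)

printf '%s\n' "$merged"
rm -rf "$demo_dir"
```

Because the order array replaces the sorted traversal, this should also run under awk implementations other than gawk; pipe the body through sort if sorted gene IDs are needed.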

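Answer 2

If I understand your question correctly, you want to know how to handle the case where you need to output many columns. The cut command you are using understands column ranges. For example, to output columns 1 and 5, columns 7 through 13, and everything from column 17 to the last one, use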

cut -f1,5,7-13,17-

Alternatively, you can use cut to exclude specific fields. For example, to exclude field 5:

cut --complement -f5

Since all you want to do, as far as I can see, is remove the second column, i.e. transcript_id, I would use

cut --complement -f2
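
A quick way to see both forms in action, as a sketch on a made-up eight-column record (the names row, picked and dropped are hypothetical). The option is spelled --complement and is a GNU cut extension, so this assumes GNU coreutils.

```shell
#!/bin/bash
# Hypothetical eight-column, tab-separated record (c1..c8).
row=$(printf 'c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8')

# Field 1, field 5, and everything from field 7 onwards.
picked=$(printf '%s\n' "$row" | cut -f1,5,7-)

# Everything except field 2 (GNU cut only).
dropped=$(printf '%s\n' "$row" | cut --complement -f2)

printf '%s\n%s\n' "$picked" "$dropped"
```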

P.S. Note that the data you provided does not match your script. I guess you simplified it and removed some columns.
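
For the paste approach from the question, the field list can be computed instead of typed by hand. A minimal sketch, assuming GNU seq, seven columns per file, and expected_count in column 5 as in the sample shown in the question; the variable names are hypothetical.

```shell
#!/bin/bash
# Hypothetical inputs: the sample count and the column layout from the question.
n_samples=4        # e.g. 100 for a larger experiment
cols_per_file=7    # columns in each *.genes.results file
wanted_col=5       # position of expected_count inside one file

# seq FIRST INCREMENT LAST; -s, joins the numbers with commas (GNU seq).
last=$(( (n_samples - 1) * cols_per_file + wanted_col ))
fields="1,$(seq -s, "$wanted_col" "$cols_per_file" "$last")"
echo "$fields"

# The list can then be passed to cut, for example:
# paste results/*.genes.results | tail -n +2 | cut -f"$fields" > final.genes.rsem.txt
```

With n_samples=4 this prints 1,5,12,19,26, the list used in the question. Keep in mind that paste lines the files up row by row without checking gene_id, so every file must list the genes in the same order, and that the shell typically expands results/*.genes.results in lexical order (TB1, TB10, TB100, TB11, ...), so the output columns follow that order.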
