如何在linux中根据列将单个文件拆分为多个文件?

如何在linux中根据列将单个文件拆分为多个文件?

我有一个包含以下信息的文本文件:

Hugo_Symbol Tumor_Sample_Barcode    Entrez_Gene_Id  Center  NCBI_Build
MTHFR   TCGA-BD-A2L6-01A-11D-A20W-10    4524    BCM GRCh38
SLC30A1 TCGA-BD-A2L6-01A-11D-A20W-10    7779    BCM GRCh38
USH2A   TCGA-BD-A2L6-01A-11D-A20W-10    7399    BCM GRCh38
SOS1    TCGA-BD-A2L6-01A-11D-A20W-10    6654    BCM GRCh38
TMEM51  TCGA-O8-A75V-01A-11D-A32G-10    55092   BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10    2312    BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10    2312    BCM GRCh38
PRDM16  TCGA-G3-A7M5-01A-11D-A33Q-10    63976   BCM GRCh38
DNAJC11 TCGA-G3-A7M5-01A-11D-A33Q-10    55735   BCM GRCh38
HNRNPCL2    TCGA-G3-A7M5-01A-11D-A33Q-10    440563  BCM GRCh38
C1orf94 TCGA-G3-A7M5-01A-11D-A33Q-10    84970   BCM GRCh38
NFYC    TCGA-G3-A7M5-01A-11D-A33Q-10    4802    BCM GRCh38
IPP TCGA-G3-A7M5-01A-11D-A33Q-10    3652    BCM GRCh38

正如您所看到的,有多个示例,我想根据“Tumor_Sample_Barcode”列将文件拆分为多个文件。输出文件需要以samplename.txt 命名。

第一个输出 - TCGA-BD-A2L6-01A-11D-A20W-10.txt

Hugo_Symbol Tumor_Sample_Barcode    Entrez_Gene_Id  Center  NCBI_Build
MTHFR   TCGA-BD-A2L6-01A-11D-A20W-10    4524    BCM GRCh38
SLC30A1 TCGA-BD-A2L6-01A-11D-A20W-10    7779    BCM GRCh38
USH2A   TCGA-BD-A2L6-01A-11D-A20W-10    7399    BCM GRCh38
SOS1    TCGA-BD-A2L6-01A-11D-A20W-10    6654    BCM GRCh38

第二个输出 - TCGA-O8-A75V-01A-11D-A32G-10.txt

Hugo_Symbol Tumor_Sample_Barcode    Entrez_Gene_Id  Center  NCBI_Build
TMEM51  TCGA-O8-A75V-01A-11D-A32G-10    55092   BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10    2312    BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10    2312    BCM GRCh38

第三个输出 - TCGA-G3-A7M5-01A-11D-A33Q-10.txt

Hugo_Symbol Tumor_Sample_Barcode    Entrez_Gene_Id  Center  NCBI_Build
PRDM16  TCGA-G3-A7M5-01A-11D-A33Q-10    63976   BCM GRCh38
DNAJC11 TCGA-G3-A7M5-01A-11D-A33Q-10    55735   BCM GRCh38
HNRNPCL2    TCGA-G3-A7M5-01A-11D-A33Q-10    440563  BCM GRCh38
C1orf94 TCGA-G3-A7M5-01A-11D-A33Q-10    84970   BCM GRCh38
NFYC    TCGA-G3-A7M5-01A-11D-A33Q-10    4802    BCM GRCh38
IPP TCGA-G3-A7M5-01A-11D-A33Q-10    3652    BCM GRCh38

这个linux怎么办?

答案1

Awk解决方案:

awk 'NR==1{ h=$0 }NR>1{ print (!a[$2]++? h ORS $0 : $0) > $2".txt" }' file
  • NR==1{ h=$0 }- 捕获第一行/记录为标头line(NR指向记录号,$0-包含当前行)
  • NR > 1- 对于除第一个记录之外的所有记录:
    • <cond>? <operand_1> : <operand_2>- 经典三元运算符
    • !a[$2]++?- 检查第一次出现条码$2用作关联数组的键的值a
    • h ORS $0- 公共标题行与ORS(输出记录分隔符,默认为\n)和当前记录连接$0
    • print ... > $2".txt"- 将自定义内容或当前行(如果未指定任何内容)打印到文件中<barcode_value>.txt

或者一个更不言自明的版本:

awk 'NR==1 {header = $0; next}
     !header_printed[$2]++ {print header > $2".txt"}
     {print > $2".txt"}' < file

查看结果:

$ head TCGA*.txt
==> TCGA-BD-A2L6-01A-11D-A20W-10.txt <==
Hugo_Symbol Tumor_Sample_Barcode    Entrez_Gene_Id  Center  NCBI_Build
MTHFR   TCGA-BD-A2L6-01A-11D-A20W-10    4524    BCM GRCh38
SLC30A1 TCGA-BD-A2L6-01A-11D-A20W-10    7779    BCM GRCh38
USH2A   TCGA-BD-A2L6-01A-11D-A20W-10    7399    BCM GRCh38
SOS1    TCGA-BD-A2L6-01A-11D-A20W-10    6654    BCM GRCh38

==> TCGA-G3-A7M5-01A-11D-A33Q-10.txt <==
Hugo_Symbol Tumor_Sample_Barcode    Entrez_Gene_Id  Center  NCBI_Build
PRDM16  TCGA-G3-A7M5-01A-11D-A33Q-10    63976   BCM GRCh38
DNAJC11 TCGA-G3-A7M5-01A-11D-A33Q-10    55735   BCM GRCh38
HNRNPCL2    TCGA-G3-A7M5-01A-11D-A33Q-10    440563  BCM GRCh38
C1orf94 TCGA-G3-A7M5-01A-11D-A33Q-10    84970   BCM GRCh38
NFYC    TCGA-G3-A7M5-01A-11D-A33Q-10    4802    BCM GRCh38
IPP TCGA-G3-A7M5-01A-11D-A33Q-10    3652    BCM GRCh38

==> TCGA-O8-A75V-01A-11D-A32G-10.txt <==
Hugo_Symbol Tumor_Sample_Barcode    Entrez_Gene_Id  Center  NCBI_Build
TMEM51  TCGA-O8-A75V-01A-11D-A32G-10    55092   BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10    2312    BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10    2312    BCM GRCh38

根据 15 个字符的序列调整文件名条码价值:

awk 'NR==1{ h=$0 }NR>1{ print (!a[$2]++? h ORS $0 : $0) > substr($2, 1, 15)".txt" }' file 

相关内容