Split a file every time the content of a given column changes


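I am working with a text file and have run into a problem that I can't solve, despite some googling and asking around.

I want to split this file (20,880 lines) into separate files based on the content of column 2 (columns are delimited by |). Every time the content of column 2 changes, I need a new file. Unfortunately, the number of lines for each value of column 2 is irregular, so I can't simply split the file every n lines.

Here are the first few lines of the original file: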

>00000000|gene_cluster:GC_00001105|genome_name:r7534_20160316|gene_callers_id:24
>00000001|gene_cluster:GC_00001105|genome_name:r7537_20160321|gene_callers_id:78
>00000002|gene_cluster:GC_00001105|genome_name:r7541_20160426|gene_callers_id:774
>00000003|gene_cluster:GC_00001105|genome_name:r7544_20160502|gene_callers_id:1034
>00000004|gene_cluster:GC_00001105|genome_name:r7547_20160512|gene_callers_id:330
>00000005|gene_cluster:GC_00001105|genome_name:r7550_20160517|gene_callers_id:2094
>00000006|gene_cluster:GC_00001290|genome_name:r7534_20160316|gene_callers_id:76
>00000007|gene_cluster:GC_00001290|genome_name:r7537_20160321|gene_callers_id:358
>00000008|gene_cluster:GC_00001290|genome_name:r7541_20160426|gene_callers_id:1601
>00000009|gene_cluster:GC_00001290|genome_name:r7544_20160502|gene_callers_id:2134

I then sorted it by the second column, which gives:

>00006406|gene_cluster:GC_00000001|genome_name:r7534_20160316|gene_callers_id:1988
>00006409|gene_cluster:GC_00000001|genome_name:r7537_20160321|gene_callers_id:1059
>00006410|gene_cluster:GC_00000001|genome_name:r7537_20160321|gene_callers_id:1811
>00006407|gene_cluster:GC_00000001|genome_name:r7537_20160321|gene_callers_id:1947
>00006411|gene_cluster:GC_00000001|genome_name:r7537_20160321|gene_callers_id:643
>00006408|gene_cluster:GC_00000001|genome_name:r7537_20160321|gene_callers_id:759
>00006412|gene_cluster:GC_00000001|genome_name:r7541_20160426|gene_callers_id:1252
>00006415|gene_cluster:GC_00000001|genome_name:r7541_20160426|gene_callers_id:1920
>00006414|gene_cluster:GC_00000001|genome_name:r7541_20160426|gene_callers_id:2021
>00006413|gene_cluster:GC_00000001|genome_name:r7541_20160426|gene_callers_id:2094

But I haven't figured out how to split the file each time the second column changes. How can I split this file?

Thanks!

Answer 1

Using awk:

$ awk -F"|" '{print > $2}' input_file
$ head gene_cluster*
==> gene_cluster:GC_00001105 <==
>00000000|gene_cluster:GC_00001105|genome_name:r7534_20160316|gene_callers_id:24
>00000001|gene_cluster:GC_00001105|genome_name:r7537_20160321|gene_callers_id:78
>00000002|gene_cluster:GC_00001105|genome_name:r7541_20160426|gene_callers_id:774
>00000003|gene_cluster:GC_00001105|genome_name:r7544_20160502|gene_callers_id:1034
>00000004|gene_cluster:GC_00001105|genome_name:r7547_20160512|gene_callers_id:330
>00000005|gene_cluster:GC_00001105|genome_name:r7550_20160517|gene_callers_id:2094

==> gene_cluster:GC_00001290 <==
>00000006|gene_cluster:GC_00001290|genome_name:r7534_20160316|gene_callers_id:76
>00000007|gene_cluster:GC_00001290|genome_name:r7537_20160321|gene_callers_id:358
>00000008|gene_cluster:GC_00001290|genome_name:r7541_20160426|gene_callers_id:1601
>00000009|gene_cluster:GC_00001290|genome_name:r7544_20160502|gene_callers_id:2134
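
As a minimal sketch of why this one-liner does not depend on the input being sorted: within a single awk run, `>` truncates a file only when it is first opened, and later prints to the same name append. The scratch directory `split_unsorted_demo` and the `dir` variable are hypothetical names used only for this illustration.

```shell
#!/bin/sh
demo_dir=split_unsorted_demo   # hypothetical scratch directory
rm -rf "$demo_dir"
mkdir -p "$demo_dir/out"

# Unsorted sample: GC_00001105, then GC_00001290, then GC_00001105 again.
cat > "$demo_dir/input_file" <<'EOF'
>00000000|gene_cluster:GC_00001105|genome_name:r7534_20160316|gene_callers_id:24
>00000006|gene_cluster:GC_00001290|genome_name:r7534_20160316|gene_callers_id:76
>00000001|gene_cluster:GC_00001105|genome_name:r7537_20160321|gene_callers_id:78
EOF

# Each line goes to a file named after column 2, inside the output directory.
# The parentheses keep the concatenated file name unambiguous across awk versions.
awk -F'|' -v dir="$demo_dir/out" '{ print > (dir "/" $2) }' "$demo_dir/input_file"

# Show how many lines ended up in each output file.
grep -c '' "$demo_dir"/out/*
```

Both GC_00001105 lines should land in the same file even though they are not adjacent. The trade-off is that every output file stays open until awk exits.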

Answer 2

awk -F'|' '$2 != out{close(out); out=$2} {print > out}'

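If you don't close each output file, then once the number of simultaneously open files exceeds the limit, the awk script will either fail or slow down significantly, depending on your awk version. See, for example, the questions about the "awk too many output files 10" error when splitting SSL certificates, or awk-cannot-open-04477c9a875b80-csv-for-output-too-many-open-files.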
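
One caveat with closing files: if a column 2 value reappears after its file has been closed, `print > out` reopens that file and truncates it, so this approach assumes rows with the same value are contiguous. A sketch that sorts on column 2 first, so the assumption holds, might look like this (the directory `split_close_demo` and the variables `dir` and `prev` are hypothetical names for illustration):

```shell
#!/bin/sh
demo_dir=split_close_demo   # hypothetical scratch directory
rm -rf "$demo_dir"
mkdir -p "$demo_dir/out"

# Unsorted sample in the question's format.
cat > "$demo_dir/input_file" <<'EOF'
>00000000|gene_cluster:GC_00001105|genome_name:r7534_20160316|gene_callers_id:24
>00000006|gene_cluster:GC_00001290|genome_name:r7534_20160316|gene_callers_id:76
>00000001|gene_cluster:GC_00001105|genome_name:r7537_20160321|gene_callers_id:78
EOF

# Sort on column 2 so equal values are adjacent, then keep one file open at a time.
sort -t'|' -k2,2 "$demo_dir/input_file" |
awk -F'|' -v dir="$demo_dir/out" '
    $2 != prev {
        if (out != "") close(out)   # release the previous output file
        prev = $2
        out = dir "/" $2            # file name for the new column 2 value
    }
    { print > out }
'

# Show how many lines ended up in each output file.
grep -c '' "$demo_dir"/out/*
```

With only one output file open at any moment, the open-file limit should not be reached no matter how many distinct values column 2 contains.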
