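I have a file with a single column, and I want to split this column into multiple columns based on a specific string (chr). The number of items between the first and second occurrence of the string, between the second and third, and in general between any two consecutive occurrences, is irregular.
The input looks like this: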
chr10:127293562-127293909
BRUNOL4(Hs/Mm)
CPEB4(Hs/Mm)
CUG-BP(Hs/Mm)
DAZAP1(Hs/Mm)
ENOX1(Hs/Mm)
FMR1(Hs/Mm)
chr11:49214073-49214804
BRUNOL4(Hs/Mm)
BRUNOL5(Hs/Mm)
CPEB2(Hs/Mm)
CPEB4(Hs/Mm)
CUG-BP(Hs/Mm)
HNRNPC(Hs/Mm)
HNRNPCL1(Hs/Mm)
HNRNPH1(Hs/Mm)
HuR(Hs/Mm)
MBNL1(Hs/Mm)
NOVA1(Hs/Mm)
chr11:49854587-49855127
A1CF(Hs/Mm)
BRUNOL4(Hs/Mm)
The output should look like this:
chr10:127293562-127293909  chr11:49214073-49214804  chr11:49854587-49855127
BRUNOL4(Hs/Mm)             BRUNOL4(Hs/Mm)           A1CF(Hs/Mm)
CPEB4(Hs/Mm)               BRUNOL5(Hs/Mm)           BRUNOL4(Hs/Mm)
CUG-BP(Hs/Mm)              CPEB2(Hs/Mm)
DAZAP1(Hs/Mm)              CPEB4(Hs/Mm)
ENOX1(Hs/Mm)               CUG-BP(Hs/Mm)
FMR1(Hs/Mm)                HNRNPC(Hs/Mm)
                           HNRNPCL1(Hs/Mm)
                           HNRNPH1(Hs/Mm)
                           HuR(Hs/Mm)
                           MBNL1(Hs/Mm)
                           NOVA1(Hs/Mm)
Answer 1
$ csplit -zsf file -n 1 ip.txt /^chr/ {*} ; paste file* | column -nt
chr10:127293562-127293909  chr11:49214073-49214804  chr11:49854587-49855127
BRUNOL4(Hs/Mm)             BRUNOL4(Hs/Mm)           A1CF(Hs/Mm)
CPEB4(Hs/Mm)               BRUNOL5(Hs/Mm)           BRUNOL4(Hs/Mm)
CUG-BP(Hs/Mm)              CPEB2(Hs/Mm)
DAZAP1(Hs/Mm)              CPEB4(Hs/Mm)
ENOX1(Hs/Mm)               CUG-BP(Hs/Mm)
FMR1(Hs/Mm)                HNRNPC(Hs/Mm)
                           HNRNPCL1(Hs/Mm)
                           HNRNPH1(Hs/Mm)
                           HuR(Hs/Mm)
                           MBNL1(Hs/Mm)
                           NOVA1(Hs/Mm)
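csplit is used to split the file based on a pattern
-z removes empty output files (needed when the pattern matches the very first line)
-s suppresses the log output
-f file -n 1 makes the output file names start with file, followed by a one-digit suffix
ip.txt is the input file and /^chr/ is the pattern to split on
{*} repeats the split as many times as possible
paste is then used to join the split files column-wise
column -nt is used to style the paste output; -n prevents the merging of adjacent delimiters, which is the default behavior of column (-n is a GNU extension)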
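To make the steps above concrete, here is a minimal sketch that runs the csplit and paste stages in a scratch directory so the intermediate pieces can be inspected. The file name ip.txt comes from the answer; the temporary directory and the shortened sample data are assumptions made for illustration, and GNU csplit is assumed for -z and {*}. The final column step is left out because the meaning of -n differs between column implementations.

```shell
#!/bin/sh
# Work in a scratch directory so the glob file* only matches the split pieces.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Shortened sample input (assumption: trimmed from the question's data).
cat > ip.txt <<'EOF'
chr10:127293562-127293909
BRUNOL4(Hs/Mm)
CPEB4(Hs/Mm)
chr11:49214073-49214804
BRUNOL4(Hs/Mm)
BRUNOL5(Hs/Mm)
CPEB2(Hs/Mm)
chr11:49854587-49855127
A1CF(Hs/Mm)
EOF

# Start a new piece at every line beginning with "chr".
# -z drops the empty piece that would otherwise precede the first header.
csplit -zsf file -n 1 ip.txt '/^chr/' '{*}'

# One piece per block: file0, file1, file2.
ls file*

# Join the pieces side by side; paste separates the columns with a tab
# and leaves a field empty once a shorter piece has run out of lines.
paste file*
```

Because paste file* relies on a shell glob, any unrelated file in the working directory whose name starts with file would be pasted in as well, which is why the sketch switches to an empty directory first.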
Answer 2
With Perl, without any pipes:
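#!/usr/bin/env perl
use strict; use warnings;
# $arr holds one array of lines per block; $c is the index of the current block.
my $c = -1; my $arr = [];
while (<>) {
    $c++ if /^chr/;      # a line starting with "chr" opens a new block
    next if $c < 0;      # ignore anything before the first "chr" header
    chomp;
    push(@{ $arr->[$c] }, $_);
}
# Print the first two blocks side by side, one row per line of the second block.
foreach my $i (0 .. $#{ $arr->[1] }) {
    # An empty string stands in once the first block has run out of lines.
    printf("%-30s %s\n", $arr->[0]->[$i] // '', $arr->[1]->[$i]);
}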
Output
chr10:127293562-127293909      chr11:49214073-49214804
BRUNOL4(Hs/Mm)                 BRUNOL4(Hs/Mm)
CPEB4(Hs/Mm)                   BRUNOL5(Hs/Mm)
CUG-BP(Hs/Mm)                  CPEB2(Hs/Mm)
DAZAP1(Hs/Mm)                  CPEB4(Hs/Mm)
ENOX1(Hs/Mm)                   CUG-BP(Hs/Mm)
FMR1(Hs/Mm)                    HNRNPC(Hs/Mm)