I want to split a file into separate files based on the information in its first line. For example, I have:
Input:
1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 4 30 30 30 30
0 2 2 0 2 0 2 0 2 0 2 2 0 0 2 2 2 0 1 1 1 2 0 2 0 0 0 2 0 0 2 0 2
0 2 1 0 1 0 1 1 1 0 2 2 0 0 2 2 2 0 0 0 0 2 0 2 0 0 1 2 0 0 2 0 2
0 2 1 0 1 0 1 1 1 0 2 2 0 0 2 2 2 0 0 0 0 2 0 2 0 0 1 2 0 0 2 0 2
Desired output:
output1.txt
02202020
02101011
02101011
output2.txt
2022002
1022002
1022002
output3.txt
220111
220000
220000
output4.txt
202000200202
202001200202
202001200202
output30.txt
0202
0202
0202
Answer 1
$ awk '
NR == 1 {
    for (i=1; i<=NF; i++) {
        output[i] = "output" $i ".txt"
        files[output[i]] = 1
    }
    next
}
{
    for (i=1; i<=NF; i++) printf "%s", $i > output[i]
    for (file in files) print "" > file
}
' input.filename
$ for f in output*.txt; do echo $f; cat $f; done
output1.txt
02202020
02101011
02101011
output2.txt
2022002
1022002
1022002
output3.txt
220111
220000
220000
output30.txt
00202
00202
00202
output4.txt
2020002
2020012
2020012
Note that the header line has 32 fields while the other lines have 33. That needs to be fixed first.
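To find such a mismatch before running the script, a quick per-line field count (assuming the input is in input.filename, as above) shows which line is off:

```shell
# Print the field count of every line; the header count should match
# the data lines.
awk '{ print NR ": " NF " fields" }' input.filename
```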
Answer 2
A Perl script. Set the file name ($in) in place of genome.txt, or pass the name as an argument. Name the script counter.pl, give it executable permissions, then run it:
chmod 755 counter.pl
./counter.pl
or
chmod 755 counter.pl
./counter.pl genome.txt
counter.pl:
#!/usr/bin/perl
use strict;
use warnings;

my $in = $ARGV[0] || 'genome.txt'; # input file name
open (my $F, '<', $in) or die "Cannot open input file $!";

my $n = 0;
my %fd = (); # file handles keyed by column label
my @fd = (); # column label for each field position

while (<$F>) {
    # trim
    s/^\s+//;
    s/\s+$//;
    next if (!$_); # Skip empty lines
    my @x = split(/\s+/, $_);
    # 1st line, open files
    if ( ! $n++) {
        my $fd = 0;
        for (@x) {
            open ($fd{$_}, '>', "output$_.txt")
                or die ("Cannot open file $!")
                if (!exists($fd{$_}));
            $fd[$fd++] = $_;
        }
    }
    else { # Write data
        die ("Should have " . ($#fd+1) . " entries on line $n")
            if ($#x != $#fd);
        for (0 .. $#x) {
            print {$fd{$fd[$_]}} ($x[$_]);
        }
        print {$fd{$_}} ("\n") for (keys %fd);
    }
}
close $fd{$_} for (keys %fd);
close $F;
# the end
Fixed: the word count per line (sometimes 32, sometimes 33 in the example). This version accommodates any number of columns, but all data lines must have the same word count; it dies with an error (and the line number) if the counts differ, or if a file cannot be opened. Just adjust the file name ($in) as needed.
Input file (with the extra trailing 0 removed):
1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 4 30 30 30 30
0 2 2 0 2 0 2 0 2 0 2 2 0 0 2 2 2 0 1 1 1 2 0 2 0 0 0 2 0 2 0 2
0 2 1 0 1 0 1 1 1 0 2 2 0 0 2 2 2 0 0 0 0 2 0 2 0 0 1 2 0 2 0 2
0 2 1 0 1 0 1 1 1 0 2 2 0 0 2 2 2 0 0 0 0 2 0 2 0 0 1 2 0 2 0 2
output1.txt
02202020
02101011
02101011
output2.txt
2022002
1022002
1022002
output30.txt
0202
0202
0202
output3.txt
220111
220000
220000
output4.txt
2020002
2020012
2020012
Answer 3
OK, also for fun - a pure Bash version (as requested), relying heavily on the builtin read to send words into arrays and save them to files. The files get nicely formatted names, output001.txt .... output030.txt. Tested with the data file as modified by @ringO. Untested, but on very large files it may be more time- and resource-efficient than the others.
Data:
1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 4 30 30 30 30
0 2 2 0 2 0 2 0 2 0 2 2 0 0 2 2 2 0 1 1 1 2 0 2 0 0 0 2 0 2 0 2
0 2 1 0 1 0 1 1 1 0 2 2 0 0 2 2 2 0 0 0 0 2 0 2 0 0 1 2 0 2 0 2
0 2 1 0 1 0 1 1 1 0 2 2 0 0 2 2 2 0 0 0 0 2 0 2 0 0 1 2 0 2 0 2
Source:
#!/usr/bin/env bash
# genome : to sort genome data sets according to patterns of the first (header)
# line of the file. Data must be space delimited. No dependencies.
#
# Usage:
#
#     ./genome "data.txt"

# global arrays
sc=( )  # array of set element counts
sn=( )  # array of set id numbers

# output_file "set id"
# change the output pattern and digit output width as required - default
# pattern is output.txt and digit width of three : output000.txt
output_file(){
    # format concept: pattern000.txt
    local op='output.txt'  # output pattern
    local ow=3             # output width: 3 => 000
    printf "%s%0${ow}d.%s" "${op%%.*}" "$1" "${op##*.}"
}

# define_sets "input.txt"
# identify sets - get elements count and sets id numbers from file
# header.
define_sets(){
    # declare and initialize
    local a an b c n
    read -r c < "$1"
    read -r a b <<< "$c"
    n=0; sn=( $a )
    # recurse header, identify sets
    until [[ -z $b ]]
    do
        n=$((n+1))
        an=$a
        read -r a b <<< "$b"
        [[ $an == $a ]] || { sn+=( $a ); sc+=( $n ); n=0; }
    done
    n=$((n+1))
    sc+=( $n )
}

# reset_files
# optional function, clears file data, otherwise data is appended to existing
# output files.
reset_files(){
    for s in ${sn[@]}
    do
        > "$(output_file "$s")"
    done
}

# extract_data "input.txt"
# use defined sets to extract data from the input file and send it to required
# output files. Uses nested 'while read' to bypass file header as data is saved.
extract_data(){
    local a c n s fn da=( )
    while read -a da
    do
        while read -a da
        do
            a=0 n=0
            for s in ${sc[@]}
            do
                c="$(echo "${da[@]:$a:$s}")"                    # words => string
                echo "${c// /}" >> "$(output_file "${sn[$n]}")" # save
                n=$((n+1))
                a=$((a+s))
            done
        done
    done < "$1"
}

define_sets "$1"  # get data set structure from header
reset_files       # optional, clears and resets files
extract_data "$1" # get data from input file and save
# end file
Data output:
$ cat output001.txt
02202020
02101011
02101011
$ cat output002.txt
2022002
1022002
1022002
$ cat output003.txt
220111
220000
220000
$ cat output004.txt
2020002
2020012
2020012
$ cat output030.txt
0202
0202
0202
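The zero-padded names above come from the %0<width>d format that output_file passes to printf; a minimal illustration of the same formatting:

```shell
# printf reuses its format string for each extra argument,
# zero-padding each number to a width of 3.
printf 'output%03d.txt\n' 1 30
# prints:
# output001.txt
# output030.txt
```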
Answer 4
Just for fun, yet another solution:
awk '{ for (i=1; i<=NF;i++){
if (NR==1) { file[i]=$i }
if (NR!=1) { f="output" file[i] ".txt";
g="output" file[i+1] ".txt";
printf("%s%s",$i,f==g?OFS:ORS)>>f;
close(f);
}
}
}' file
If you need the fields without a separator, change ?OFS: to ?"": .
The default file that receives unpaired values is output.txt. That file receives values if the column count of the first line does not match that of a later line; if everything is correct, it should be empty. If it still exists after the script runs, something is wrong somewhere.
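A hypothetical post-run check (not part of the answer) that turns the leftover output.txt into an explicit warning:

```shell
# [ -s FILE ] is true only if FILE exists and is non-empty,
# so a non-empty output.txt signals a header/data column mismatch.
if [ -s output.txt ]; then
    echo "column mismatch: unpaired values landed in output.txt"
else
    echo "ok: header and data columns match"
fi
```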