How can I split a file into separate files based on the column headers in the original file?

I want to split a file into different files based on the information in its first line. For example, I have:

Input:

1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 4 30 30 30 30
0 2 2 0 2 0 2 0 2 0 2 2 0 0 2 2 2 0 1 1 1 2 0 2 0 0 0 2 0 0 2 0 2
0 2 1 0 1 0 1 1 1 0 2 2 0 0 2 2 2 0 0 0 0 2 0 2 0 0 1 2 0 0 2 0 2
0 2 1 0 1 0 1 1 1 0 2 2 0 0 2 2 2 0 0 0 0 2 0 2 0 0 1 2 0 0 2 0 2

Expected output:

output1.txt

02202020
02101011
02101011

output2.txt

2022002
1022002
1022002

output3.txt

220111
220000
220000

output4.txt

202000200202
202001200202
202001200202

output30.txt

0202
0202
0202

Answer 1

$ awk '
    NR == 1 {
        for (i=1; i<=NF; i++) {
            output[i] = "output" $i ".txt"
            files[output[i]] = 1
        }
        next
    }
    {
        for (i=1; i<=NF; i++)  printf "%s", $i > output[i]
        for (file in files)    print ""        > file
    }
' input.filename

$ for f in output*.txt; do echo $f; cat $f; done
output1.txt
02202020
02101011
02101011
output2.txt
2022002
1022002
1022002
output3.txt
220111
220000
220000
output30.txt
00202
00202
00202
output4.txt
2020002
2020012
2020012

Note that the header line has 32 fields while the other lines have 33. That mismatch needs to be fixed first.
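
A quick way to confirm the mismatch before splitting is to compare each line's field count with the header's. This is only a minimal sketch: it assumes the question's data is saved under the name input.filename used in the command above, and the temporary directory and the `report` variable are hypothetical names added just to keep the example self-contained.

```shell
# Scratch directory so no existing file is touched.
dir=$(mktemp -d) || exit 1

# The question's data: 32 header fields, 33 fields on each data line.
cat > "$dir/input.filename" <<'EOF'
1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 4 30 30 30 30
0 2 2 0 2 0 2 0 2 0 2 2 0 0 2 2 2 0 1 1 1 2 0 2 0 0 0 2 0 0 2 0 2
0 2 1 0 1 0 1 1 1 0 2 2 0 0 2 2 2 0 0 0 0 2 0 2 0 0 1 2 0 0 2 0 2
0 2 1 0 1 0 1 1 1 0 2 2 0 0 2 2 2 0 0 0 0 2 0 2 0 0 1 2 0 0 2 0 2
EOF

# Remember the header's field count, then report every line that differs.
report=$(awk '
    NR == 1 { n = NF; next }
    NF != n { printf "line %d: %d fields, header has %d\n", NR, NF, n }
' "$dir/input.filename")

printf '%s\n' "$report"
```

On the question's data this reports lines 2 to 4, each with 33 fields against 32 in the header. An empty report means every line matches the header.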

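Answer 2

A Perl script.

Set the file name in $in in place of genome.txt, or pass the name as an argument.

Name the script counter.pl, make it executable, and then run it as ./counter.pl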

chmod 755 counter.pl
./counter.pl

or

chmod 755 counter.pl
./counter.pl genome.txt

counter.pl:

#!/usr/bin/perl

use strict;
use warnings;

my $in = $ARGV[0] || 'genome.txt'; # input file name

open (my $F, '<', $in) or die "Cannot open input file $!";
my $n = 0;
my %fd = ();
my @fd = ();

while (<$F>) {
        # trim
        s/^\s+//;
        s/\s+$//;
        next if $_ eq ''; # Skip empty lines (a "!$_" test would also skip a line that is just "0")
        my @x = split(/\s+/, $_);
        # 1st line, open files
        if ( ! $n++)  {
           my $fd = 0;
           for (@x) {
              open ($fd{$_}, '>', "output$_.txt") 
                or die ("Cannot open file $!")
                  if (!exists($fd{$_}));
              $fd[$fd++] = $_;
           }
        }
        else { # Write data
           die ("Should have " . ($#fd+1) . " entries on line $n")
             if ($#x != $#fd);
           for (0 .. $#x) {
              print {$fd{$fd[$_]}} ($x[$_]);
           }
           print {$fd{$_}} ("\n") for (keys %fd);
        }
}

close $fd{$_} for (keys %fd);
close $F;
# the end

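Fixed the number of words per line (in the example it was sometimes 32 and sometimes 33).

This version accommodates any variation in the columns, but all lines must have the same number of words. It stops with an error (the die lines) if the word count differs or if a file cannot be opened.

Just adjust the file name ($in).

Input file (with the extra 0 near the end removed):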

1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 4 30 30 30 30
0 2 2 0 2 0 2 0 2 0 2 2 0 0 2 2 2 0 1 1 1 2 0 2 0 0 0 2 0 2 0 2
0 2 1 0 1 0 1 1 1 0 2 2 0 0 2 2 2 0 0 0 0 2 0 2 0 0 1 2 0 2 0 2
0 2 1 0 1 0 1 1 1 0 2 2 0 0 2 2 2 0 0 0 0 2 0 2 0 0 1 2 0 2 0 2

output1.txt

02202020
02101011
02101011

output2.txt

2022002
1022002
1022002

output30.txt

0202
0202
0202

output3.txt

220111
220000
220000

output4.txt

2020002
2020012
2020012

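Answer 3

OK, also for fun: a pure Bash version (as requested) that relies heavily on Bash builtins to read words into arrays and save them to files. The files are neatly named output001.txt ... output030.txt. It was tested with the data file as modified by @ringO. Its performance has not been measured, but on very large files it may be more economical in time and resources than the other answers.

Data: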

1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 4 30 30 30 30
0 2 2 0 2 0 2 0 2 0 2 2 0 0 2 2 2 0 1 1 1 2 0 2 0 0 0 2 0 2 0 2
0 2 1 0 1 0 1 1 1 0 2 2 0 0 2 2 2 0 0 0 0 2 0 2 0 0 1 2 0 2 0 2
0 2 1 0 1 0 1 1 1 0 2 2 0 0 2 2 2 0 0 0 0 2 0 2 0 0 1 2 0 2 0 2

Source:

#!/usr/bin/env bash

# genome : to sort genome data sets according to patterns of the first (header)
# line of the file.  Data must be space delimited.  No dependencies.
#
# Usage:
#
#                      ./genome "data.txt" 

# global arrays
sc=(  )             # array of set element counts
sn=(  )             # array of set id numbers

# output_file "set id"

# change the output pattern and digit output width as required - default
# pattern is output.txt and digit width of three : output000.txt
output_file(){
    # format concept: pattern000.txt
    local op='output.txt'     # output pattern
    local ow=3                # output width: 3 => 000
    printf "%s%0${ow}d.%s" "${op%%.*}" "$1" "${op##*.}"
}

# define_sets "input.txt"

# identify sets - get elements count and sets id numbers from file
# header.
define_sets(){
    # declare and initialize
    local a an b c n
    read -r c < "$1"
    read -r a b <<< "$c"
    n=0; sn=( $a )

    # recurse header, identify sets
    until [[ -z $b ]]
    do
        n=$((n+1))
        an=$a
        read -r a b <<< "$b"
        [[ $an == $a ]] || { sn+=( $a ); sc+=( $n ); n=0; }
    done
    n=$((n+1))
    sc+=( $n )
}

# reset_files

# optional function, clears file data, otherwise data is appended to existing
# output files.
reset_files(){
    for s in ${sn[@]}
    do
        > "$(output_file "$s")"
    done
}

# extract_data "input.txt"

# use defined sets to extract data from the input file and send it to required
# output files. Uses nested 'while read' to bypass file header as data is saved.
extract_data(){
    local a c n s fn da=( )
    while read -a da
    do
        while read -a da
        do
            a=0 n=0
            for s in ${sc[@]}
            do
                c="$(echo "${da[@]:$a:$s}")" # words => string
                echo "${c// /}" >> "$(output_file "${sn[$n]}")"  # save
                n=$((n+1))
                a=$((a+s))
            done
        done
    done < "$1"
}

define_sets "$1"    # get data set structure from header
reset_files         # optional, clears and resets files
extract_data "$1"   # get data from input file and save

# end file

Data output:

$ cat output001.txt 
02202020
02101011
02101011

$ cat output002.txt 
2022002
1022002
1022002

$ cat output003.txt 
220111
220000
220000

$ cat output004.txt 
2020002
2020012
2020012

$ cat output030.txt 
0202
0202
0202
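
The zero-padded names come from the %0…d format used in output_file. The sketch below isolates that one step to show why the padding helps: unpadded names sort lexically, which is why output30.txt is listed before output4.txt in Answer 1. The ids are taken from the sample header; the variable names and the use of `LC_ALL=C sort` to get a predictable order are assumptions made for this illustration.

```shell
# printf reuses its format for each argument, so one call names every set.
# Unpadded: lexical order puts output30.txt before output4.txt.
unpadded=$(printf 'output%d.txt\n' 1 2 3 4 30 | LC_ALL=C sort)

# Width three with zero fill: lexical order and numeric order now agree.
padded=$(printf 'output%03d.txt\n' 1 2 3 4 30 | LC_ALL=C sort)

printf '%s\n' "$unpadded"
printf '%s\n' "$padded"
```

The first list ends with output30.txt followed by output4.txt, while the second ends with output004.txt followed by output030.txt.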

Answer 4

Just for fun, here is another solution:

awk '{ for (i=1; i<=NF;i++){
          if (NR==1) { file[i]=$i }
          if (NR!=1) { f="output" file[i]   ".txt";
                       g="output" file[i+1] ".txt";
                       printf("%s%s",$i,f==g?OFS:ORS)>>f;
                       close(f);
                      }
          }
      }' file

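If you need the fields unseparated, change ?OFS: to ?"":.

The default file that receives unpaired values is output.txt. It receives values whenever the number of columns in the first line does not match that of a later line being processed. If everything is correct, it should be empty. If it still exists after the script has run, something is wrong somewhere.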
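
The leftover-file check can be automated. The following is a rough sketch rather than the answer's exact script: it uses the unseparated "" variant, a made-up four-column input whose last line carries one extra field, and hypothetical names (`dir`, `status`). It works in a temporary directory because the script appends with >>, so stale output files would otherwise skew the result.

```shell
dir=$(mktemp -d) || exit 1

# Header with 4 fields, one well-formed data line, and one data line
# that carries an extra fifth field.
printf '%s\n' '1 1 2 2' '0 2 1 0' '0 2 1 0 9' > "$dir/file"

(
    cd "$dir" || exit 1
    awk '{ for (i = 1; i <= NF; i++) {
              if (NR == 1) { file[i] = $i }
              else {
                  f = "output" file[i]   ".txt"
                  g = "output" file[i+1] ".txt"
                  # End the line when the next column belongs to another file.
                  printf("%s%s", $i, (f == g ? "" : ORS)) >> f
                  close(f)
              }
          }
        }' file
)

# output.txt is only created when a line has more fields than the header.
if [ -s "$dir/output.txt" ]; then
    status="mismatch: $(cat "$dir/output.txt")"
else
    status="ok"
fi
echo "$status"
```

With this input the script prints mismatch: 9, because the fifth field on the last line has no header column to pair with. Dropping that extra field leaves no output.txt behind and the script prints ok.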
