Summing based on specific columns

Answer 1

Another Perl solution, similar to @terdon's answer, but with better-formatted output:

$ perl -alne '
    (print && next) if $. == 1;   
    $h{"@F[0..3]"}{s} += $F[4];
    $h{"@F[0..3]"}{t}  = $F[5];
    END {
        for (keys %h) {
            printf "%-4s%-4s%-4s%-4s%-4s%-4s",split(" ",$_),$h{$_}{s},$h{$_}{t};                        
            printf "\n";
        }
    }' file
c1  c2  c3  c4  c5  c6
A   B   E   F   13  S   
A   B   C   D   9   s   
C   D   E   F   9   S

Answer 2

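On the choice of tools: typically, the more specialized a tool is, the faster it is. So pipelines built from tr, cut, grep, sort and the like tend to be faster than sed, which tends to be faster than awk, which in turn tends to be faster than perl, python or ruby. Of course, this also depends heavily on the task. If you read that Perl is faster, then either you misread or the comparison was against a shell loop that processes one line at a time (which will definitely be slow for files with millions of lines).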

If your input is in a form where the lines to be merged are consecutive, then awk is a good choice (there is no sane way to perform addition in sed).

awk -v OFS='\t' '                      # use tabs to separate output fields
    NR==1 {print; next}                # keep the first line intact
    function flush() {                 # function to print a completed sum
        if (previous != "") print previous, sum, more
        sum=0
    }
    {key = $1 OFS $2 OFS $3 OFS $4}    # break out the comparison key
    key!=previous {flush()}            # if the comparison key has changed, print the accumulated sum
    {previous=key; sum+=$5; more=$6}   # save the current line
    END {flush()}                      # print the last group
'

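If the lines are not consecutive, you can make them so by sorting. Typical sort implementations are highly optimized and faster than manipulating data structures in a high-level language.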

sort | awk …

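This assumes that your column separators are consistent, e.g. always a tab. If they are not, either preprocess the input so that they are, or use sort -k1,1 -k2,2 -k3,3 -k4,4 to compare these specific fields regardless of the separators.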
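
As a rough sketch of the whole pipeline, assuming whitespace-separated input with exactly one header line: the header is read and printed first so that sort only sees data rows, and the awk stage is a compact variant of the script above. The function name sum_groups and the sample rows are made up for illustration and do not come from the original question.

```shell
# Hypothetical helper: sum column 5 per (c1..c4) group; input on stdin.
sum_groups() {
    {
        # Pass the header through untouched so sort does not move it.
        IFS= read -r header && printf '%s\n' "$header"
        # Make rows with equal keys adjacent.
        sort -k1,1 -k2,2 -k3,3 -k4,4
    } | awk -v OFS='\t' '
        NR == 1 { print; next }                    # header line
        { key = $1 OFS $2 OFS $3 OFS $4 }          # comparison key
        key != previous {                          # a new group starts
            if (NR > 2) print previous, sum, more
            sum = 0
        }
        { previous = key; sum += $5; more = $6 }   # accumulate current row
        END { if (NR > 1) print previous, sum, more }
    '
}

# Made-up sample data, not taken from the original question.
printf '%s\n' \
    'c1 c2 c3 c4 c5 c6' \
    'A B C D 4 s' \
    'C D E F 9 S' \
    'A B E F 13 S' \
    'A B C D 5 s' | sum_groups
```

With this sample the two A B C D rows end up adjacent after sorting and are merged into a single row with sum 9. The header is printed as it was read, while the data rows are re-joined with tabs.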

Answer 3

This may help you get started:

perl -ane '$h{"@F[0 .. 3]"} += $F[4] }{ print "$_ $h{$_}\n" for keys %h' input-file

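It doesn't print the last column, because you didn't specify what to do with it. It also doesn't handle the header line correctly, but that should be easy to fix.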

Answer 4

If I understand correctly, you want something like this:

$ perl -lane 'if($.>1){$k{"@F[0..3]"}{sum}+=$F[4]; $k{"@F[0..3]"}{last}=$F[5]}
              else{print "@F"}
              END{
                foreach (keys(%k)){ print "$_ $k{$_}{sum} $k{$_}{last}"}
              }' file
c1 c2 c3 c4 c5 c6
C D E F 9 S
A B E F 13 S
A B C D 9 s

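This does not keep your columns aligned; I don't know whether that is a problem for you. It does, however, handle the header correctly and produce the output you need.

Explanation

perl -lane: -l removes the trailing newline from each input line and adds one to each print statement. -a splits each input line into fields on whitespace and saves those fields in the array @F. -n means read the input file line by line and apply the script given by -e.

Here is the same one-liner as a commented script: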

#!/usr/bin/env perl

## This is the equivalent of perl -ne
## in the one-liner. It iterates through
## the input file.
while (<>) {

    ## This is what the -a flag does: split(' ') splits on
    ## whitespace and ignores leading whitespace.
    my @F=split(' ');
    ## $. is the current line number.
    ## This simply tests whether we are on the
    ## first line or not.
    if ($.>1) {
        ## @F[0..3] is an array slice. It holds fields 1 through 4.
        ## The slice is used as a key for the hash %k and the 5th
        ## field is summed to $k{slice}{sum} while the last column is
        ## saved as $k{slice}{last}.
        $k{"@F[0..3]"}{sum}+=$F[4]; $k{"@F[0..3]"}{last}=$F[5];
    }

    ## If this is the first line, print the fields.
    ## I am using print "@F" instead of a simple print
    ## so that all lines are formatted in the same way.
    else {
        print "@F\n";
    }
}

## This is the same as the END{} block
## in the one liner. It will be run after
## the whole file has been read.

## For each of the keys of the hash %k
foreach (keys(%k)){ 
    ## Print the key ($_, a special variable in Perl),
    ## the value of $k{$key}{sum} (the summed values),
    ## and the last column.
    print "$_ $k{$_}{sum} $k{$_}{last}\n"
}
