Aggregating and grouping a text file in Perl or bash

Answer 1

In Perl:

perl -F';' -lane 'push @{$h{join ";",@F[0..2]}},$F[3];
                  END{
                    for(sort keys %h){
                        print "$_: ". join ",",@{$h{$_}};
                    }
                  }' your_file

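You should be able to do something similar in awk using associative arrays, but I am not very well versed in awk, so I can't contribute actual code.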
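For comparison, the awk idea mentioned above might look roughly like this. It is only a sketch and not part of the original answer: the function name `group_by_app_id` is a placeholder, and because POSIX awk has no arrays of arrays, the user IDs are accumulated as a comma-separated string instead of a list.

```shell
#!/bin/sh
# Sketch: group the 4th field by the first three fields with an awk
# associative array. group_by_app_id is a hypothetical helper name.
group_by_app_id() {
  awk -F';' '
    {
      key = $1 ";" $2 ";" $3          # AppID is the first three fields
      if (key in ids)
        ids[key] = ids[key] "," $4    # later user IDs: append after a comma
      else
        ids[key] = $4                 # first user ID seen for this AppID
    }
    END {
      for (key in ids)                # awk iterates in unspecified order
        print key ": " ids[key]
    }
  ' "$@" | LC_ALL=C sort              # sort the output lines bytewise
}
```

It can be called as `group_by_app_id your_file`, or with the data on standard input. Unlike the Perl version, the sorting is applied to the finished output lines rather than to the keys themselves.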

Explanation

Here is an expanded version of the code above that uses as little "magic" as possible:

open($FH,"<","your_file") or die "Cannot open your_file: $!";
while($line=<$FH>){ # For each line in the file (accomplished by -n)
    chomp $line; # Remove the newline at the end (done by -l)
    # The ; is set by -F and storing the split in @F done by -a
    @F = split /;/,$line; # Split the line into fields on ;
    $app_id = join ";",@F[0..2]; # AppID is the first 3 fields
    push @{$h{$app_id}},$F[3]; # The 4th field is added onto the hash
} # The whole file has been read at this point.
foreach $key (sort keys %h){ # Sort the hash by AppID
     # The parentheses stop "\n" from being concatenated to the array itself
     print "$key: " . join(",",@{$h{$key}}) . "\n"; # Print the array values
     # The newline ("\n") added at the end is also done by -l
}

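Now the only thing left that needs a more detailed explanation is the push statement:

push is normally used to add an element to an array variable. For example: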
```
push @a,$x
```
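appends the contents of the variable $x to the array @a.
The loop that reads the file line by line populates the hash %h. The keys of the hash are the AppIDs, and the value for each key is an array containing all the user IDs associated with that AppID. This is an anonymous array (it has no name); in Perl it is implemented as an array reference (somewhat like a C pointer). Since the value in %h that corresponds to the AppID $app_id is written $h{$app_id}, wrapping it in the Perl array sigil (@{...}) treats the hash value as an array (dereferencing the array reference) and pushes the current user ID onto it. Perl creates the array automatically the first time a given AppID is seen.
An alternative that may feel less "Perlish" is to concatenate the fourth field onto the current value: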
```
while(...) { ... $h{$app_id} = defined $h{$app_id} ? $h{$app_id} . ",$F[3]" : $F[3] }
foreach $key (sort keys %h) { print "$key: $h{$key}\n" }
```
where . is the string concatenation operator in Perl.

Note that in the explanatory code I left out the perl -e '...' wrapper so that syntax highlighting can reach the code and make it more readable.

Answer 2

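Since you state that the file is sorted, couldn't you use a simple loop that only keeps the previous appId string in memory? It is somewhat like @Qeole's sed approach, but it avoids the overhead of regular expressions by using the field splitting of the shell's read plus a string comparison: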

#!/bin/bash

appId=""

while IFS=\; read -r s1 s2 s3 userId; do
  if [ "$s1;$s2;$s3" == "$appId" ]; then
    printf ', %s' "$userId"
  else
    appId="$s1;$s2;$s3"
    printf '\n%s:%s' "$appId" "$userId"
  fi
done < yourfile
printf '\n'

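Note: this prints an extra newline at the start of the output, but that can be prevented with minimal extra complexity. Bash should be reasonably fast for this kind of thing, but if it isn't, you can reimplement the loop in almost any similar scripting language.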
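One way to avoid the leading newline mentioned in the note is to print the line break before every group except the first. The following is a minimal sketch along those lines, assuming the same sorted, semicolon-separated input; the function name `group_sorted` and the `first` flag are illustrative additions, not part of the original answer.

```shell
#!/bin/bash
# Sketch: same loop as above, but without the leading blank line.
# group_sorted is a hypothetical name; it reads sorted input on stdin.
group_sorted() {
  appId=""
  first=1
  while IFS=';' read -r s1 s2 s3 userId; do
    if [ "$s1;$s2;$s3" = "$appId" ]; then
      printf ', %s' "$userId"             # same AppID: extend the current line
    else
      [ "$first" -eq 1 ] || printf '\n'   # close the previous group, if any
      first=0
      appId="$s1;$s2;$s3"
      printf '%s:%s' "$appId" "$userId"   # start a new group
    fi
  done
  [ "$first" -eq 1 ] || printf '\n'       # final newline only if output exists
}
```

Usage would be `group_sorted < yourfile`. The flag costs one extra test per new group, which should be negligible next to the cost of `read` itself.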

Answer 3

With sed:

sed 's/;/:\t/3;H;1h;x
s/^\(\([^:]*\):.*\)\n\2/\1/
/\n/P;//g;h;$!d' <input |
tr : \\n

This prints:

44a934ca4052b34e70f9cb03f3399c6f065becd0;bf038823f9633d25034220b9f10b68dd8c16d867;309
        8ead5b3e0af5b948a6b09916bd271f18fe2678aa
        a21245497cd0520818f8b14d6e405040f2fa8bc0
5c3eb56d91a77d6ee5217009732ff421e378f298;200000000000000001000000200000,6fd299187a5c347fe7eaab516aca72295faac2ad,e25ba62bbd53a72beb39619f309a06386dd381d035de372c85d70176c339d6f4;16
        337556fc485cd094684a72ed01536030bdfae5bb
        382f3aaa9a0347d3af9b35642d09421f9221ef7d
        396529e08c6f8a98a327ee28c38baaf5e7846d14

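You can drop the tr to keep each group on a single line; in that case the ids will be separated by colons. You may also need to replace the \t escape in the first line with a literal <tab> character, or you can remove the tabs entirely. They only make the output more readable (in my opinion) and are not essential to how the regular expression works.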

Answer 4

A sed answer:

sed ': l;N;s/^\([^;]\+;[^;]\+;[^;:]\+\)[;:] *\(.*\)\n\1;\(.*\)/\1: \2, \3/;tl;P;D' input_file.txt

The file is read only once, so performance shouldn't be too bad, but I can't tell you much more than that.

Details:

sed ': l;        # Label l

     N;          # Add next line of input to pattern space

     s/^\([^;]\+;[^;]\+;[^;:]\+\)[;:] *\(.*\)\n\1;\(.*\)/\1: \2, \3/;
                 # If two lines in pattern space start with same AppID, then
                 # take user ID and append it to first line, then delete second line

         tl;     # If previous substitution succeeded, i.e. we scanned two lines with 
                 # same AppID, then loop to label l. Else go on…

     P;          # Print first line from pattern space (here there should be two lines
                 # in pattern space, starting with a different AppID)

     D;          # Delete first line of pattern space; start script again with
                 # remaining text in pattern space, or next input line if pattern
                 # space is empty
    ' input_file.txt

(I don't know about potential limits on line length, though, sorry.)
