Remove duplicate rows based on a column value

Answer 1

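Since the input appears to be grouped/sorted by the second column already, this should be quite simple and does not require keeping and sorting the entire data set in memory; only two records need to be held at a time.¹

I first thought of an Awk solution but found it too clumsy for dealing with arrays and non-blank field delimiters. I then settled on a short Python program: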

#!/usr/bin/python3
import sys
DELIMITER = ','

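def remove_duplicates(records):
    # The input must be grouped by the second column. For each group,
    # keep only the record with the largest value in the fourth column.
    prev = None
    for r in records:
        r = (int(r[0]), int(r[1]), int(r[2]), float(r[3]), int(r[4]))
        if prev is None:
            prev = r        # first record of the input
        elif r[1] != prev[1]:
            yield prev      # key changed: emit the best record of the finished group
            prev = r
        elif r[3] > prev[3]:
            prev = r        # same key, larger value: replace the candidate
    if prev is not None:
        yield prev          # emit the best record of the last group

def main():
    for r in remove_duplicates(
        line.rstrip('\n').split(DELIMITER) for line in sys.stdin
    ):
        print(*r, sep=DELIMITER)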

if __name__ == '__main__':
    main()

On my system it has a throughput of about 250,000 records, or 5 MB, per second per CPU.

Usage

python3 remove-duplicates.py < input.txt > output.txt

The program cannot deal with a column header line, so you need to strip it off:

tail -n +2 < input.txt | python3 remove-duplicates.py > output.txt

If you want to add it back to the result:

{ read -r header && printf '%s\n' "$header" && python3 remove-duplicates.py; } < input.txt > output.txt
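
As an alternative to stripping and re-adding the header in the shell, the same single-pass idea can carry the header through inside Python. The following is only a sketch and not the program above: the name dedupe_keep_max and its parameters are made up for illustration, it assumes that the first line is a header and that the input is grouped by the second column, and it passes the original line text through unchanged instead of converting every field.

```python
def dedupe_keep_max(lines, delimiter=",", key_col=1, value_col=3):
    """Yield the header, then for each key group the line with the largest value.

    Assumption: the input is already grouped by key_col.
    """
    lines = iter(lines)
    header = next(lines, None)
    if header is None:
        return  # empty input: nothing to emit
    yield header

    best_line = best_key = best_value = None
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        key, value = fields[key_col], float(fields[value_col])
        if best_line is not None and key != best_key:
            yield best_line  # key changed: the previous group is complete
            best_line = None
        if best_line is None or value > best_value:
            best_line, best_key, best_value = line, key, value
    if best_line is not None:
        yield best_line  # flush the final group


# Small in-memory sample; the rows inside a group are deliberately unordered.
sample = [
    "storm_id,Cell_id,Windspeed,Storm_Surge,-1\n",
    "2,10482422,45,0.18,-1\n",
    "2,10482422,45,0.4,-1\n",
    "2,10482422,45,0.06,-1\n",
    "2,10482423,45,0.43,-1\n",
    "2,10482423,45,0.15,-1\n",
]
result = list(dedupe_keep_max(sample))
```

To use it as a filter, one could pass sys.stdin to the function and write the yielded lines to sys.stdout. Like the program above, it never holds more than the current line and the best line of the current group.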

¹ This is the advantage over waltinator's and steeldriver's approaches for data sets that do not fit into main memory.

Answer 2

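If you had them sorted in decreasing order of the 4th field, you could simply take the first occurrence of each 2nd field value using an associative array or hash, e.g. awk -F, '!seen[$2]++' file or perl -F, -ne 'print $_ unless $seen{$F[1]}++'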
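
For readers less familiar with the !seen[$2]++ idiom, the same first-occurrence logic might look like this in Python. This is a sketch under the same assumption as the one-liners (rows sorted in decreasing order of the 4th field within each key); the function name first_per_key is invented for illustration.

```python
def first_per_key(lines, delimiter=",", key_col=1):
    """Yield only the first line seen for each distinct value of key_col."""
    seen = set()  # plays the role of the awk/perl "seen" hash
    for line in lines:
        key = line.rstrip("\n").split(delimiter)[key_col]
        if key not in seen:
            seen.add(key)
            yield line


# Sample sorted in decreasing order of the 4th field within each Cell_id.
sample = [
    "storm_id,Cell_id,Windspeed,Storm_Surge,-1\n",
    "2,10482422,45,0.4,-1\n",
    "2,10482422,45,0.18,-1\n",
    "2,10482422,45,0.06,-1\n",
    "2,10482423,45,0.43,-1\n",
    "2,10482423,45,0.15,-1\n",
]
result = list(first_per_key(sample))
```

The header passes through because its key ("Cell_id") is simply the first key ever seen. As with the hash-based one-liners, memory use grows with the number of distinct keys.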

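With the values in increasing order, doing it in an efficient single pass is a little trickier. You can do it (with a little setup) by printing the previous line each time the key value changes: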

awk -F, '
  NR==1 {print; next}                 # print the header line
  NR==2 {key=$2; lastval=$0; next}    # initialize the comparison and remember the first data line
  $2 != key {
    print lastval; key = $2           # print the last (largest) value of the previous key group
  }
  {lastval = $0}                      # save the current line
  END {if (NR > 1) print lastval}     # print the final group, if there was any data
' file
storm_id,Cell_id,Windspeed,Storm_Surge,-1
2,10482422,45,0.4,-1
2,10482423,45,0.43,-1
2,10482424,45,0.49,-1
2,10482425,45,0.52,-1
2,10482426,45,0.64,-1
2,10482427,45,0.73,-1
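
The "print the last line of each key group" idea can also be expressed with itertools.groupby from the Python standard library. This is a sketch rather than a translation of the awk script; last_per_group is a hypothetical name, and it assumes, as above, that the rows of each key are adjacent and in increasing order of the 4th field.

```python
from itertools import groupby


def last_per_group(lines, delimiter=",", key_col=1):
    """Yield the last line of each run of consecutive lines sharing key_col."""
    def key_of(line):
        return line.rstrip("\n").split(delimiter)[key_col]

    for _, group in groupby(lines, key=key_of):
        last = None
        for last in group:  # exhaust the group, remembering only its final line
            pass
        yield last


# Sample grouped by Cell_id, with increasing 4th field inside each group.
sample = [
    "storm_id,Cell_id,Windspeed,Storm_Surge,-1\n",
    "2,10482422,45,0.06,-1\n",
    "2,10482422,45,0.18,-1\n",
    "2,10482422,45,0.4,-1\n",
    "2,10482423,45,0.15,-1\n",
    "2,10482423,45,0.43,-1\n",
]
result = list(last_per_group(sample))
```

The header forms a group of its own, so it is emitted first without special handling. Only one line is held at a time, so memory use does not depend on the number of keys.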

Answer 3

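If you don't have too many unique Cell_ids, you can keep track of the ones you have already seen in a Perl associative array. If you do have too many (and my Perl script runs out of memory), write a C program that keeps the unique Cell_ids in a bit field. Here is the Perl.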

#!/usr/bin/perl -w
use strict;
my %seen = ();          # key=Cell_ID, value=1
my @cols=();            # for splitting input

while( <> ) {           # read the files named on the command line, or STDIN
  @cols = split ',',$_;
  next if ( defined $seen{$cols[1]}); # skip if we already saw this Cell_Id
  $seen{$cols[1]} = 1;
  print;
}

Here is my test:

walt@bat:~(0)$ cat u.dat
storm_id,Cell_id,Windspeed,Storm_Surge,-1
2,10482422,45,0.06,-1
2,10482422,45,0.18,-1
2,10482422,45,0.4,-1
2,10482423,45,0.15,-1
2,10482423,45,0.43,-1
2,10482424,45,0.18,-1
2,10482424,45,0.49,-1
2,10482425,45,0.21,-1
2,10482425,45,0.52,-1
2,10482426,45,0.27,-1
2,10482426,45,0.64,-1
2,10482427,45,0.09,-1
2,10482427,45,0.34,-1
2,10482427,45,0.73,-1
walt@bat:~(0)$ perl ./unique.pl u.dat
storm_id,Cell_id,Windspeed,Storm_Surge,-1
2,10482422,45,0.06,-1
2,10482423,45,0.15,-1
2,10482424,45,0.18,-1
2,10482425,45,0.21,-1
2,10482426,45,0.27,-1
2,10482427,45,0.09,-1
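
The bit-field fallback mentioned above is suggested as a C program; purely to illustrate the idea, here is a rough Python sketch using a bytearray as the bit field. Everything in it is an assumption: the class SeenBits, the function unique_cells, and in particular the premise that the Cell_ids are integers within a known range (base and size are hypothetical parameters that would have to be chosen from the real data).

```python
class SeenBits:
    """One bit per integer ID in the range [base, base + size)."""

    def __init__(self, base, size):
        self.base = base
        self.size = size
        self.bits = bytearray((size + 7) // 8)

    def test_and_set(self, ident):
        """Return True if ident was seen before; mark it as seen either way."""
        offset = ident - self.base
        if not 0 <= offset < self.size:
            raise ValueError(f"ID {ident} is outside the assumed range")
        byte, mask = offset >> 3, 1 << (offset & 7)
        was_set = bool(self.bits[byte] & mask)
        self.bits[byte] |= mask
        return was_set


def unique_cells(lines, seen, delimiter=",", key_col=1):
    """Yield the first line for each Cell_id, like the Perl script above."""
    for line in lines:
        field = line.split(delimiter)[key_col]
        if not field.isdigit():
            yield line  # non-numeric key: assumed to be the header line
        elif not seen.test_and_set(int(field)):
            yield line


sample = [
    "storm_id,Cell_id,Windspeed,Storm_Surge,-1\n",
    "2,10482422,45,0.06,-1\n",
    "2,10482422,45,0.18,-1\n",
    "2,10482422,45,0.4,-1\n",
    "2,10482423,45,0.15,-1\n",
    "2,10482423,45,0.43,-1\n",
]
result = list(unique_cells(sample, SeenBits(base=10482000, size=1000)))
```

With one bit per possible ID, a range of a hundred million Cell_ids would need roughly 12.5 MB, independent of how many rows the input has.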
