Selecting rows that have the same value

2024-5-20

text-processing scripting sort

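I am having trouble selecting rows that share the same value. My data set is too large to go through by hand, line by line. Could you suggest a script that can do this?

My data looks like this:

File name: temp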

Start day   hour    end day        hour Value
01/04/2000  22:00   01/05/2000  09:00   -9
01/05/2000  09:00   01/06/2000  09:00   -9
01/06/2000  09:00   01/07/2000  09:00   -9
01/07/2000  09:00   01/08/2000  09:00   -9
01/08/2000  09:00   01/09/2000  09:00   -9
01/09/2000  09:00   01/10/2000  09:00   -9
01/10/2000  09:00   01/11/2000  09:00   -9
01/11/2000  09:00   01/11/2000  21:30   -9
01/11/2000  22:30   01/12/2000  09:00   -9
01/12/2000  09:00   01/13/2000  09:00   -9
01/15/2000  09:00   01/16/2000  09:00   -9
01/16/2000  09:00   01/17/2000  09:00   -9
01/17/2000  09:00   01/18/2000  09:00   -9
01/18/2000  09:00   01/18/2000  22:45   -9
01/18/2000  22:50   01/19/2000  09:00   0.15
01/19/2000  09:00   01/20/2000  09:00   -9
01/20/2000  09:00   01/21/2000  09:00   -9
01/21/2000  09:00   01/22/2000  09:00   -9
01/22/2000  09:00   01/23/2000  09:00   -9
01/23/2000  09:00   01/24/2000  09:00   -9
01/24/2000  09:00   01/25/2000  09:00   -9
01/25/2000  09:00   01/26/2000  00:35   -9
01/26/2000  00:35   01/26/2000  09:00   -9
01/26/2000  09:00   01/27/2000  09:00   -9

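For example, 01/18/2000 appears above twice as a "Start day" and twice as an "end day". I therefore want to keep every row that has 01/18/2000 as its "Start day" or as its "end day".

The output I want for the data above is: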

Start day   hour    end day        hour Value
01/10/2000  09:00   01/11/2000  09:00   -9
01/11/2000  09:00   01/11/2000  21:30   -9
01/11/2000  22:30   01/12/2000  09:00   -9
01/17/2000  09:00   01/18/2000  09:00   -9
01/18/2000  09:00   01/18/2000  22:45   -9
01/18/2000  22:50   01/19/2000  09:00   0.15
01/25/2000  09:00   01/26/2000  00:35   -9
01/26/2000  00:35   01/26/2000  09:00   -9
01/26/2000  09:00   01/27/2000  09:00   -9

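Answer 1

If I understand correctly, you want the rows whose start date or end date occurs more than once. In that case, perhaps something like this: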

awk 'NR==FNR{s[$1]++;e[$3]++;next}
     FNR == 1 || s[$1]>1 || e[$3]>1' temp temp

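That is, make two passes over the file, which is why temp is named twice. During the first pass (while NR==FNR is true), count how often each start date and each end date occurs. During the second pass, print the rows whose start date or end date was counted more than once. The FNR == 1 test keeps the header line in the output.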
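Because the command above names the file twice, it cannot read from a pipe. If the input only arrives on standard input, the same counting rule can be sketched as a single pass that buffers the rows in memory and filters them at the end. This is only a sketch under that assumption, not part of the original answer, and the function name select_repeated_dates is made up for illustration.

```shell
# Hypothetical helper: same selection rule as the two-pass command,
# but the input is read once, so it also works on a pipe.
# Every data row is kept in memory until the END block runs.
select_repeated_dates() {
  awk '
    NR == 1 { print; next }        # pass the header line straight through
    {
      row[NR] = $0                 # remember the whole row
      sd[NR]  = $1                 # its start date
      ed[NR]  = $3                 # its end date
      s[$1]++                      # count occurrences of the start date
      e[$3]++                      # count occurrences of the end date
    }
    END {
      for (i = 2; i <= NR; i++)
        if (s[sd[i]] > 1 || e[ed[i]] > 1)
          print row[i]
    }
  ' "$@"
}
```

It can be called as `select_repeated_dates temp` or at the end of a pipeline. The trade-off is memory: the whole input is held in awk arrays, so for a very large file the two-pass form is likely the better choice.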

Answer 2

If you only need the rows whose start date and end date are the same (without looking at the previous row):

perl -ne 'print if(m!^(\d{2}/\d{2}/\d{4})\s+\d{2}:\d{2}\s+\1!);' < file

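^ matches the start of the line.

(\d{2}/\d{2}/\d{4}) matches the date and stores it (so that we can refer to it later with \1).

\s+\d{2}:\d{2}\s+ matches one or more whitespace characters, two digits, a colon, two digits, and then one or more whitespace characters.

\1 is a backreference to the stored date.

If the line matches, print it.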
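For comparison, the same test can be sketched in awk, whose regular expressions do not support backreferences, by comparing the two date fields directly. This assumes the whitespace-separated layout of the sample data, where the start date is field 1 and the end date is field 3. The function name same_day_rows is hypothetical.

```shell
# Hypothetical helper: print rows whose start date ($1) equals the end date ($3).
# Unlike the regex, this does not check that the fields look like dates;
# it only compares them as strings.
same_day_rows() {
  awk '$1 == $3' "$@"
}
```

It can be called as `same_day_rows temp`. The header line is not printed, because its first and third fields differ.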

Answer 3

I wrote a Perl script that I hope does what you need. It assumes that the data from your example is in a file named temp.

#!/usr/bin/perl

### ./timeextract.pl

## 01/10/2000  09:00   01/11/2000  09:00   -9
## 01/11/2000  09:00   01/11/2000  21:30   -9
## 01/11/2000  22:30   01/12/2000  09:00   -9
## ...
## 01/17/2000  09:00   01/18/2000  09:00   -9
## 01/18/2000  09:00   01/18/2000  22:45   -9
## 01/18/2000  22:50   01/19/2000  09:00   0.15
#  ...
## 01/25/2000  09:00   01/26/2000  00:35   -9
## 01/26/2000  00:35   01/26/2000  09:00   -9
## 01/26/2000  09:00   01/27/2000  09:00   -9
## 01/27/2000  09:00   01/28/2000  09:00   -9

use strict;
use warnings;
use feature qw( say );

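open(my $fh, "<", "temp") or die "Can't open temp: $!";

# state: end date of the last ordinary row, that row's text, and two block flags
my ($prevEndDate, $mRow, $s1, $s2) = ("", "", 0, 0);
my @middleRow;

# read one line at a time so a large file is never held in memory
while (my $cRow = <$fh>) {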
  chomp($cRow);

  my @currentRow = split(/\s+/, $cRow);
  next if $currentRow[0] =~ /Start/;  # skip first row

  ## col1        col2    col3        col4    col5
  ## ----        ----    ----        ----    ----
  ## 01/27/2000  09:00   01/28/2000  09:00   -9

  # identify that we're on the last row of a block that
  # we're interested in, print it, reset & go to the next row
  if ($currentRow[0] eq $prevEndDate && $s2) {
    say $cRow;
    $s1 = $s2 = 0; # reset states, get ready for next block
    next;
  }

  # identify that we're in the middle of a block that
  # we're interested in, so save current row as a middle row
  if ($currentRow[0] ne $currentRow[2]) {
    $prevEndDate = $currentRow[2];  
    @middleRow   = @currentRow;
    $mRow        = $cRow;
    next;
  }

  # identified beginning row of a block of rows that we're interested in
  $s1 = 1 if ($prevEndDate eq $currentRow[0]);
  # identified middle row of a block of rows that we're interested in
  $s2 = 1 if ($s1 == 1 && $currentRow[0] eq $currentRow[2]);

  say $mRow;
  say $cRow;
}

close ($fh);

# vim: set ts=2 nolist :

When you run it, you will see the following output:

$ ./timeextract.pl 
01/10/2000  09:00   01/11/2000  09:00   -9
01/11/2000  09:00   01/11/2000  21:30   -9
01/11/2000  22:30   01/12/2000  09:00   -9
01/17/2000  09:00   01/18/2000  09:00   -9
01/18/2000  09:00   01/18/2000  22:45   -9
01/18/2000  22:50   01/19/2000  09:00   0.15
01/25/2000  09:00   01/26/2000  00:35   -9
01/26/2000  00:35   01/26/2000  09:00   -9
01/26/2000  09:00   01/27/2000  09:00   -9
