Mutual record verification

Question 1

An alternative in Perl, assuming the same interpretation of your question as sg-lecram's:

perl -lne 'tr{ }{}d;      # Remove spaces from the current line
           $lines{$_}++;  # Record the current line in a hash
           END{           # After all lines have been processed
               for(keys %lines){   # Iterate over hash keys
                 # Skip records whose two fields share no letter:
                 next unless /([a-z]).*\1/i;
                 ($first,$second)=split /,/; # Read the two fields
                 # Print the record unless its reciprocal is found:
                 print unless exists $lines{"$second,$first"};
               }
           }' your_file

Question 2

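As far as I understand, the following rules produce the output you want:

A1,A2: same letter (i.e. "group") --> look for A2,A1: not found --> print A1,A2
B1,B2: same letter (i.e. "group") --> look for B2,B1: not found --> print B1,B2
C1,C2: same letter (i.e. "group") --> look for C2,C1: found --> do not print
C2,C1: same letter (i.e. "group") --> look for C1,C2: found --> do not print
A1,C1: different letters (i.e. "groups") --> do not print
A1,B1: different letters (i.e. "groups") --> do not print
B1,A1: different letters (i.e. "groups") --> do not print

Consequently, if A1,A3 were in the list, it should be printed as well:

A1,A3: same letter (i.e. "group") --> look for A3,A1: not found --> print A1,A3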
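
The rules above can also be written down as a short set-based check. This is only a minimal Python sketch for cross-checking, not part of either answer: the function name unpaired_same_group is hypothetical, and it assumes that the group of a field is its first character. Unlike the scripts in this thread, it keeps the input order and does not deduplicate repeated records.

```python
def unpaired_same_group(records):
    """Return same-group pairs whose reciprocal pair is absent.

    Assumption: the group of a field is its first character.
    """
    # Normalise: strip spaces and keep only well-formed two-field records.
    pairs = []
    for record in records:
        fields = record.replace(" ", "").split(",")
        if len(fields) == 2 and all(fields):
            pairs.append((fields[0], fields[1]))

    seen = set(pairs)
    result = []
    for first, second in pairs:
        if first == second:
            continue  # identical fields, e.g. A1,A1
        if first[0] != second[0]:
            continue  # different groups, e.g. A1,C1
        if (second, first) in seen:
            continue  # reciprocal exists, e.g. C1,C2 and C2,C1
        result.append(f"{first},{second}")
    return result


sample = ["A1,A2", "B1,B2", "C1,C2", "C2,C1", "A1,C1",
          "A1,B1", "B1,A1", "A1,A3", "A1,A1"]
print(unpaired_same_group(sample))  # ['A1,A2', 'B1,B2', 'A1,A3']
```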

Provided my understanding is correct, you could do the following:

awk -F, '

  # skip records that do not consist of exactly two different fields
  (NF!=2)||($1==$2){
    next
  }

  # get groups
  {
     g1=substr($1,1,1) # If the groups are not defined as the first...
     g2=substr($2,1,1) # ...character, adjust these lines accordingly.
  }

  # only consider records with matching groups
  g1!=g2{
    next
  }

  # are we looking for the current record?
  ($2 in fst2scd)&&(fst2scd[$2]~FS""$1""FS){

    # remove "reciprocal" pair from the list (assuming record uniqueness -->...
    sub(FS""$1""FS,FS,fst2scd[$2]) # ...consider piping through sort -u first)

    # was that the last record ending with $2 we were looking for (so far)?
    if(fst2scd[$2]==FS){

      # remove $2 from the list (for now)
      delete fst2scd[$2]
    }

    # this "reciprocal" pair is done
    next
  }

  # if we reach this point, we found a new pair
  {

    # is this the first non-"reciprocal" record starting with $1?
    if(!($1 in fst2scd)){

      # add $1 to the list
      fst2scd[$1]=FS
    }

    # start looking for a "reciprocal" record
    fst2scd[$1]=fst2scd[$1]""$2""FS
  }

  # after processing all records, we know all non-"reciprocal" records
  END{

    # use the same separator for output that was used in input
    OFS=FS

    # iterate over all starts of records we are still looking for
    for(fst in fst2scd){

      # remove initial and final FS from list entry
      sub("^"FS,"",fst2scd[fst])
      sub(FS"$","",fst2scd[fst])

      # get all ends of records with the current start we are still looking for
      split(fst2scd[fst],scd,FS)

      # iterate over all the ends obtained in the previous step
      for(i in scd){

        # print the non-"reciprocal" records
        print fst,scd[i]
      }
    }
  }
' <<_INPUT_
A1,A2
B1,B2
C1,C2
C2,C1
A1,C1
A1,B1
B1,A1
A1,A3
A1,A1
_INPUT_

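This produces the following output (the line order may vary, because awk does not define the iteration order of for(... in ...)):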

A1,A2
A1,A3
B1,B2

Note the use of FS throughout the script, which allows the same code to run on, e.g., TSV files whose entries may contain a comma (,).
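
To illustrate why it helps to keep the separator as a parameter, here is a hedged Python sketch of the same streaming bookkeeping with a configurable separator. It is an illustration only, not a translation of the awk script line by line: pending_pairs and its sep argument are hypothetical names, records are assumed to be unique, and the group is again assumed to be the first character of each field.

```python
def pending_pairs(lines, sep=","):
    """Stream records and keep those still waiting for a reciprocal.

    Assumptions: records are unique, and the group of a field is its
    first character.
    """
    pending = {}  # first field -> list of second fields still unmatched
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) != 2 or fields[0] == fields[1]:
            continue  # not exactly two different fields
        first, second = fields
        if not first or not second or first[0] != second[0]:
            continue  # empty field or different groups
        waiting = pending.get(second, [])
        if first in waiting:
            # The current record is the reciprocal of an earlier one.
            waiting.remove(first)
            if not waiting:
                del pending[second]
            continue
        pending.setdefault(first, []).append(second)
    return [sep.join((first, second))
            for first, seconds in pending.items()
            for second in seconds]


# Tab-separated input whose entries contain commas.
tsv = ["A,1\tA,2", "C,1\tC,2", "C,2\tC,1", "A,1\tB,1"]
print(pending_pairs(tsv, sep="\t"))  # ['A,1\tA,2']
```

Only the sep argument changes between the comma-separated and the tab-separated case; the matching logic never hard-codes a comma.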

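If you need further help understanding how this code works and/or improving or adapting it, feel free to comment.

Also note that I assume you are running GNU awk (i.e. gawk). If that is not the case, I can help you adapt the code to work with plain awk.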
