Add "exception" words to the rule in a perl title-matching script

I have been using this perl script (thanks to Jeff Schaller) to match 3 or more words in the title fields of two separate csv files, as follows:

#!/usr/bin/env perl

my @csv2 = ();
open CSV2, "<csv2" or die;
@csv2 = <CSV2>;                  # read every line of csv2
close CSV2;

my %csv2hash = ();
for (@csv2) {
  chomp;
  my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title 
  $csv2hash{$_} = $title;
}

open CSV1, "<csv1" or die;
while (<CSV1>) {
  chomp;
  my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title 
  my @titlewords = split /\s+/, $title;    #/ get words
  my $desired = 3;
  my $matched = 0;
  foreach my $csv2 (keys %csv2hash) {
    my $count = 0;
    my $value = $csv2hash{$csv2};
    foreach my $word (@titlewords) {
      ++$count if $value =~ /\b$word\b/i;
      last if $count >= $desired;
    }
    if ($count >= $desired) {
      print "$csv2\n";
      ++$matched;
    }
  }
  print "$_\n" if $matched;
}
close CSV1;

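Since then I have realised that I want to ignore certain words in the titles and not have them count as matching words. I have been removing them with sed before comparing the csv files, but that is not ideal because I lose data in the process. How can I add words that this perl script treats as exceptions? For example, say I want the script to ignore the three separate words "and", "if" and "the" when matching titles, so that they become exceptions to the rule.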



After the line

my @titlewords = split /\s+/, $title;    #/ get words

add a loop that drops the unwanted words:

my @new;
foreach my $t (@titlewords){
    push(@new, $t) if $t !~ /^(and|if|the)$/i;
}
@titlewords = @new;


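This is very similar to @meuh's answer, but instead of adding a foreach loop after the split line, you only need to add a single line there, using either perl's grep function or its map function: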

@titlewords = grep (!/^(and|if|the)$/i, @titlewords);

or

@titlewords = map { /^(and|if|the)$/i ? () : $_ } @titlewords;

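See perldoc -f grep and perldoc -f map for more details on these functions and the differences between them. They are commonly used in many perl scripts (map especially), so it is worth taking the time to understand what they do and how they work.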
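To make the difference between the two one-liners concrete, here is a minimal sketch that builds the exception pattern from a list and applies it with both grep and map. The names @exceptions, $exception_re, filter_with_grep and filter_with_map are made up for this illustration and are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical exception list; edit to suit.
my @exceptions = qw(and if the);

# Build one anchored, case-insensitive pattern from the list.
# quotemeta guards against regex metacharacters in a word.
my $alternatives = join '|', map { quotemeta($_) } @exceptions;
my $exception_re = qr/^(?:$alternatives)$/i;

# grep keeps only the elements for which the test is true.
sub filter_with_grep {
    my @words = @_;
    return grep { !/$exception_re/ } @words;
}

# map may emit zero or more elements per input element;
# returning the empty list () drops the word.
sub filter_with_map {
    my @words = @_;
    return map { /$exception_re/ ? () : $_ } @words;
}

my @titlewords = split /\s+/, 'The Sun and the Moon';
print join(' ', filter_with_grep(@titlewords)), "\n";   # Sun Moon
print join(' ', filter_with_map(@titlewords)),  "\n";   # Sun Moon
```

For plain filtering grep states the intent more directly; map only becomes the better choice when you also want to transform the words that are kept (for example, lowercasing them in the same pass).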

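BTW, do not use #!/usr/bin/env perl. Using env like this is bad enough for python and ruby scripts (where, unfortunately, it is the convention), but it is completely broken for perl scripts and is definitely not the traditional way to run them.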

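perl has many command-line options that can significantly change its behaviour, depending on the kind of program you are writing. Using env to run an interpreter like perl completely breaks the ability to pass command-line options to the interpreter, because env does not support it. env was not even designed for this purpose; using it this way is just an ugly hack that exploits a side effect of env's actual purpose, which is to set environment variables before running a program.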


Here is another perl script that does what you want, but this one uses the Class::CSV and List::Compare modules, and two hashes-of-arrays to compare the CSV files:

#! /usr/bin/perl

use strict;
use warnings;

use Class::CSV;
use List::Compare;

sub parse_csv($%) {
  my($filename,$tw) = @_;

  # exclude the following word list and the "empty word"
  my @exceptions = qw(and if the);
  my $exceptions = '^(\s*|' . join('|',@exceptions) . ')$';

  my $csv = Class::CSV->parse(
      filename => $filename,
      fields   => [qw/id title num1 num2/]
  );

  # build a hash-of-arrays (HoA), keyed by the CSV line. Each array
  # contains the individual words from each title for that line (except
  # for those matching $exceptions).  The words are all converted to
  # lowercase to enable case-insensitive matches.
  foreach my $line (@{$csv->lines()}) {

    # The following three lines are required because the input file has
    # fields separated by ', ' rather than just ',' which makes
    # Class::CSV interpret the numeric fields as strings.
    # It's easier/quicker to do this than to rewrite using Text::CSV.
    # The final output will be properly-formed CSV, with only a comma as
    # field separator and quotes around the title string.
    my $key = join(',',$line->id,'"'.$line->title.'"',$line->num1,$line->num2);
    $key =~ s/([",])\s+/$1/g;   # trim whitespace immediately following " or ,
    $key =~ s/\s+([",])/$1/g;   # trim whitespace immediately preceding " or ,

    # If it wasn't for the not-quite-right CSV format, we could just use:
    #my $key = $line->string;

    # lc() does the lowercasing promised in the comment above
    push @{ $tw->{$key} }, grep (!/$exceptions/oi, split(/\s+/,lc($line->title)));
  }
}

# two hashes to hold the titlewords HoAs
my %tw1=();
my %tw2=();

parse_csv('csv1',\%tw1);
parse_csv('csv2',\%tw2);

# now compare the HoAs
foreach my $k2 (sort keys %tw2) {
  my @matches = ();
  foreach my $k1 (sort keys %tw1) {
    my $lc = List::Compare->new('-u', \@{ $tw2{$k2} }, \@{ $tw1{$k1} });
    # numeric >= here: the string operator ge would treat "10" as less than "3"
    push @matches, $k1 if ($lc->get_intersection >= 3);
  }
  print join("\n",sort(@matches,$k2)),"\n\n" if (@matches);
}


11,"The Sun Still Shines in Reading",64312,464566
97,"Reading Still Shines",545464,16748967


BTW, if you don't want the double quotes around the title field, edit the my $key=join(...) line that adds them so that it no longer does.
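If installing CPAN modules is not an option, the core of the comparison (counting the distinct non-exception words that two titles share) can be sketched with plain hashes. This is only an illustration of the idea, not a replacement for the script above; the names %is_exception, title_words and shared_word_count are made up for this example, and it works on bare title strings rather than on whole CSV lines.

```perl
use strict;
use warnings;

# Hypothetical exception list and threshold, mirroring the scripts above.
my %is_exception = map { $_ => 1 } qw(and if the);
my $desired      = 3;

# Return the distinct, lowercased, non-exception words of a title
# as a hash (word => 1).
sub title_words {
    my ($title) = @_;
    my %words;
    for my $w (split /\s+/, lc $title) {
        next if $w eq '' or $is_exception{$w};
        $words{$w} = 1;
    }
    return %words;
}

# Count the distinct words that two titles have in common.
sub shared_word_count {
    my ($title_one, $title_two) = @_;
    my %first  = title_words($title_one);
    my %second = title_words($title_two);
    return scalar(grep { $second{$_} } keys %first);
}

my $n = shared_word_count('The Sun Still Shines in Reading',
                          'Reading Still Shines');
print "shared words: $n\n";          # shared words: 3
print "match\n" if $n >= $desired;   # match
```

Using hashes means each word is counted once per title, which is the same effect the intersection in the List::Compare version has.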
