Adding "exception" words to the rules of a perl title-matching script

Question 1

After the line

my @titlewords = split /\s+/, $title;    #/ get words

add code to remove the unwanted words from the array:

my @new;
foreach my $t (@titlewords){
    push(@new, $t) if $t !~ /^(and|if|the)$/i;
}
@titlewords = @new;
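
If the list of exception words is likely to grow, one option is to keep it in an array and build the pattern from that. This is only a sketch: the sample title is made up, and @exceptions and $skip are names chosen for illustration.

```perl
use strict;
use warnings;

my $title      = 'The Sun Still Shines in Reading';   # hypothetical sample title
my @exceptions = qw(and if the);                      # words to ignore

# Build one anchored, case-insensitive pattern from the list.
my $alternation = join '|', map { quotemeta($_) } @exceptions;
my $skip        = qr/^(?:$alternation)$/i;

my @titlewords = split /\s+/, $title;

# Keep only the words that are not exception words.
my @new;
foreach my $t (@titlewords) {
    push(@new, $t) if $t !~ $skip;
}
@titlewords = @new;

print join(' ', @titlewords), "\n";   # Sun Still Shines in Reading
```

Adding another exception then only means adding a word to the qw() list.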

Answer


Question 2

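This is very similar to @meuh's answer, but instead of adding a foreach loop after that split line, you can add a single line that uses perl's grep function or its map function: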

@titlewords = grep (!/^(and|if|the)$/i, @titlewords);

or

@titlewords = map { /^(and|if|the)$/i ? () : $_ } @titlewords;

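For more details on these functions and the differences between them, see perldoc -f grep and perldoc -f map. They are commonly used in many perl scripts (especially map), so it is worth taking the time to understand what they do and how they work.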
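
To make the difference concrete, here is a small sketch with a made-up word list: grep only selects elements, while map returns whatever its block returns, so it can drop elements (by returning an empty list) and transform the rest in the same pass.

```perl
use strict;
use warnings;

my @titlewords = qw(The Sun and the Moon);   # hypothetical sample words

# grep keeps the elements for which the block is true, unchanged.
my @kept = grep { !/^(and|if|the)$/i } @titlewords;

# map returns the block's result for each element: an empty list drops
# the word, lc($_) keeps it in lowercase.
my @lowered = map { /^(and|if|the)$/i ? () : lc($_) } @titlewords;

print "@kept\n";      # Sun Moon
print "@lowered\n";   # sun moon
```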

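BTW, do NOT use #!/usr/bin/env perl. Using env like this is bad enough for python or ruby scripts (where, unfortunately, it is the convention), but it is completely broken for perl scripts, and it is definitely not the traditional way of running them.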

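perl has many command-line options that can significantly change its behaviour, depending on what kind of program you are writing. Using env to run the perl interpreter like this completely breaks the ability to pass command-line options to the interpreter, because env does not support it. env was not even designed for this purpose; using it this way is just an ugly hack that exploits a side effect of env's actual purpose, which is to set environment variables before running a program.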

Please use #!/usr/bin/perl (or whatever the path to your perl interpreter is).
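
For example, with the interpreter named directly, a switch can be placed on the #! line. This is a minimal sketch, assuming perl is installed at /usr/bin/perl; it uses -w, which turns on perl's global warnings flag ($^W), purely to show that the option reaches the interpreter.

```perl
#!/usr/bin/perl -w

use strict;

# $^W reflects the -w switch: true when global warnings are enabled.
my $status = $^W ? 'on' : 'off';
print "global warnings: $status\n";
```

Saved to a file, made executable and run directly, this should report that warnings are on. With #!/usr/bin/env perl -w on Linux, the kernel typically hands env the single argument "perl -w", so the script does not start at all.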

Here is another perl script that does what you want, but this one uses the Class::CSV and List::Compare modules, and two hashes-of-arrays, to compare the CSV files:

#! /usr/bin/perl

use strict;
use warnings;

use Class::CSV;
use List::Compare;

sub parse_csv($%) {
  my($filename,$tw) = @_;

  # exclude the following word list and the "empty word"
  my @exceptions = qw(and if the);
  my $exceptions = '^(\s*|' . join('|',@exceptions) . ')$';


  my $csv = Class::CSV->parse(
      filename => $filename,
      fields   => [qw/id title num1 num2/]
  );

  # build a hash-of-arrays (HoA), keyed by the CSV line. Each array
  # contains the individual words from each title for that line (except
  # for those matching $exceptions).  The words are all converted to
  # lowercase to enable case-insensitive matches.
  foreach my $line (@{$csv->lines()}) {

    # The following three lines are required because the input file has
    # fields separated by ', ' rather than just ',' which makes
    # Class::CSV interpret the numeric fields as strings.
    # It's easier/quicker to do this than to rewrite using Text::CSV.
    #
    # The final output will be properly-formed CSV, with only a comma as
    # field separator and quotes around the title string.
    my $key = join(',',$line->id,'"'.$line->title.'"',$line->num1,$line->num2);
    $key =~ s/([",])\s+/$1/g;   # trim whitespace immediately following " or ,
    $key =~ s/\s+([",])/$1/g;   # trim whitespace immediately preceding " or ,

    # If it wasn't for the not-quite-right CSV format, we could just use:
    #my $key = $line->string;

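    # lc() is applied here so that the stored words really are lowercase,
    # as the comment above describes.
    push @{ $tw->{$key} }, map { lc($_) } grep { !/$exceptions/oi } split(/\s+/,$line->title);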
  };
};

# two hashes to hold the titlewords HoAs
my %tw1=();
my %tw2=();

parse_csv('csv1',\%tw1);
parse_csv('csv2',\%tw2);

# now compare the HoAs
foreach my $k2 (sort keys %tw2) {
  my @matches = ();
  foreach my $k1 (sort keys %tw1) {
    my $lc = List::Compare->new('-u', \@{ $tw2{$k2} }, \@{ $tw1{$k1} });
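    my @common = $lc->get_intersection;
    # compare the count numerically ('ge' is a string comparison, so "10" ge "3" is false)
    push @matches, $k1 if (@common >= 3);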
  };
  print join("\n",sort(@matches,$k2)),"\n\n" if (@matches);
};

Output:

11,"The Sun Still Shines in Reading",64312,464566
97,"Reading Still Shines",545464,16748967

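Each set of matches is sorted. Each set is also printed as a separate paragraph (i.e. separated by a blank line), although the sample output does not show this because there is only one set of matches.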
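
The matching rule in the comparison loop (two titles match when they share at least three words) can also be illustrated without the modules. This sketch uses a plain hash instead of List::Compare; shared_words is a hypothetical helper name, and the word lists are made-up samples with the exception words already removed and lowercased.

```perl
use strict;
use warnings;

# Hypothetical helper: return the distinct words present in both lists,
# in the order they appear in the second list.
sub shared_words {
    my ($list1, $list2) = @_;
    my %in_first;
    $in_first{$_} = 1 for @$list1;
    my %seen;
    return grep { $in_first{$_} && !$seen{$_}++ } @$list2;
}

my @t1 = qw(sun still shines in reading);
my @t2 = qw(reading still shines);

my @common = shared_words(\@t1, \@t2);
my $count  = @common;   # an array in scalar context gives its element count

print "shared: @common\n";        # shared: reading still shines
print "match\n" if $count >= 3;   # numeric comparison, as for any count
```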

BTW, if you don't want double quotes around the title field, edit the my $key=join(...) line that adds them so that it no longer does.
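
As a rough illustration of that change, the sketch below builds the key both ways. The field values are made up and stand in for $line->id, $line->title and so on; Class::CSV is not used here, and tidy is a hypothetical helper that wraps the same two substitutions as the script.

```perl
use strict;
use warnings;

# Made-up field values, with the stray leading spaces left by ', ' separators.
my ($id, $title, $num1, $num2) =
    ('97', ' Reading Still Shines', ' 545464', ' 16748967');

# Hypothetical helper applying the script's whitespace clean-up to a key.
sub tidy {
    my ($key) = @_;
    $key =~ s/([",])\s+/$1/g;   # trim whitespace immediately following " or ,
    $key =~ s/\s+([",])/$1/g;   # trim whitespace immediately preceding " or ,
    return $key;
}

my $quoted   = tidy(join(',', $id, '"' . $title . '"', $num1, $num2));
my $unquoted = tidy(join(',', $id, $title, $num1, $num2));

print "$quoted\n";     # 97,"Reading Still Shines",545464,16748967
print "$unquoted\n";   # 97,Reading Still Shines,545464,16748967
```

Note that without the quotes, a title that itself contains a comma would no longer be valid CSV.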

Answer