I have been using this Perl script (thanks to Jeff Schaller) to match three or more words in the title fields of two separate CSV files.
The script is:
#!/usr/bin/env perl
my @csv2 = ();
open CSV2, "<csv2" or die;
@csv2=<CSV2>;
close CSV2;
my %csv2hash = ();
for (@csv2) {
    chomp;
    my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title
    $csv2hash{$_} = $title;
}
open CSV1, "<csv1" or die;
while (<CSV1>) {
    chomp;
    my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ match the title
    my @titlewords = split /\s+/, $title; #/ get words
    my $desired = 3;
    my $matched = 0;
    foreach my $csv2 (keys %csv2hash) {
        my $count = 0;
        my $value = $csv2hash{$csv2};
        foreach my $word (@titlewords) {
            ++$count if $value =~ /\b$word\b/i;
            last if $count >= $desired;
        }
        if ($count >= $desired) {
            print "$csv2\n";
            ++$matched;
        }
    }
    print "$_\n" if $matched;
}
close CSV1;
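Since then, I have realized that I want to ignore certain words in the titles, so that they are not counted as matching words. I have been using sed to delete them before comparing the CSV files, but that is not ideal because I lose data in the process. How can I add words that this Perl script treats as exceptions? For example, suppose I want the script to ignore the three words and, if and the when matching titles, so that they become exceptions to the rule.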
Answer 1
After the line
my @titlewords = split /\s+/, $title; #/ get words
add code to remove the words from the array:
my @new;
foreach my $t (@titlewords){
    push(@new, $t) if $t !~ /^(and|if|the)$/i;
}
@titlewords = @new;
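If the list of words to ignore grows, a hash lookup may be easier to maintain than a long regex alternation. The following is only a sketch of that variation, not part of the original script: the names `%ignore` and `remove_ignored` are made up for illustration, and the sample title is taken from the output shown further down.

```perl
use strict;
use warnings;

# Hypothetical word list: lowercase words to ignore, stored as hash keys.
my %ignore;
$ignore{$_} = 1 for qw(and if the);

# Hypothetical helper: return the words that are not in %ignore,
# comparing case-insensitively by lowercasing each word first.
sub remove_ignored {
    my @kept;
    foreach my $word (@_) {
        push @kept, $word unless $ignore{ lc $word };
    }
    return @kept;
}

my @titlewords = split /\s+/, 'The Sun Still Shines in Reading';
@titlewords = remove_ignored(@titlewords);
print "@titlewords\n";    # prints: Sun Still Shines in Reading
```

In the original script, the call to `remove_ignored` would go right after the `split` line, in place of the `foreach` loop above.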
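Answer 2
This is very similar to @meuh's answer, but you don't need to add a foreach loop after the split line. Instead, add a single line there using perl's grep function or its map function:
@titlewords = grep (!/^(and|if|the)$/i, @titlewords);
or
@titlewords = map { /^(and|if|the)$/i ? () : $_ } @titlewords;
For more details about these functions, and the differences between them, see perldoc -f grep and perldoc -f map. They are commonly used in many perl scripts (map especially), so it is worth taking the time to learn what they do and understand how they work.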
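To see that the two one-liners give the same result, here is a small self-contained sketch. The sample title and the variable names `@via_grep` and `@via_map` are chosen only for illustration.

```perl
use strict;
use warnings;

my @titlewords = split /\s+/, 'The Sun Still Shines in Reading';

# grep keeps only the elements for which the block is true.
my @via_grep = grep { !/^(and|if|the)$/i } @titlewords;

# map returns a list for each element; returning the empty list drops it.
my @via_map = map { /^(and|if|the)$/i ? () : $_ } @titlewords;

print "grep: @via_grep\n";    # prints: grep: Sun Still Shines in Reading
print "map:  @via_map\n";     # prints: map:  Sun Still Shines in Reading
```

grep is the more direct choice when you only want to filter. map becomes useful when you also want to transform the words you keep, for example by lowercasing them.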
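BTW, do not use #!/usr/bin/env perl. Using env like this is bad enough for python and ruby scripts (where it is, unfortunately, the convention), but it is completely broken for perl scripts, and it is definitely not the traditional way to run them.
perl has many command-line options that can significantly change its behaviour, depending on what kind of program you want to write. Using env to run the perl interpreter like this completely breaks the ability to pass command-line options to the interpreter (because env doesn't support it. env wasn't even designed for this; doing so is just an ugly hack that exploits a side effect of env's actual purpose, which is setting environment variables before running a program).
Use #!/usr/bin/perl (or whatever the path to your perl interpreter is) instead.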
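As a small illustration of an option on the #! line, the sketch below passes -w, which turns on perl's global warnings flag (readable as `$^W`). The helper name `warnings_flag` is made up for this example, and the path /usr/bin/perl is an assumption about where your interpreter lives.

```perl
#!/usr/bin/perl -w
# The -w switch on the line above turns on warnings globally.
# With "#!/usr/bin/env perl -w" on Linux, the kernel would pass "perl -w"
# to env as one single argument, so env would look for a program
# literally named "perl -w".
use strict;

# Hypothetical helper: report whether the global -w flag is active.
sub warnings_flag {
    return $^W ? 1 : 0;
}

print "global warnings flag: ", warnings_flag(), "\n";
```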
Here is another perl script that does what you want, but this one uses the Class::CSV and List::Compare modules, along with two hashes-of-arrays, to compare the CSV files:
#! /usr/bin/perl
use strict;
use warnings;
use Class::CSV;
use List::Compare;
sub parse_csv($%) {
    my($filename,$tw) = @_;
    # exclude the following word list and the "empty word"
    my @exceptions = qw(and if the);
    my $exceptions = '^(\s*|' . join('|',@exceptions) . ')$';
    my $csv = Class::CSV->parse(
        filename => $filename,
        fields => [qw/id title num1 num2/]
    );
    # build a hash-of-arrays (HoA), keyed by the CSV line. Each array
    # contains the individual words from each title for that line (except
    # for those matching $exceptions). The words are all converted to
    # lowercase to enable case-insensitive matches.
    foreach my $line (@{$csv->lines()}) {
        # The following three lines are required because the input file has
        # fields separated by ', ' rather than just ',' which makes
        # Class::CSV interpret the numeric fields as strings.
        # It's easier/quicker to do this than to rewrite using Text::CSV.
        #
        # The final output will be properly-formed CSV, with only a comma as
        # field separator and quotes around the title string.
        my $key = join(',',$line->id,'"'.$line->title.'"',$line->num1,$line->num2);
        $key =~ s/([",])\s+/$1/g; # trim whitespace immediately following " or ,
        $key =~ s/\s+([",])/$1/g; # trim whitespace immediately preceding " or ,
        # If it wasn't for the not-quite-right CSV format, we could just use:
        #my $key = $line->string;
        # lc() lowercases the title so that the comparison below is
        # case-insensitive, as described in the comment above.
        push @{ $tw->{$key} }, grep (!/$exceptions/oi, split(/\s+/,lc($line->title)));
    };
};
# two hashes to hold the titlewords HoAs
my %tw1=();
my %tw2=();
parse_csv('csv1',\%tw1);
parse_csv('csv2',\%tw2);
# now compare the HoAs
foreach my $k2 (sort keys %tw2) {
    my @matches = ();
    foreach my $k1 (sort keys %tw1) {
        my $lc = List::Compare->new('-u', \@{ $tw2{$k2} }, \@{ $tw1{$k1} });
        # numeric comparison (>=) rather than string comparison (ge), so
        # that counts of 10 or more are handled correctly
        push @matches, $k1 if ($lc->get_intersection >= 3);
    };
    print join("\n",sort(@matches,$k2)),"\n\n" if (@matches);
};
Output:
11,"The Sun Still Shines in Reading",64312,464566
97,"Reading Still Shines",545464,16748967
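Each group of matches is sorted and, even though the sample output doesn't show it (because there is only one group of matches), each group is printed as a separate paragraph (i.e. separated by a blank line).
BTW, if you don't want double quotes around the title field, edit the my $key=join(...) line that adds them so that it doesn't.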
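If installing List::Compare is not an option, the comparison step in the script above (counting how many distinct words two titles share) can be approximated in core Perl. This is only a sketch under that assumption, not a drop-in replacement: `count_common_words` is a hypothetical helper name, and the two word lists are taken from the sample output above, with "The" already filtered out.

```perl
use strict;
use warnings;

# Hypothetical helper: count the distinct words that two lists have in
# common, comparing case-insensitively.
sub count_common_words {
    my ($list1, $list2) = @_;
    my %in_first;
    $in_first{ lc $_ } = 1 for @$list1;
    my %common;
    foreach my $word (@$list2) {
        $common{ lc $word } = 1 if $in_first{ lc $word };
    }
    return scalar keys %common;
}

my @title1 = qw(Sun Still Shines in Reading);
my @title2 = qw(Reading Still Shines);
my $shared = count_common_words(\@title1, \@title2);
print "shared words: $shared\n";      # prints: shared words: 3
print "match\n" if $shared >= 3;      # prints: match
```

Using a hash for `%common` means a word repeated within one title is counted only once.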