Compare the similarity or edit distance between every pair of lines in a file?

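I want to find the most similar pair of lines in a file, using something like edit distance. For example, given a file with the following contents: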

What is your favorite color?
What is your favorite food?
Who was the 8th president?
Who was the 9th president?

...it would return lines 3 and 4 as the most similar pair.

Ideally, I'd like to be able to compute the top X most similar pairs. So, using the example above, the second most similar pair would be lines 1 and 2.

Answer 1

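I'm not familiar with Levenshtein distance, but Perl has modules for computing edit distance, so I wrote a simple Perl script that computes the distance for every pair of lines in the input and then prints the pairs in order of increasing distance, subject to an optional "top X" parameter (-n N) that limits how many distinct distances are printed: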

#!/usr/bin/perl -w
use strict;
use Text::Levenshtein qw(distance);
use Getopt::Std;

our $opt_n;
getopts('n:');
$opt_n ||= -1; # print all the matches if -n is not provided

my @lines=<>;
my %distances = ();

# for each combination of two lines, compute distance
for my $i (0 .. $#lines - 1) {
  for my $j ($i + 1 .. $#lines) {
    my $d = distance($lines[$i], $lines[$j]);
    push @{ $distances{$d} }, $lines[$i] . $lines[$j];
  }
}

# print in order of increasing distance
foreach my $d (sort { $a <=> $b } keys %distances) {
  print "At distance $d:\n" . join("\n", @{ $distances{$d} }) . "\n";
  last unless --$opt_n;
}

On the sample input, it gives:

$ ./solve.pl < input
At distance 1:
Who was the 8th president?
Who was the 9th president?

At distance 3:
What is your favorite color?
What is your favorite food?

At distance 21:
What is your favorite color?
Who was the 8th president?
What is your favorite color?
Who was the 9th president?
What is your favorite food?
Who was the 8th president?
What is your favorite food?
Who was the 9th president?

And with the optional parameter:

$ ./solve.pl -n 2 < input
At distance 1:
Who was the 8th president?
Who was the 9th president?

At distance 3:
What is your favorite color?
What is your favorite food?

I wasn't sure how to print the output unambiguously, but the strings can be printed however you like.
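One way to make the output unambiguous is to report line numbers alongside each distance. The following is only a rough sketch of that idea in Python rather than Perl, using nothing beyond the standard library. The function names levenshtein and closest_pairs are made up for this illustration, and unlike the -n option of the script above, the n argument here limits the number of pairs rather than the number of distinct distances.

```python
import heapq
from itertools import combinations


def levenshtein(a, b):
    """Edit distance where insertion, deletion and substitution each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the DP row short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]


def closest_pairs(lines, n):
    """Return the n closest pairs as (distance, line_a, line_b), 1-based."""
    scored = (
        (levenshtein(a, b), i, j)
        for (i, a), (j, b) in combinations(enumerate(lines, start=1), 2)
    )
    return heapq.nsmallest(n, scored)


# Sample input taken from the question.
sample = [
    "What is your favorite color?",
    "What is your favorite food?",
    "Who was the 8th president?",
    "Who was the 9th president?",
]

for d, i, j in closest_pairs(sample, 2):
    print(f"distance {d}: line {i} / line {j}")
```

On the sample from the question this reports lines 3 and 4 at distance 1, then lines 1 and 2 at distance 3. Like the Perl script, it compares every pair of lines, so the running time grows quadratically with the number of lines.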
