Search for three consecutive words

My book list (a txt file) contains duplicates, like this:

The Ideal Team Player
The Ideal Team Player: How to Recognize and Cultivate The Three Essential Virtues
Ideal Team Player: Recognize and Cultivate The Three Essential Virtues

Joy on Demand: The Art of Discovering the Happiness Within
Crucial Conversations Tools for Talking When Stakes Are High
Joy on Demand

Search Inside Yourself: The Unexpected Path to Achieving Success, Happiness
Search Inside Yourself
......
......
......

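I need to find the duplicate books and delete them manually after checking. I searched and found that the existing solutions need a pattern for the lines.

For example:

Delete duplicate lines based on partial line comparison

Find partially duplicate lines in a file and count how many times each line was duplicated?

In my case, finding a pattern for the lines is difficult. However, I did find a pattern in the sequence of words.

I only want to mark lines as duplicates if they share three consecutive words (case-insensitive).

If you look at these lines: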

The Ideal Team Player
The Ideal Team Player: How to Recognize and Cultivate The Three Essential Virtues
Ideal Team Player: Recognize and Cultivate The Three Essential Virtues

Ideal Team Player is the sequence of consecutive words I am looking for.

I would like the output to look like this:

3 Ideal Team Player
2 Joy on Demand
2 Search Inside Yourself
......
......
......

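How can I achieve this?

Answer 1

The following awk program counts how many times each group of three consecutive words occurs (after removing punctuation) and, at the end, prints the count and the word group wherever the count is greater than 1: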

{
        gsub("[[:punct:]]", "")

        for (i = 3; i <= NF; ++i)
                w[$(i-2),$(i-1),$i]++
}
END {
        for (key in w) {
                count = w[key]
                if (count > 1) {
                        gsub(SUBSEP," ",key)
                        print count, key
                }
        }
}

Given the text in your question, this produces

2 Search Inside Yourself
2 Cultivate The Three
2 The Three Essential
2 Joy on Demand
2 Recognize and Cultivate
2 Three Essential Virtues
2 and Cultivate The
2 The Ideal Team
3 Ideal Team Player

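As you can see, this may not be that useful.

Instead, we can collect the same count information and then make a second pass over the file, printing every line that contains a word triplet whose count is greater than 1: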

NR == FNR {
        gsub("[[:punct:]]", "")

        for (i = 3; i <= NF; ++i)
                w[$(i-2),$(i-1),$i]++

        next
}

{
        orig = $0
        gsub("[[:punct:]]", "")

        for (i = 3; i <= NF; ++i)
                if (w[$(i-2),$(i-1),$i] > 1) {
                        print orig
                        next
                }
}

Testing on your file:

$ cat file
The Ideal Team Player
The Ideal Team Player: How to Recognize and Cultivate The Three Essential Virtues
Ideal Team Player: Recognize and Cultivate The Three Essential Virtues

Joy on Demand: The Art of Discovering the Happiness Within
Crucial Conversations Tools for Talking When Stakes Are High
Joy on Demand

Search Inside Yourself: The Unexpected Path to Achieving Success, Happiness
Search Inside Yourself
$ awk -f script.awk file file
The Ideal Team Player
The Ideal Team Player: How to Recognize and Cultivate The Three Essential Virtues
Ideal Team Player: Recognize and Cultivate The Three Essential Virtues
Joy on Demand: The Art of Discovering the Happiness Within
Joy on Demand
Search Inside Yourself: The Unexpected Path to Achieving Success, Happiness
Search Inside Yourself

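Caveat: this awk program needs enough memory to store roughly three times the text of your file, and it may find duplicates in common phrases even when the entries are not really duplicates (e.g., "how to cook" may be part of the titles of several books).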
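The question asks for a case-insensitive comparison, and the awk programs above compare words exactly as written. As a minimal sketch of the same two-pass idea with case folding added, here is a Python version. The function names and the sample list are illustrative assumptions, not part of the original answer, and the punctuation class is ASCII-only.

```python
import re
import string
from collections import Counter

# Strip ASCII punctuation, roughly what gsub("[[:punct:]]", "") does in awk.
PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")

def trigrams(line):
    """Return the lowercased runs of three consecutive words in a line."""
    words = PUNCT.sub("", line).lower().split()
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]

def flag_lines(lines):
    """Return the lines containing a word triple that occurs more than once."""
    # First pass: count every triple (repeats inside one line count too,
    # as they do in the awk version).
    counts = Counter(t for line in lines for t in trigrams(line))
    # Second pass: keep lines with at least one repeated triple.
    return [line for line in lines
            if any(counts[t] > 1 for t in trigrams(line))]

titles = [
    "The Ideal Team Player",
    "Ideal Team Player: Recognize and Cultivate The Three Essential Virtues",
    "Crucial Conversations Tools for Talking When Stakes Are High",
    "JOY ON DEMAND",
    "Joy on Demand: The Art of Discovering the Happiness Within",
]
flagged = flag_lines(titles)
print("\n".join(flagged))
```

Because the words are lowercased before counting, "JOY ON DEMAND" and "Joy on Demand: ..." are flagged together, while the "Crucial Conversations" line is left out.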

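Answer 2

IMO, this task is best solved using the intersection of word sets rather than by looking for 3 consecutive words.

So the following perl script does exactly that: it does not look for 3 consecutive words. Instead, it first reads the entire input (from stdin and/or one or more files) and, using the Set::Tiny module, creates a set of words for each input line.

It then processes the input a second time and, for each line, prints any lines read in the first pass that are exact duplicates or whose word sets have an intersection of 3 or more elements.

It uses a hash, %sets, to store the word set for each title, and another hash, %titles, to count how many times it has seen each title. The count is used in the output phase to make sure no title is printed more often than it was seen in the input.

In short, it prints duplicate lines and similar lines (i.e. lines with at least 3 words in common) next to each other. The 3 words do not have to be consecutive.

The script ignores several very common small words when building the sets, but this can be disabled by commenting out or deleting the line whose comment starts with "optional". Or you can edit the list of common words to suit your needs.

It is worth mentioning that the small-word list in the script includes the word by. You can remove it from the list if you like, but it is there to stop the script from matching on by plus any two other words. For example, Aardvark Taxidermy for Personal Wealth by Peter Smith would match The Wealth of Nations by Adam Smith (matching on by, Wealth, and Smith). The first book (I hope) does not exist at all, but if it did, it would be completely unrelated to the economics text.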
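To make the effect of the by entry concrete, here is a small Python sketch (not the Perl script itself) that builds the word sets for those two titles with and without by in the stop list. The two stop lists are abbreviated, hypothetical stand-ins for the script's much longer list, and word_set is an illustrative helper name.

```python
import re
import string

# ASCII punctuation only; an approximation of [[:punct:]].
PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")

def word_set(title, stop_words):
    """Lowercase a title, strip punctuation, and drop the stop words."""
    words = PUNCT.sub("", title).lower().split()
    return {w for w in words if w not in stop_words}

a = "Aardvark Taxidermy for Personal Wealth by Peter Smith"
b = "The Wealth of Nations by Adam Smith"

# Abbreviated, hypothetical stop lists (assumption for this example).
with_by = {"the", "of", "for", "by"}
without_by = {"the", "of", "for"}

shared_without = word_set(a, without_by) & word_set(b, without_by)
shared_with = word_set(a, with_by) & word_set(b, with_by)

print(sorted(shared_without))  # three shared words: reaches the threshold of 3
print(sorted(shared_with))     # two shared words: below the threshold
```

With by left in the sets the two unrelated titles share three words and would be reported as similar; with by filtered out they share only two.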

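Note: this script stores the entire input in memory, along with the associated word set for each input line. On modern systems with a few GiB of free RAM this is unlikely to be a problem unless the input is extremely large.

Note 2: Set::Tiny is packaged for Debian as libset-tiny-perl. It may be pre-packaged for other distros too. Otherwise, you can get it from CPAN.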

#!/usr/bin/perl -w

use strict;
use Set::Tiny;

# a partial list of common articles, prepositions and small words joined into
# a regex.
my $sw = join("|", qw(
  a about after against all among an and around as at be before between both
  but by can do down during first for from go have he her him how
  I if in into is it its last like me my new of off old
  on or out over she so such that the their there they this through to
  too under up we what when where with without you your)
);

my %sets=();    # word sets for each title.
my %titles=();  # count of how many times we see the same title.

while(<>) {
  chomp;
  # take a copy of the original input line, so we can use it as
  # a key for the hashes later.
  my $orig = $_;

  # "simplify" the input line
  s/[[:punct:]]//g;  #/ strip punctuation characters
  s/^\s*|\s*$//g;    #/ strip leading and trailing spaces
  $_=lc;             #/ lowercase everything, case is not important.
  s/\b($sw)\b//iog;  #/ optional. strip small words
  next if (/^$/);

  $sets{$orig} = Set::Tiny->new(split);
  $titles{$orig}++;
};

my @keys = (sort keys %sets);

foreach my $title (@keys) {
  next unless ($titles{$title} > 0);

  # if we have any exact dupes, print them. and make sure they won't
  # be printed again.
  if ($titles{$title} > 1) {
    print "$title\n" x $titles{$title};
    $titles{$title}  = 0;
  };

  foreach my $key (@keys) {
    next unless ($titles{$key} > 0);
    next if ($key eq $title);

    my $intersect = $sets{$key}->intersection($sets{$title});
    my $k=scalar keys %{ $intersect };

    #print STDERR "====>$k(" . join(",",sort keys %{ $intersect }) . "):$title:$key\n" if ($k > 1);

    if ($k >= 3) {
      print "$title\n" if ($titles{$title} > 0);
      print "$key\n" x $titles{$key};
      $titles{$key}   = 0;
      $titles{$title} = 0;
    };
  };
};

Save it as, e.g., blueray.pl, and make it executable with chmod +x.

Given the new sample input, it produces the following output:

$ ./blueray.pl TestData.txt 
7L: The Seven Levels of Communication
The Seven Levels of Communication: Go From Relationships to Referrals by Michael J. Maher
A History of Money and Banking in the United States: The Colonial Era to World War II
The History of Banking: The History of Banking and How the World of Finance Became What it is Today
America's Bank: The Epic Struggle to Create the Federal Reserve
America's Money Machine: The Story of the Federal Reserve
Freakonomics: A Rogue Economist
Freakonomics: A Rogue Economist Explores the Hidden
Freakonomics: A Rogue Economist Explores the Hidden
Freakonomics: A Rogue Economist Explores the Hidden Side of Everything by Steven Levitt
Money Master the Game by Tony Robbinson
Money Master the Game by Tony Robbinson
Money Master the Game by Tony Robbinson
The Federal Reserve and its Founders: Money, Politics, and Power
The Power and Independence of the Federal Reserve
Venture Deals by Brad Feld
Venture Deals by Brad Feld & Jason Mendelson
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values

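This is not exactly the same as your sample output. Because it checks for words the titles have in common and ignores their exact order, it is more likely to report false positives and less likely to miss matches it should not miss (false negatives).

If you want to experiment with it, or just see which words it is matching (or nearly matching) on, you can uncomment the #print STDERR line.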
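The difference between the two criteria can be seen on one pair of titles from the output above. The following Python sketch (illustrative helper names and an abbreviated, assumed stop list; it is not a port of either script) reports the words two titles share and the runs of three consecutive words they share.

```python
import re
import string

# ASCII punctuation only; an approximation of [[:punct:]].
PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")

def words(title):
    """Lowercase a title, strip punctuation, and split it into words."""
    return PUNCT.sub("", title).lower().split()

def shared_words(a, b, stop_words=frozenset()):
    """Words present in both titles, ignoring order and the stop words."""
    return (set(words(a)) & set(words(b))) - set(stop_words)

def shared_trigrams(a, b):
    """Runs of three consecutive words present in both titles."""
    def tri(title):
        w = words(title)
        return {tuple(w[i:i + 3]) for i in range(len(w) - 2)}
    return tri(a) & tri(b)

a = ("A History of Money and Banking in the United States: "
     "The Colonial Era to World War II")
b = ("The History of Banking: The History of Banking and How "
     "the World of Finance Became What it is Today")

# Abbreviated subset of the script's small-word list (assumption).
stop = {"a", "the", "of", "and", "in", "to", "how", "what", "it", "is"}

print(sorted(shared_words(a, b, stop)))
print(shared_trigrams(a, b))
```

The pair shares three significant words, so the set-intersection test groups the two titles, but it has no run of three consecutive words in common, so the consecutive-word test would not.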
