Finding the n most common words in a file from the command line, using a stop word list

Question 1

Consider this test file:

$ cat text.txt
this file has "many" words, some
with punctuation.  some repeat,
many do not.

To get the word counts:

$ grep -oE '[[:alpha:]]+' text.txt | sort | uniq -c | sort -nr
      2 some
      2 many
      1 words
      1 with
      1 this
      1 repeat
      1 punctuation
      1 not
      1 has
      1 file
      1 do

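How it works

grep -oE '[[:alpha:]]+' text.txt

This returns all the words, minus any whitespace or punctuation, one word per line.

sort

This sorts the words alphabetically.

uniq -c

This counts how many times each word occurs. (uniq only merges adjacent identical lines, so its input must be sorted for the counts to be correct.)

sort -nr

This sorts the output numerically in reverse order, so that the most common words are at the top.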
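
The requirement that uniq receive sorted input can be checked with a small sketch. The helper names count_unsorted and count_sorted are hypothetical, introduced only for illustration. Without the sort, the two occurrences of some are not adjacent and are reported as two separate entries.

```shell
# Hypothetical helpers for illustration; not part of the original answer.

# Count adjacent duplicate lines only (no sort first).
count_unsorted() {
    uniq -c
}

# Sort first so identical lines become adjacent, then count.
count_sorted() {
    sort | uniq -c
}

# "some" appears twice but not on adjacent lines.
printf 'some\nmany\nsome\n' | count_unsorted
printf 'some\nmany\nsome\n' | count_sorted
```

The first call reports some twice with a count of 1 each; the second reports it once with a count of 2.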

Handling mixed case

Consider this mixed-case test file:

$ cat Text.txt
This file has "many" words, some
with punctuation.  Some repeat,
many do not.

If we want to count some and Some as the same word:

$ grep -oE '[[:alpha:]]+' Text.txt | sort -f | uniq -ic | sort -nr
      2 some
      2 many
      1 words
      1 with
      1 This
      1 repeat
      1 punctuation
      1 not
      1 has
      1 file
      1 do

Here we added the -f option to sort so that it ignores case, and the -i option to uniq so that it ignores case as well.
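
With sort -f and uniq -i, the spelling printed for a merged word is whichever variant uniq sees first, which can depend on the locale's sort order. An alternative sketch, not taken from the original answer, folds everything to lowercase up front with tr so that the output is uniform. The function name count_words_lower is hypothetical.

```shell
# Hypothetical helper: count words case-insensitively by lowercasing first.
# Reads text on stdin, prints "count word" lines, most frequent first.
count_words_lower() {
    grep -oE '[[:alpha:]]+' |
        tr '[:upper:]' '[:lower:]' |
        sort |
        uniq -c |
        sort -nr
}

# Same text as the mixed-case test file above.
printf 'This file has "many" words, some\nwith punctuation.  Some repeat,\nmany do not.\n' |
    count_words_lower
```

Because the case is already normalized, plain sort and uniq -c are enough and every word is printed in lowercase.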

Excluding stop words

Suppose we want to exclude these stop words from the count:

$ cat stopwords 
with
not
has
do

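So we add a grep -v stage to eliminate those words (-v inverts the match, -w matches whole words only, -F treats the patterns as fixed strings, and -f stopwords reads the patterns from the file):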

$ grep -oE '[[:alpha:]]+' Text.txt | grep -vwFf stopwords | sort -f | uniq -ic | sort -nr
      2 some
      2 many
      1 words
      1 This
      1 repeat
      1 punctuation
      1 file
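
The pipelines above list every word. To keep only the n most common, as the title asks, one option is to finish with head. This is a minimal sketch, assuming a hypothetical function name top_words that takes n, the text file, and the stop word file as arguments. It also adds -i to the stop word grep so that a capitalized stop word such as With is excluded too, which the original command does not do.

```shell
# Hypothetical helper: top_words N TEXTFILE STOPFILE
# Prints the N most common words in TEXTFILE, ignoring case and
# skipping any word listed (one per line) in STOPFILE.
top_words() {
    n=$1 text=$2 stop=$3
    grep -oE '[[:alpha:]]+' "$text" |
        grep -viwFf "$stop" |
        sort -f |
        uniq -ic |
        sort -nr |
        head -n "$n"
}

# Example, assuming Text.txt and stopwords exist as shown above:
# top_words 2 Text.txt stopwords
```

Words with equal counts come out in whatever order sort -nr leaves them, so the cut made by head can be arbitrary among ties.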

Question 2

Command:

cat text.txt | tr ' ' '\n' | grep -v 'word1\|word2' | sort | uniq -c | sort -nk1

How this works

Here are the file contents:

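$ cat file.txt

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

$ cat file.txt | tr ' ' '\n' | grep -v -w 'an\|a\|is' | sort | uniq -c | sort -nk1 | tail
      1 unknown
      1 when
      2 and
      2 dummy
      2 Ipsum
      2 Lorem
      2 of
      2 text
      2 type
      3 the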

Description: translate spaces into newlines, then remove the listed words, then sort and count to find the most common words.
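
Because tr ' ' '\n' splits only on spaces, punctuation stays attached to the words, so book. and book would be counted separately. A possible variation, offered as a sketch rather than as the answerer's method, splits on every run of non-letters with tr -cs instead. The function name count_words_tr is hypothetical, and the stop words an, a, is are the ones used in the example above.

```shell
# Hypothetical helper: split stdin into words on any run of non-letters,
# drop the stop words an, a, is, and print counts in ascending order.
count_words_tr() {
    tr -cs '[:alpha:]' '\n' |
        grep . |                              # drop the empty line left by leading punctuation
        grep -v -w -e 'an' -e 'a' -e 'is' |   # one -e per stop word avoids the \| extension
        sort |
        uniq -c |
        sort -nk1
}

printf 'a book is a book. An old book, a type.\n' | count_words_tr | tail
```

This variation also splits words at apostrophes, so industry's becomes industry and s. The stop word match is case-sensitive, as in the original command, so An is kept.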
