Mining plain text

2024-7-19

search grep plaintext

(Original title: grep by paragraph rather than by line)

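The motivation for this question is fzf, which lets me find a file in a huge file system fuzzily and incrementally, with a very fast search experience (see the many lovely gifs in this article).

I want to do something similar with my notes. I have a lot of ad hoc notes, journals, memos, and so on in plain-text format. For readability, each line is at most 72 characters long. As far as I know the existing search tools (grep, ripgrep, etc.), this hard wrapping makes my notes difficult to search, because a match is found line by line.

Now, you can show more or less context around the matched pattern, but that is not what I want. Here is an example to make this more precise.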

1  Victim mentality is an acquired personality trait in which a person
2  tends to recognize or consider themselves as a victim of the negative
3  actions of others, and to behave as if this were the case in the face
4  of contrary evidence of such circumstances. Victim mentality depends
5  on clear thought processes and attribution.
6
7  (from wikipedia: Victim mentality)

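Suppose I wrote this note half a year ago and I know it is somewhere in my file system. As usual, we cannot recite the exact words, but we remember the context! Grepping my file system for text such as personality, clear thought, or victim would probably return too many related hits for me to really narrow things down.

There should be a tool (whether it exists yet or not) that helps search text like this. Our old plain-text notes would become more valuable. Is there a way to do this with our old friend grep and its relatives? Or are there other feasible approaches? Any comments are much appreciated.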

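Answer 1

Let us break the (search) process into smaller parts.

First, we need the list of files to search, for example everything in the current directory (.) with a txt extension (-name "*.txt") that is a regular file (-type f):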

find . -name "*.txt" -type f

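This result can be used as input to grep for something inside those files, including the line number and file name in the output and ignoring case (-nHi). The final + makes find pass many files to each grep invocation (rather than running grep once per file):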

find . -name "*.txt" -type f -exec grep -nHi 'something' {} +

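Because of the +, find packs as many file names into each grep invocation as the system's argument-size limit (ARG_MAX) allows and starts further invocations if needed, so the command also works for very large numbers of files. Replacing + with \; would instead run grep once per file, which is slower.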

The output of the previous command looks something like this:

./some/dir/somewhere/songs.txt:128:But had me believing it was always something that I'd done
./some/dir/somewhere/songs.txt:883:Was never something I pursued
./some/dir/somewhere/songs.txt:2905:I know something about love 
./some/dir/somewhere/songs_other.txt:11780:will come across something like this:  F (Dshape).

So if you split these lines on :, you get 3 parts: the file name, the line number where the match was found, and the line itself.
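The split on : might be sketched in shell like this. The name split_match is a hypothetical helper, and the sketch assumes that file names contain no colon; colons inside the matched text are kept because the last variable receives the rest of the line.

```shell
# Split one line of `grep -nH` output into file name, line number, and text.
# Assumption of this sketch: the file name itself contains no ":".
split_match() {
    # IFS=: makes read split on colons; the last variable ("text") gets
    # whatever remains, so colons inside the matched line are preserved.
    IFS=: read -r file lineno text <<EOF
$1
EOF
    printf 'file=%s\nline=%s\ntext=%s\n' "$file" "$lineno" "$text"
}

split_match './some/dir/somewhere/songs.txt:883:Was never something I pursued'
```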

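Now, if you keep this information for every matching file, you can search for the next term and sum the distances between the matches to find the files in which the search terms are closest together.

For the example text, if you search for 3 terms (personality, clear thought, victim), you get the corresponding line numbers 1, 5, and 2, so the distance for this file (measured from the first term) is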

abs(1-5) + abs(1-2) = 5

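So you can rank files by whether they contain all the terms and by how close together the terms are within each file.

Of course, this is not the full picture: for example, some files contain the same term more than once, and the algorithm then has to decide how to compute the distance. Still, I think the above is a start.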
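One way the distance idea could be sketched in shell is shown here. This is only an illustration under stated assumptions: term_distance and rank_files are hypothetical helper names, terms are matched as case-sensitive fixed strings, and only the first match of each term is used, which is one arbitrary answer to the "same term more than once" question.

```shell
# Print the summed line distance between the first term and every other
# term in one file, using the first match of each term.
# Returns status 1 if any term is missing from the file.
term_distance() {
    file=$1; shift
    first=; total=0
    for term in "$@"; do
        # grep -n prefixes each match with "N:"; keep the first line number.
        n=$(grep -n -F -- "$term" "$file" | head -n 1 | cut -d: -f1)
        [ -n "$n" ] || return 1
        if [ -z "$first" ]; then
            first=$n
        else
            d=$((first - n))
            if [ "$d" -lt 0 ]; then d=$((0 - d)); fi   # absolute value
            total=$((total + d))
        fi
    done
    printf '%s\n' "$total"
}

# List the .txt files under a directory that contain all terms,
# smallest distance first. File names containing newlines are not handled.
rank_files() {
    dir=$1; shift
    find "$dir" -name "*.txt" -type f | while IFS= read -r f; do
        d=$(term_distance "$f" "$@") && printf '%s\t%s\n' "$d" "$f"
    done | sort -n
}
```

For example, rank_files . personality 'clear thought' victim would print one "distance, tab, file name" line per file that contains all three terms. For the sample note (stored without the line-number prefixes), the distance is 5, as computed above.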

Answer 2

A simple Perl one-liner does the job. If all keywords (that is, personality and clear thought and victim) are present in a file, the following code prints the file name followed by "found".

perl -0777 -ane 'print "$ARGV: found\n" if /^(?=.*personality)(?=.*clear thought)(?=.*victim)/s' file.txt

Output:

file.txt: found

Explanation:

-0777       # slurp mode
-ane        # read the file and execute the following
print "$ARGV: found\n"      # print,$ARGV contains the current filename
if                          # if
  /                         # regex delimiter
    ^                       # beginning of file
      (?=.*personality)     # positive lookahead, make sure we have "personality"
      (?=.*clear thought)   # positive lookahead, make sure we have "clear thought"
      (?=.*victim)          # positive lookahead, make sure we have "victim"
  /s                        # regex delimiter, s = dot matches newline

If you want to search all txt files, use perl ...... *.txt
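The one-liner above tests a whole file at once. For the paragraph-wise matching that the original title asks about, a possible variant is awk's paragraph mode, where an empty RS makes every blank-line-separated block one record. The following is a sketch and not part of the answer: para_grep is a hypothetical helper name, matching is case-insensitive on fixed strings, and line wraps inside a paragraph are folded into single spaces so that a phrase such as clear thought still matches when it is broken across two lines.

```shell
# Print every blank-line-separated paragraph of a file that contains all
# of the given terms. Usage: para_grep FILE TERM...
para_grep() {
    file=$1; shift
    # Hand the terms to awk through the environment, one per line,
    # so that backslashes in a term are not interpreted by awk -v.
    TERMS=$(printf '%s\n' "$@") awk '
        BEGIN {
            RS = ""                                  # paragraph mode
            n = split(ENVIRON["TERMS"], term, "\n")
        }
        {
            # Lowercase the paragraph and fold whitespace runs (including
            # line wraps) into single spaces before searching.
            para = tolower($0)
            gsub(/[ \t\n]+/, " ", para)
            for (i = 1; i <= n; i++)
                if (index(para, tolower(term[i])) == 0) next
            printf "%s: paragraph %d\n%s\n\n", FILENAME, NR, $0
        }
    ' "$file"
}
```

Calling para_grep notes.txt personality 'clear thought' victim (notes.txt being a placeholder file name) prints a header line with the file name and paragraph number, followed by the matching paragraph itself.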
