Modifying grep's search results

Question

When you convert a PDF file with pdftotext, the meta information is lost. However, pdftotext has an interesting option:

-htmlmeta
       Generate a simple HTML file, including the meta information.  This simply wraps the 
       text in <pre> and </pre> and  prepends the meta headers.

Now you can grep the meta information as well:

pdftotext -htmlmeta file.pdf - | \
  grep -oP '.*keyword.*|<title>\K.*(?=</title>)|<meta name="Author" content="\K.*(?="/>)'

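This searches the PDF file for keyword. The | alternations then pull two more search patterns out of the document: its title and its author. With -o, grep prints only the matched text, and \K drops everything matched before it, so the surrounding tags are not printed. The result looks like this: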

title of the document
author of the document
search pattern
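
To try the three alternatives without a PDF at hand, here is a minimal sketch that runs the same expression over hand-written text. The function name extract_meta and the sample text are assumptions made for illustration; real pdftotext -htmlmeta output contains more header lines. It also assumes a grep built with -P (PCRE) support, such as GNU grep.

```shell
# Hypothetical helper: reads pdftotext -htmlmeta output on stdin and prints
# the title, the author, and every line that contains the keyword.
# The keyword is inserted into the regex as-is, so regex metacharacters
# in it keep their special meaning.
extract_meta() {
  keyword=$1
  grep -oP ".*${keyword}.*|<title>\K.*(?=</title>)|<meta name=\"Author\" content=\"\K.*(?=\"/>)"
}

# Hand-written sample standing in for real converter output (an assumption).
sample='<html>
<head>
<title>title of the document</title>
<meta name="Author" content="author of the document"/>
</head>
<body>
<pre>
bla bla search pattern bla bla
</pre>
</body>
</html>'

printf '%s\n' "$sample" | extract_meta 'search pattern'
```

Because the first alternative is wrapped in .* on both sides, the keyword match is printed as the whole line that contains it. For a real file the call would be: pdftotext -htmlmeta file.pdf - | extract_meta keyword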

Or use perl, which, unlike grep, can format the text after matching:

pdftotext -htmlmeta file.pdf - | perl -ne '/keyword/ && print "Pattern: $_"; /<title>(.*)<\/title>/ && print "Title: $1\n"; /<meta name="Author" content="([^"]+)/ && print "Author: $1\n"'

The output looks like this:

Title: title of the document
Author: author of the document
Pattern: bla bla search pattern bla bla
