Search for a string in HTML files and output the string and the file's <title>.*</title> tag

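I want to search all lines of HTML files for a string. If the string is found, both the string and the file's <title>.*</title> tag should be output. The search must be performed recursively.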

A file's <title>.*</title> tag should be output only if that file contains the searched string.

Example input file 1:

<!DOCTYPE html>
<html>
  <head>
    <title>Title of website 1</title>
  </head>
  <body>
the character string to be found
  </body>
</html>

Example input file 2:

<!DOCTYPE html>
<html>
  <head>
    <title>Title of website 2</title>
  </head>
  <body>
the character string that shall NOT be found
  </body>
</html>

Example input file 3:

<!DOCTYPE html>
<html>
  <head>
    <title>Title of website 3</title>
  </head>
  <body>
the character string to be found
  </body>
</html>

Example output:

<title>Title of website 1</title>
the character string to be found
<title>Title of website 3</title>
the character string to be found

Thank you very much for your help!

Answer 1

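Since you are dealing with HTML documents, I'd suggest using a tool that performs structured document queries rather than treating the input as plain text. For example, assuming somedir contains the sample documents above: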

$ ls somedir
file1.html  file2.html  file3.html

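Following this answer, you can use xmllint with its --html parser switch to find all nodes whose body element contains "the character string to be found", and output the title element of their head: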

$ find somedir/ -name '*.html' -exec xmllint --html --xpath '
    //*[body[contains(.,"the character string to be found")]]/head/title
  ' {} \;
<title>Title of website 3</title>
XPath set is empty
<title>Title of website 1</title>

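Note that for files where the XPath query does not match, xmllint prints a message to the standard error stream and also exits with a non-zero status. You can discard the former and use the latter to print the search string conditionally: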

$ find somedir/ -name '*.html' -exec xmllint --html --xpath '
    //*[body[contains(.,"the character string to be found")]]/head/title
  ' {} 2>/dev/null \; -printf 'the character string to be found\n'
<title>Title of website 3</title>
the character string to be found
<title>Title of website 1</title>
the character string to be found
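
The search string appears twice in the command above, so it may be convenient to store it once. The following is a minimal sketch, not part of the original answer: the variable names needle and xpath are hypothetical, it assumes the search string contains no double quotes (XPath 1.0 string literals cannot escape them), and it swaps the GNU-specific -printf for a second -exec that calls printf. The redirection is placed at the end of the command; the shell applies it to the whole find invocation wherever it is written.

```shell
# Hypothetical variable names; the search string is defined once and reused.
needle='the character string to be found'

# Assumption: $needle contains no double quotes, since it is embedded
# in a double-quoted XPath 1.0 string literal.
xpath="//*[body[contains(.,\"$needle\")]]/head/title"

# Run only if the example directory from the answer exists.
if [ -d somedir ]; then
  # The second -exec runs only when xmllint exits with status 0,
  # that is, when the XPath query matched something in the file.
  # Note: 2>/dev/null also hides genuine error messages from find and xmllint.
  find somedir/ -name '*.html' \
    -exec xmllint --html --xpath "$xpath" {} \; \
    -exec printf '%s\n' "$needle" \; 2>/dev/null
fi
```

Using printf with a '%s\n' format also avoids surprises if the search string contains % or backslash characters, which find's -printf would interpret as format directives.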

Or, if you want to print the actual body text in which the string was found, you can conditionally execute a second query:

$ find somedir/ -name '*.html' -exec xmllint --html --xpath '
    //*[body[contains(.,"the character string to be found")]]/head/title
  ' {} 2>/dev/null \; -exec xmllint --html --xpath '
    //*[body[contains(.,"the character string to be found")]]/body/text()
  ' {} \;
<title>Title of website 3</title>

the character string to be found
  
<title>Title of website 1</title>

the character string to be found
  

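(Note that the newlines in the body text are not stripped.) If you have a tool that supports XPath 2.0 or later, such as Xidel, you can combine the matched elements in a single call using the concat() function (although I have not found a way to make it output one element as an HTML tag and the other as plain text):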

$ find somedir/ -name '*.html' -exec ./xidel --silent --xpath '
    //*[body[contains(.,"the character string to be found")]]/concat(./head/title, codepoints-to-string(10), ./body)
  ' {} \;
Title of website 3
the character string to be found
Title of website 1
the character string to be found
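
The two-query xmllint variant shown earlier leaves empty and whitespace-only lines around the body text. If those are unwanted, one option is to filter the output through sed. This is only a sketch under stated assumptions: the sample text is copied from the output above, the variable names raw_output and cleaned are hypothetical, and the filter would also drop any blank lines that are a genuine part of the body text.

```shell
# Sample of the raw output from the two-query xmllint command
# (title line, blank line, body text, whitespace-only line).
raw_output='<title>Title of website 3</title>

the character string to be found
  '

# Delete every line that is empty or consists only of whitespace.
cleaned=$(printf '%s\n' "$raw_output" | sed '/^[[:space:]]*$/d')

printf '%s\n' "$cleaned"
```

In practice the same filter could be appended to the find command itself, as in find ... | sed '/^[[:space:]]*$/d'.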
