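I want to search all lines of HTML files for a string. If the string is found, the string and the file's <title>.*</title> tag should be output. The search must run recursively.
The <title>.*</title> tag of a file is output only if that file contains the search string.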
Example input file 1:
<!DOCTYPE html>
<html>
<head>
<title>Title of website 1</title>
</head>
<body>
the character string to be found
</body>
</html>
Example input file 2:
<!DOCTYPE html>
<html>
<head>
<title>Title of website 2</title>
</head>
<body>
the character string that shall NOT be found
</body>
</html>
Example input file 3:
<!DOCTYPE html>
<html>
<head>
<title>Title of website 3</title>
</head>
<body>
the character string to be found
</body>
</html>
Example output:
<title>Title of website 1</title>
the character string to be found
<title>Title of website 3</title>
the character string to be found
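Thank you very much for your help!
Answer 1
Since you are dealing with HTML documents, I would suggest using a tool that performs structured document queries rather than treating the input as plain text. For example, assuming somedir contains the example documents above: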
$ ls somedir
file1.html file2.html file3.html
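Then, following this answer, you can use xmllint with its --html parser switch to find all nodes whose body element contains the string "the character string to be found", and output the title element of their head: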
$ find somedir/ -name '*.html' -exec xmllint --html --xpath '
//*[body[contains(.,"the character string to be found")]]/head/title
' {} \;
<title>Title of website 3</title>
XPath set is empty
<title>Title of website 1</title>
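Note that for files where the XPath query does not match, xmllint prints a message to the standard error stream, but it also exits with a non-zero status. You can discard the former and use the latter to print the search string conditionally: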
$ find somedir/ -name '*.html' -exec xmllint --html --xpath '
//*[body[contains(.,"the character string to be found")]]/head/title
' {} 2>/dev/null \; -printf 'the character string to be found\n'
<title>Title of website 3</title>
the character string to be found
<title>Title of website 1</title>
the character string to be found
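The conditional printing above works because find treats -exec as a test that is true only when the command exits with status 0, so any action that follows it runs only for those files. A minimal sketch of that mechanism on its own, assuming a throwaway directory with made-up file names and using grep -q as a stand-in for xmllint (neither is part of the commands above):

```shell
# Scratch directory with two files; names and contents are illustrative only.
tmpdir=$(mktemp -d)
printf '%s\n' '<title>A</title>' 'needle' > "$tmpdir/a.html"
printf '%s\n' '<title>B</title>' 'hay' > "$tmpdir/b.html"

# grep -q exits 0 only when the pattern matches, so -print is reached
# only for a.html; b.html makes -exec evaluate to false and is skipped.
matches=$(find "$tmpdir" -name '*.html' -exec grep -q 'needle' {} \; -print)
printf '%s\n' "$matches"

rm -rf "$tmpdir"
```

The same short-circuit applies when the command is xmllint: a non-matching file gives a non-zero exit status, and the -printf that follows is not executed for it.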
Or, if you want to print the actual body text in which the string was found, you can conditionally execute a second query:
$ find somedir/ -name '*.html' -exec xmllint --html --xpath '
//*[body[contains(.,"the character string to be found")]]/head/title
' {} 2>/dev/null \; -exec xmllint --html --xpath '
//*[body[contains(.,"the character string to be found")]]/body/text()
' {} \;
<title>Title of website 3</title>
the character string to be found
<title>Title of website 1</title>
the character string to be found
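(Note that newlines in the body text are not stripped.) If you have a tool that supports XPath 2.0 or later, such as Xidel, you can use the concat() function to combine the matched elements in a single invocation (although I have not found a way to make it output one element as an HTML tag and the other as plain text):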
$ find somedir/ -name '*.html' -exec ./xidel --silent --xpath '
//*[body[contains(.,"the character string to be found")]]/concat(./head/title, codepoints-to-string(10), ./body)
' {} \;
Title of website 3
the character string to be found
Title of website 1
the character string to be found
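The xmllint commands earlier repeat the search string in every query and in -printf. One way to state it once is to pass it to a small inline sh script run by find. This is only a sketch under some assumptions: the variable names needle and dir are hypothetical, the scratch files merely mimic the example documents, and the search string must not contain double quotes because it is spliced directly into the XPath expression.

```shell
needle='the character string to be found'   # assumed to contain no double quotes

# Scratch copies of two of the example documents (hypothetical location).
dir=$(mktemp -d)
cat > "$dir/file1.html" <<'EOF'
<!DOCTYPE html>
<html>
<head>
<title>Title of website 1</title>
</head>
<body>
the character string to be found
</body>
</html>
EOF
sed -e 's/website 1/website 2/' -e 's/to be found/that shall NOT be found/' \
    "$dir/file1.html" > "$dir/file2.html"

# For each file, run the title query; print the title and the search string
# only when xmllint exits with status 0, that is, when the query matched.
result=$(find "$dir" -name '*.html' -exec sh -c '
  needle=$1; shift
  for f do
    if title=$(xmllint --html --xpath "//*[body[contains(.,\"$needle\")]]/head/title" "$f" 2>/dev/null); then
      printf "%s\n%s\n" "$title" "$needle"
    fi
  done
' sh "$needle" {} +)
printf '%s\n' "$result"

rm -rf "$dir"
```

Using `{} +` starts the inline script once for many files instead of once per file; xmllint itself is still run once per file.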