Match lines and extract content in one go

Answer 1

Using an XML parser is indeed a good idea, but if for some reason you cannot use one (the file is malformed, no parser is installed, etc.), you can use Perl for this:

$ perl -ne 'if(/<h2>(.*?)<\/h2><p>(.*?)<\/p>/){print "$1\t$2\n"}' filename.ext
Hello   World
Bells   Walls
Jelly   Minus

I prefer lazy (non-greedy) matching so that I don't get unexpected results:

test.txt

<h1>Nothing</h1>
<h2>Hello</h2><p>World</p><h2>Goodbye</h2><p>Earth</p>
<h2>Bells</h2><p>Walls</p>
<h2>Jelly</h2><p>Minus</p>
<h3>Zip</h3>

$ perl -ne 'if(/<h2>(.*?)<\/h2><p>(.*?)<\/p>/){print "$1\t$2\n"}' test.txt
Hello   World
Bells   Walls
Jelly   Minus
$ perl -ne 'if(/<h2>(.*)<\/h2><p>(.*)<\/p>/){print "$1\t$2\n"}' test.txt
Hello</h2><p>World</p><h2>Goodbye       Earth
Bells   Walls
Jelly   Minus

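As you can see, a regex alone cannot cover every case that a domain-specific tool handles. If you are fine with that, no problem; just be aware that if the input does not match your pattern exactly, you may get inaccurate results!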

Answer 2

The right way is to use the xmlstarlet tool (made for parsing XML/HTML data):

xmlstarlet sel -t -m '//h2' -v 'concat(., "'$'\t''", ./following-sibling::p)' -n file

Output:

Hello   World
Bells   Walls
Jelly   Minus
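The command above expects a well-formed document. If the file is a bare fragment with several top-level elements, like the test.txt sample in the first answer, an XML parser will normally reject it. A rough sketch of a workaround, under the assumption that wrapping the content in a single root element is enough to make it parse: the function name h2_pairs and the <root> element are invented for this illustration and are not part of xmlstarlet.

```shell
# Hypothetical helper: wraps the fragment read from stdin in a made-up <root>
# element so that xmlstarlet receives a single well-formed document.
h2_pairs() {
  # Literal tab character, spliced into the XPath string below.
  tab=$(printf '\t')
  { printf '<root>\n'; cat; printf '</root>\n'; } |
    xmlstarlet sel -t -m '//h2' \
      -v "concat(., '$tab', following-sibling::p)" -n
}
```

It would be called as h2_pairs < file. Using printf to build the tab avoids the $'\t' quoting, which not every shell supports. Inside concat(), following-sibling::p is converted to the text of the first matching <p>, so each <h2> is paired with the paragraph that follows it.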

Answer 3

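The regex you are using contains unescaped ( and ), which requires the extended regex syntax (or replacing each ( and ) with \( and \)). That part is simple.

You will also probably want to avoid overly greedy matches by using [^<] instead of a dot.

Of course, you can set a variable and play with the quoting to do it with sed alone:

$ a='<h2>([^<]*)<\/h2><p>([^<]*)<\/p>'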
$ sed -nE '/'"$a"'/s/'"$a"'/\1 \2/p' infile

But it gets better, because this can be simplified. Sed remembers the last regex it used, so an empty left-hand side (s//) is enough.

$ sed -nE '/'"$a"'/s//\1 \2/p' infile

Or, without a variable:

$ sed -nE '/<h2>([^<]*)<\/h2><p>([^<]*)<\/p>/s//\1 \2/p' infile
Hello World
Bells Walls
Jelly Minus
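For comparison, here is a small sketch of the other option mentioned at the start of this answer: dropping -E and escaping the parentheses instead. The sample input and the variable names ere and bre are assumptions made for the illustration.

```shell
# Sample input: one <h2>/<p> pair per line, plus two lines to be ignored.
input='<h1>Nothing</h1>
<h2>Hello</h2><p>World</p>
<h2>Bells</h2><p>Walls</p>
<h2>Jelly</h2><p>Minus</p>
<h3>Zip</h3>'

# Extended syntax (-E): bare parentheses create capture groups.
ere=$(printf '%s\n' "$input" |
  sed -nE '/<h2>([^<]*)<\/h2><p>([^<]*)<\/p>/s//\1 \2/p')

# Basic syntax (no -E): the same groups are written \( and \).
bre=$(printf '%s\n' "$input" |
  sed -n '/<h2>\([^<]*\)<\/h2><p>\([^<]*\)<\/p>/s//\1 \2/p')

printf '%s\n' "$bre"
```

Both commands should give the same three lines. Note that s// without the g flag rewrites only the first pair on a line, so a line holding two pairs would keep its second pair as raw markup.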

Answer 4

A possible solution using sed:

sed 's/<[^13>]*>/ /g' test | sed 's/<h[13]>.*<\/h[13]>//'

 Hello  World
 Bells  Walls
 Jelly  Minus

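The second sed simply removes the unneeded elements (<h1> or <h3>).

Pattern explanation:

/<[^13>]*>/ / - matches any run of characters that starts with < and ends with >, and replaces it with a space. The characters 1 and 3 must not (^) appear between the brackets, so the <h1> and <h3> tags are left for the second sed.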
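One limitation of [^13>] is that it also skips any other tag containing a 1 or a 3, for example a tag with an attribute value such as page1. A hedged alternative, which is a different technique from the pipeline above, is to delete the <h1>/<h3> lines first and then strip every remaining tag in a single sed call. This sketch assumes each unwanted element sits on a line of its own; the variable names are made up for the illustration.

```shell
# Sample input: one <h2>/<p> pair per line, plus two lines to be dropped.
input='<h1>Nothing</h1>
<h2>Hello</h2><p>World</p>
<h2>Bells</h2><p>Walls</p>
<h2>Jelly</h2><p>Minus</p>
<h3>Zip</h3>'

# 1. Delete whole lines that contain an <h1> or <h3> opening tag.
# 2. Replace every remaining tag with a single space.
# 3. Trim the spaces left at the start and end of each line.
cleaned=$(printf '%s\n' "$input" |
  sed -e '/<h[13]>/d' \
      -e 's/<[^>]*>/ /g' \
      -e 's/^ *//' -e 's/ *$//')

printf '%s\n' "$cleaned"
```

Because the unwanted lines are deleted outright, no empty lines are left behind. The two words on each line end up separated by two spaces, one for each removed tag between them.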
