sed regex fails to capture the entire paragraph containing a pattern

Question 1

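The reason this doesn't work as you expect is that < and > don't need to be escaped in regular expressions; they have no special meaning. However, \< and \> do have a special meaning in GNU regular expressions (including the extended ones enabled with -E): they match at word boundaries. \< matches the beginning of a word and \> the end. So \<(This doesn't actually match the <; it matches the beginning of the word This. The same goes for the \> at the end. The GNU sed manual has an example that is almost exactly what you're after: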

$ sed -En '/./{H;1h;$!d} ; x; s/(<This.*2020.*?>)/\1/p;' file
<This is a line of text with a year=2020 month=12 in it
This line of text does not have a year or month in it
This year=2021 is the current year the current month=1
This is the year=2021 the month=2/>
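To see the word-boundary behaviour described above in isolation, here is a minimal sketch. It assumes GNU sed (\< is a GNU extension) and uses a made-up one-line input rather than the original file:

```shell
# Hypothetical input line, used only for illustration.
line='<This is text>'

# \< is a zero-width word-start assertion, so the literal "<" is left untouched.
boundary=$(printf '%s\n' "$line" | sed -E 's/\<This/X/')

# An unescaped < is an ordinary character and is consumed by the match.
literal=$(printf '%s\n' "$line" | sed -E 's/<This/X/')

printf '%s\n%s\n' "$boundary" "$literal"
```

The first substitution leaves the leading < in place, while the second replaces it together with the word.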

I find sed particularly ill-suited to this sort of task. I would use perl:

$ perl -000 -ne 'chomp;/<.*2020.*?>/s && print "$_\n"; exit' file
<This is a line of text with a year=2020 month=12 in it
This line of text does not have a year or month in it
This year=2021 is the current year the current month=1
This is the year=2021 the month=2/>

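Here, we're using Perl in "paragraph mode" (-000), which means that a "line" is defined by two consecutive \n characters (that is, an empty line). The script will:

chomp: remove the trailing newline(s) at the end of the "line" (paragraph).
/<.*2020.*?>/s && print "$_\n": if this "line" (paragraph) matches a <, then 0 or more characters, then 2020, then 0 or more characters, then a >, print it followed by a newline (print "$_\n"). The s modifier to the match operator allows . to match newlines.
exit: stop after the first "line" (paragraph) has been processed, whether or not it matched. Drop it if the paragraph you want might not be the first one in the file.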

Another option is awk:

$ awk 'BEGIN{RS="\n\n"} /<.*2020.+?>/' file
<This is a line of text with a year=2020 month=12 in it
This line of text does not have a year or month in it
This year=2021 is the current year the current month=1
This is the year=2021 the month=2/>

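We set the record separator RS to two consecutive newlines and then match with essentially the same regular expression as above. Since awk's default behavior when a match is found (or when any other condition evaluates to true) is to print the current record, this prints out what you need.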
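The RS="\n\n" form relies on awk accepting a multi-character record separator, which POSIX leaves unspecified. As a portable sketch, awk's paragraph mode (RS="") does the same job; the sample file here is hypothetical and only stands in for the original input:

```shell
# Hypothetical two-paragraph sample file.
sample=$(mktemp)
printf '%s\n' '<This is year=2020' 'more text/>' '' \
              '<This is year=2021' 'more text/>' > "$sample"

# RS="" is awk's paragraph mode: records are separated by one or more blank lines.
# ".*" can span lines here because the whole paragraph is a single record.
match=$(awk 'BEGIN{RS=""} /<.*2020.*>/' "$sample")

printf '%s\n' "$match"
rm -f "$sample"
```

Only the first paragraph is printed, since it is the only record containing 2020 between a < and a >.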

Question 2

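First, most text-processing tools (such as sed or awk) work line by line, so matching an entire paragraph takes some extra effort. It is possible, but it is also one of the reasons you are seeing unexpected output.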

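Second, your input looks like structured text, given the XML tag delimiter characters. It would therefore be better to process it with xmlstarlet or another dedicated tool. (Update: since you have now confirmed this in the comments, I strongly recommend using xmlstarlet or a similar tool.)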

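That said, if your text looks like the example and your awk implementation accepts multi-character record separators (as GNU Awk does), the following program should work: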

awk -v RS="<|/>" '/2020/' input.txt

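If the variable RS contains more than one character, it is interpreted as a regular expression, so either a < or a /> is treated as the "record separator" instead of the default \n. Any matching condition therefore applies to the entire text between these markers rather than to individual lines.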

Result:

This is a line of text with a year=2020 month=12 in it
This line of text does not have a year or month in it
This year=2021 is the current year the current month=1
This is the year=2021 the month=2

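Note that the "tag-open" < and "tag-close" /> character combinations are removed from the output because they were chosen as record separators. On the other hand, this means the command also works if the "paragraphs" are not separated by empty lines. (However, if there is "stray" text outside such tags that matches your pattern, it will be matched too.)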
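The caveat about "stray" text can be checked with a small sketch. It assumes an awk that treats a multi-character RS as a regular expression (GNU Awk, for example), and the three-line sample is hypothetical:

```shell
# Hypothetical sample: one stray line outside any tag, then two tagged blocks.
sample=$(mktemp)
printf '%s\n' 'stray 2020 note outside any tag' \
              '<This is year=2019/>' \
              '<This is year=2020/>' > "$sample"

# Count the records containing 2020: the tagged block and the stray line both qualify.
count=$(awk -v RS='<|/>' '/2020/ { n++ } END { print n + 0 }' "$sample")

printf 'matching records: %s\n' "$count"
rm -f "$sample"
```

Two records match here, because everything before the first < forms a record of its own.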

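You can put the regular expression you are looking for in the / ... / part of the program (just as in a sed address). However, if you are looking for a fixed string, I would recommend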

awk -v RS="<|/>" 'index($0,"2020")' input.txt

instead.
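The reason for preferring index() with fixed strings shows up as soon as the string contains a regex metacharacter. A minimal sketch with made-up input and the search string 1.5:

```shell
# In a regex, "." matches any character, so /1.5/ also matches "105".
regex_hits=$(printf '%s\n' 'v=105' 'v=1.5' | awk '/1.5/')

# index() looks for the literal substring and returns 0 (false) when it is absent.
fixed_hits=$(printf '%s\n' 'v=105' 'v=1.5' | awk 'index($0, "1.5")')

printf 'regex:\n%s\nfixed:\n%s\n' "$regex_hits" "$fixed_hits"
```

The regex variant prints both lines, while the index() variant prints only v=1.5.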

Question 3

Assuming a well-formed XML document like the following:

<root>
<thing  year="2019"
        month="1"
        day="1" />
<thing  year="2020"
        month="5"
        day="13" />
<thing  year="2021"
        month="7"
        day="3" />
</root>

You can extract a copy of each thing node whose year attribute has the value 2020 using xmlstarlet:

$ xmlstarlet sel -t -c '//thing[@year = "2020"]' -nl file
<thing year="2020" month="5" day="13"/>

Note that the whitespace inside a node, between its attributes, is not significant to the content of the document.

Question 4

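Using Raku (formerly known as Perl_6)

Here are two answers, inspired by other answers in this thread. The first splits the input into paragraphs (inspired by @terdon and @AdminBee) and then greps for the correct year: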

raku -e 'slurp.split("\n\n").grep(/2020/).put;'

Result:

<This is a line of text with a year=2020 month=12 in it
This line of text does not have a year or month in it
This year=2021 is the current year the current month=1
This is the year=2021 the month=2/>

According to Larry Wall, Raku provides features that make it easy to do more from within the language, reducing reliance on dedicated command-line switches. See "Trick #2":

https://www.nntp.perl.org/group/perl.perl6.users/2020/07/msg9004.html

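The second approach uses Raku's comb routine, which takes a regex "matcher" and breaks the text out into the matching elements (useful for further processing). As the Raku documentation describes comb: "Searches for $matcher in $input and returns a Seq of non-overlapping matches limited to at most $limit matches."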

raku -e '.put for slurp.comb(/^^ "<This" .*? "/>" $$ / ).grep(/2020/);'

Result:

<This is a line of text with a year=2020 month=12 in it
This line of text does not have a year or month in it
This year=2021 is the current year the current month=1
This is the year=2021 the month=2/>

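The code above breaks before a ^^ start-of-line assertion and after a $$ end-of-line assertion. By default, the . dot wildcard matches whitespace in Raku (including newlines), so the comb above can break the text into multi-line chunks (elements).

Obviously, the most satisfactory results for real XML documents will come from a dedicated XML tool and/or library, such as Raku with the community-supported XML module: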

https://github.com/raku-community-modules/XML
https://raku.org/

Answer