AWK with RS not matching pattern (asking again because I accidentally marked it as solved; this time with a better explanation)

2024-6-8

command-line bash awk

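I have an odt file with blank lines between the lines of text. I want to search for a term and output the whole group of text that matches it. My approach is to treat the blank lines in the odt file as record separators. Odt files are zip archives whose text is contained in content.xml. After unzipping the odt file, I use xmllint --format content.xml to insert line breaks (as shown below); the "blank" lines are actually lines with no text between > and <. So I want to set RS to any line that has no text between > and <. If the formatted content.xml file looks like this: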

<long line of alphanumerics, slashes, single and double quotes><more or the same><and many more>
      <office:text>
      <text:sequence-decls>
        <text:sequence-decl text:display-outline-level="0" text:name="Illustration"/>
        <text:sequence-decl text:display-outline-level="0" text:name="Table"/>
        <text:sequence-decl text:display-outline-level="0" text:name="Text"/>
        <text:sequence-decl text:display-outline-level="0" text:name="Drawing"/>
        <text:sequence-decl text:display-outline-level="0" text:name="Figure"/>
      </text:sequence-decls>
      <text:p text:style-name="P1">This is the first line</text:p>
      <text:p text:style-name="P1"/>
      <text:p text:style-name="P1">This is the third line</text:p>
      <text:p text:style-name="P1">and this is some more text that is to be included</text:p>
      <text:p text:style-name="P1"/>
      <text:p text:style-name="P1">This is the sixth. I want it included,</text:p>
      <text:p text:style-name="P1">with this line</text:p>
      <text:p text:style-name="P1">and this one</text:p>
    </office:text>

then the code

$ awk '/line/' RS='\n[ \t]*<[^>]*>\n' file.xml

outputs the entire file. But I only want:

      <text:p text:style-name="P1">This is the first line</text:p>
      <text:p text:style-name="P1">This is the third line</text:p>
      <text:p text:style-name="P1">and this is some more text that is to be included</text:p>
      <text:p text:style-name="P1">This is the sixth. I want it included,</text:p>
      <text:p text:style-name="P1">with this line</text:p>
      <text:p text:style-name="P1">and this one</text:p>

Answer 1

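Your approach is fraught with problems. Most importantly, there is no obvious way to restrict the regex match to the body of the document: /line/, for example, will also match tags such as <text:sequence-decl text:display-outline-level="0" text:name="Illustration"/>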
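To see why such a tag matches, note that the attribute name display-outline-level itself contains the substring "line". A minimal check, using one tag line copied from the question (the variable names are only for illustration):

```shell
# One of the non-body lines from the question's content.xml.
tag='<text:sequence-decl text:display-outline-level="0" text:name="Illustration"/>'

# /line/ is an unanchored regex, so "outline" inside the attribute name matches.
hit=$(printf '%s\n' "$tag" | awk '/line/ { print "matched" }')
echo "$hit"
```

So even a perfectly split record stream would still select these header records unless the tags are removed first.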

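(Your regular expression also has the problem that RS consumes two newlines, which prevents it from handling adjacent separators correctly; RS='\n([ \t]*<[^>]*>\n)+' might fix that, but I make no guarantees.)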
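The adjacent-separator problem can be reproduced on a tiny input. The sketch below is an illustration under assumptions, not a test of the real file: the four-line sample and the count_leftover helper are made up here, and it needs an awk that treats a multi-character RS as a regular expression (gawk and mawk do).

```shell
# Tiny stand-in for the formatted XML: two adjacent "blank" tags between text lines.
sample='<p>a</p>
<p/>
<p/>
<p>b</p>'

# Hypothetical helper: count separator tags that end up inside records
# instead of being consumed by the record separator passed as $1.
count_leftover() {
  printf '%s\n' "$sample" |
    awk -v RS="$1" '{ n += gsub(/<p\/>/, "&") } END { print n + 0 }'
}

# Original RS: the first match eats the newline that the second tag would need.
old=$(count_leftover '\n[ \t]*<[^>]*>\n')
# RS with a repeated group: a run of adjacent tags is consumed as one separator.
new=$(count_leftover '\n([ \t]*<[^>]*>\n)+')

echo "leftover tags with original RS:  $old"
echo "leftover tags with repeated RS:  $new"
```

On this sample the original RS leaves one empty tag inside a record, while the repeated group leaves none. That says nothing about the other problem above, since tag-only header lines still form records of their own.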

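Instead, I suggest extracting the body text of the document first and then applying awk in "traditional" paragraph mode (i.e. with an empty record separator):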

xmlstarlet sel -t -v "//office:body/office:text/text:p" -n content.xml | 
  awk -v RS= '/line/{print $0 ORS}'
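Paragraph mode can be tried without xmlstarlet. The sketch below assumes the extraction step yields one paragraph per line with empty paragraphs showing up as blank lines; the sample text is typed in by hand to stand in for that output, and the search term sixth is chosen only because every group in the sample contains "line".

```shell
# Hand-written stand-in for the extracted body text (assumed shape:
# one paragraph per line, empty paragraphs as blank lines).
sample='This is the first line

This is the third line
and this is some more text that is to be included

This is the sixth. I want it included,
with this line
and this one'

# An empty RS selects paragraph mode: records are separated by blank lines.
result=$(printf '%s\n' "$sample" | awk -v RS= '/sixth/')
printf '%s\n' "$result"
```

Only the last three-line group is printed, because the whole group is one record and the pattern is tested against the record as a unit.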

Or, with GNU awk, preserving the actual record separator that was parsed:

xmlstarlet sel -t -v "//office:body/office:text/text:p" -n content.xml | 
  gawk -v RS= '/line/{printf "%s%s", $0, RT}'

You could even skip the intermediate file entirely and pipe in the standard output of unzip -p:

unzip -p somefile.odt content.xml | 
  xmlstarlet sel -t -v "//office:body/office:text/text:p" -n - | gawk -v RS= '/line/{printf "%s%s", $0, RT}'

Answer 2

Answering my own question, inspired by steeldriver: I modified the file before using awk:

sed '/>.*</! s/.*/---/' test.txt > modfile.txt  # replaces every line that does NOT match the pattern with "---", which I will use as the record separator

Then I was able to extract the whole record that matches $searchterm:

awk "/$searchterm/" RS="---" modfile.txt > results.txt
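A self-contained variant of the same idea is sketched below. It makes several changes of my own: the sample is a shortened copy of the question's XML, the search term is passed with -v term=... instead of being pasted into the awk program, and RS is '---\n' so that records do not begin with a stray newline. It assumes the text itself never contains "---", and it relies on awk treating a multi-character RS as a regular expression (gawk and mawk do; POSIX leaves that case unspecified).

```shell
# Shortened stand-in for the formatted content.xml from the question.
sample='<office:text>
<text:p text:style-name="P1">This is the first line</text:p>
<text:p text:style-name="P1"/>
<text:p text:style-name="P1">This is the third line</text:p>
<text:p text:style-name="P1">and this is some more text</text:p>
<text:p text:style-name="P1"/>
<text:p text:style-name="P1">This is the sixth.</text:p>
</office:text>'

searchterm='third'

# Step 1: turn every line without text between > and < into "---".
# Step 2: split on "---" lines and print the records matching the term.
result=$(printf '%s\n' "$sample" |
  sed '/>.*</! s/.*/---/' |
  awk -v term="$searchterm" -v RS='---\n' '$0 ~ term')

printf '%s\n' "$result"
```

With searchterm set to third, only the two-line group containing "This is the third line" is printed. The term is still interpreted as a regular expression, so characters such as . or * in it keep their special meaning.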
