awk: parse and write to another file

Question 1

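I'm assuming that what you posted is a sample, because it isn't valid XML. If that assumption doesn't hold, my answer doesn't hold either... but in that case you really need to take a rolled-up copy of the XML spec to whoever gave you the XML and demand that they 'fix it'.

But really, awk and regular expressions are not the right tool for this job. An XML parser is. With a parser, it's absurdly simple to do what you want: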

#!/usr/bin/env perl

use strict;
use warnings;

use XML::Twig; 

#parse your file - this will error if it's invalid. 
my $twig = XML::Twig -> new -> parsefile ( 'your_xml' );
#set output format. Optional. 
$twig -> set_pretty_print('indented_a');

#iterate all the 'record' nodes off the root. 
foreach my $record ( $twig -> get_xpath ( './record' ) ) {
   #if - beneath this record - we have a node anywhere (that's what // means)
   #with a tag of 'keyword' and content of 'SEARCH' 
   #print the whole record. 
   if ( $record -> get_xpath ( './/keyword[string()="SEARCH"]' ) ) {
       $record -> print;
   }
}

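XPath is quite like regular expressions in some ways, but it's more like a directory path. That means it's context-aware and can cope with XML structure.

In the above, ./ means 'below the current node', so: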

$twig -> get_xpath ( './record' )

means any 'top level' <record> tags.

But .// means 'at any level below the current node', so it will search recursively.

$twig -> get_xpath ( './/search' )

would get any <search> nodes at any level.
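
To make the difference between ./ and .// concrete, here is a minimal sketch. The miniature document and the variable names are made up for illustration; only the XML::Twig calls already used above (plus parse, which takes a string instead of a file name) are assumed.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use XML::Twig;

# Hypothetical miniature document, for illustration only.
my $xml = <<'XML';
<root>
  <record>
    <keyword>top</keyword>
    <details>
      <keyword>nested</keyword>
    </details>
  </record>
</root>
XML

my $twig = XML::Twig->new->parse($xml);
my ($record) = $twig->get_xpath('./record');

# './keyword' matches only direct children of <record>.
my @direct = $record->get_xpath('./keyword');

# './/keyword' matches <keyword> at any depth below <record>.
my @any_depth = $record->get_xpath('.//keyword');

print scalar(@direct), " direct, ", scalar(@any_depth), " at any depth\n";
```

The nested <keyword> is only found by the .// form, which is why the script above uses it to look inside each record.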

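Square brackets denote a condition. It can be a function (such as string(), used above, which gets the text of the node), or it can use an attribute. For example, //category[@name] will find any category that has a name attribute, and //category[@name="xyz"] filters those further.

XML used for testing: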
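
As a rough sketch of those bracket conditions in action, applied to the category attribute of <record> as it appears in the test XML: the inline sample data, the <name> elements and the variable names here are assumptions for illustration, not part of your file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use XML::Twig;

# Hypothetical sample data, for illustration only.
my $xml = <<'XML';
<records>
  <record category="xyz"><name>first</name></record>
  <record category="abc"><name>second</name></record>
  <record><name>third</name></record>
</records>
XML

my $twig = XML::Twig->new->parse($xml);

# [@category] keeps any <record> that has a category attribute at all.
my @with_attr = $twig->get_xpath('//record[@category]');

# [@category="xyz"] narrows that down to one attribute value.
my @xyz = $twig->get_xpath('//record[@category="xyz"]');

# string() in a condition tests the element's text content.
my @second = $twig->get_xpath('//name[string()="second"]');

print scalar(@with_attr), " with a category, ", scalar(@xyz), " in xyz\n";
print "matched name: ", $_->text, "\n" for @second;
```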

<XML>
<record category="xyz">
<person ssn="" e-i="E">
<title xsi:nil="true"/>
<position xsi:nil="true"/>
<details>
<names>
<first_name/>
<last_name></last_name>
</names>
<aliases>
<alias>CDP</alias>
</aliases>
<keywords>
<keyword xsi:nil="true"/>
<keyword>SEARCH</keyword>
</keywords>
<external_sources>
<uri>http://www.google.com</uri>
<detail>SEARCH is present in abc for xyz reason</detail>
</external_sources>
</details>
</person>
</record>
<record category="abc">
<person ssn="" e-i="F">
<title xsi:nil="true"/>
<position xsi:nil="true"/>
<details>
<names>
<first_name/>
<last_name></last_name>
</names>
<aliases>
<alias>CDP</alias>
</aliases>
<keywords>
<keyword xsi:nil="true"/>
<keyword>DONTSEARCH</keyword>
</keywords>
<external_sources>
<uri>http://www.google.com</uri>
<detail>SEARCH is not present in abc for xyz reason</detail>
</external_sources>
</details>
</person>
</record>
</XML>

Output:

 <record category="xyz">
    <person
        e-i="E"
        ssn="">
      <title xsi:nil="true" />
      <position xsi:nil="true" />
      <details>
        <names>
          <first_name/>
          <last_name></last_name>
        </names>
        <aliases>
          <alias>CDP</alias>
        </aliases>
        <keywords>
          <keyword xsi:nil="true" />
          <keyword>SEARCH</keyword>
        </keywords>
        <external_sources>
          <uri>http://www.google.com</uri>
          <detail>SEARCH is present in abc for xyz reason</detail>
        </external_sources>
      </details>
    </person>
  </record>

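Note: the above just prints each matching record to STDOUT. Really, though, I don't think that's a good idea. Not least because it doesn't print the enclosing XML structure, so if more than one record matches you get several records with no 'root' node, which isn't actually 'valid' XML.

So instead I would do this, which accomplishes exactly what you're asking: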

#!/usr/bin/env perl

use strict;
use warnings;

use XML::Twig; 

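#parse the file - this will error if it's invalid. 
my $twig = XML::Twig -> new -> parsefile ('your_file.xml'); 
$twig -> set_pretty_print('indented_a');

#delete every record that has no 'keyword' node containing 'SEARCH'. 
foreach my $record ( $twig -> get_xpath ( './record' ) ) {
   if ( not $record -> findnodes ( './/keyword[string()="SEARCH"]' ) ) {
       $record -> delete;
   }
}

#write what's left, root node included, to the output file. 
open ( my $output, '>', "output.txt" ) or die $!;
print {$output} $twig -> sprint;
close ( $output );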

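This inverts the logic: it deletes the records you don't want (from the parsed data structure in memory), then prints the whole new structure (including any XML header) to a new file called 'output.txt'.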
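
If you want to convince yourself that the filtered result is still a single well-formed document, one option is to parse it again. This is only a sketch under assumptions: it works on a made-up inline sample rather than your file, and it keeps the result in a string (the hypothetical $filtered) instead of writing output.txt.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use XML::Twig;

# Hypothetical sample data, for illustration only.
my $xml = <<'XML';
<records>
  <record category="xyz"><keyword>SEARCH</keyword></record>
  <record category="abc"><keyword>DONTSEARCH</keyword></record>
</records>
XML

my $twig = XML::Twig->new->parse($xml);

# Remove every record that lacks a <keyword> whose text is exactly SEARCH.
foreach my $record ( $twig->get_xpath('./record') ) {
    my @hits = $record->get_xpath('.//keyword[string()="SEARCH"]');
    $record->delete unless @hits;
}

# Serialise what is left; the root element is kept, so this is one document.
my $filtered = $twig->sprint;

# Re-parsing dies on malformed input, so this doubles as a sanity check.
my $check = XML::Twig->new->parse($filtered);
my @kept  = $check->get_xpath('./record');

print scalar(@kept), " record kept, category ", $kept[0]->att('category'), "\n";
```

Because string()="SEARCH" is an exact match, the DONTSEARCH record is dropped even though its text contains SEARCH, which is the kind of thing a naive regex would get wrong.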

Answer