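I have a large .gz file of about 5 GB. It contains several XML files, and I would like to search them for certain values and extract those values if they are present.
For example, I want to extract the tags whose name contains NOOSS, together with the child content of that tag, such as <pmJobId>, <requestedJobState>, <reportingPeriod>, and <jobPriority>.
Sample content from the .gz file: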
<Pm xmlns="urnCmwPm">
<pmId>1</pmId>
<PmJob>
<pmJobId>NOOSSCONTROLExample</pmJobId>
<requestedJobState>ACTIVE</requestedJobState>
<reportingPeriod>FIVE_MIN</reportingPeriod>
<jobType>MEASUREMENTJOB</jobType>
<jobPriority>HIGH</jobPriority>
<granularityPeriod>FIVE_MIN</granularityPeriod>
<jobGroup>Sla</jobGroup>
<reportContentGeneration>CHANGED_ONLY</reportContentGeneration>
<MeasurementReader>
<measurementReaderId>mr_2</measurementReaderId>
<measurementSpecification struct="MeasurementSpecification">
<measurementTypeRef>Anything</measurementTypeRef>
</measurementSpecification>
<thresholdRateOfVariation>PER_SECOND</thresholdRateOfVariation>
</MeasurementReader>
<MeasurementReader>
<measurementReaderId>mr_1</measurementReaderId>
<measurementSpecification struct="MeasurementSpecification">
<measurementTypeRef>ManagedElement=1,SystemFunctions=1,Pm=1,PmGroup=OSProcessingLogicalUnit,MeasurementType=CPULoad.Total</measurementTypeRef>
</measurementSpecification>
<thresholdRateOfVariation>PER_SECOND</thresholdRateOfVariation>
</MeasurementReader>
</PmJob>
</Pm>
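I am currently using
cat *gz | zgrep -a "PmJobId"
but the output shows only the <pmJobId> values, not the rest of the information or the other tags.
Please help; I am quite new to this.
I am using CentOS / Red Hat Linux.
Thanks.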
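Answer 1
Assuming the XML document in file.xml is well-formed and correct in every respect (the example in the question has a faulty namespace declaration), you can extract the parts of the document that correspond to PmJob nodes whose pmJobId value contains the substring NOOSS with the command-line XML parser xmlstarlet: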
xmlstarlet sel -t -c '//PmJob[contains(pmJobId,"NOOSS")]' -nl file.xml
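This command selects every PmJob node that has a pmJobId child node whose value contains the substring NOOSS. The utility returns a copy of each selected PmJob node together with all of its child nodes.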
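For readers who want to see what the predicate contains(pmJobId,"NOOSS") does, here is a minimal Python sketch of the same selection using only the standard library. It is not part of the answer's method. It assumes, as the answer does, that the document has no namespace declaration; the second PmJob in SAMPLE (OtherJob, STOPPED) and the function name select_jobs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the second PmJob is hypothetical and exists only
# to show that non-matching jobs are skipped.
SAMPLE = """<Pm>
  <pmId>1</pmId>
  <PmJob>
    <pmJobId>NOOSSCONTROLExample</pmJobId>
    <requestedJobState>ACTIVE</requestedJobState>
  </PmJob>
  <PmJob>
    <pmJobId>OtherJob</pmJobId>
    <requestedJobState>STOPPED</requestedJobState>
  </PmJob>
</Pm>"""


def select_jobs(xml_text, needle="NOOSS"):
    """Return the PmJob elements whose pmJobId child text contains needle."""
    root = ET.fromstring(xml_text)
    # ElementTree's limited XPath support has no contains() function,
    # so the substring test is done in Python instead.
    return [
        job
        for job in root.iter("PmJob")
        if needle in (job.findtext("pmJobId") or "")
    ]


for job in select_jobs(SAMPLE):
    # Serialise the whole matching node, children included.
    print(ET.tostring(job, encoding="unicode"))
```

As with the xmlstarlet command, each matching PmJob is returned whole, so every child element is available afterwards.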
Answer 2
Assuming the XML document is well-formed and valid, you can use the xmllint utility to output the required nodes.
$ xmllint --xpath '//PmJob[contains(pmJobId,"NOOSS")]' file.xml
<PmJob>
<pmJobId>NOOSSCONTROLExample</pmJobId>
<requestedJobState>ACTIVE</requestedJobState>
<reportingPeriod>FIVE_MIN</reportingPeriod>
<jobType>MEASUREMENTJOB</jobType>
<jobPriority>HIGH</jobPriority>
<granularityPeriod>FIVE_MIN</granularityPeriod>
<jobGroup>Sla</jobGroup>
<reportContentGeneration>CHANGED_ONLY</reportContentGeneration>
<MeasurementReader>
<measurementReaderId>mr_2</measurementReaderId>
<measurementSpecification struct="MeasurementSpecification">
<measurementTypeRef>Anything</measurementTypeRef>
</measurementSpecification>
<thresholdRateOfVariation>PER_SECOND</thresholdRateOfVariation>
</MeasurementReader>
<MeasurementReader>
<measurementReaderId>mr_1</measurementReaderId>
<measurementSpecification struct="MeasurementSpecification">
<measurementTypeRef>ManagedElement=1,SystemFunctions=1,Pm=1,PmGroup=OSProcessingLogicalUnit,MeasurementType=CPULoad.Total</measurementTypeRef>
</measurementSpecification>
<thresholdRateOfVariation>PER_SECOND</thresholdRateOfVariation>
</MeasurementReader>
</PmJob>
$
This utility is installed by default on many Linux distributions.
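Both answers work on an uncompressed file.xml. Since the question is about a 5 GB .gz file, the following Python sketch shows one possible way to stream the compressed data without unpacking it first and to pull out only the four child values the question names. It rests on assumptions not confirmed by the question: that each .gz file holds a single well-formed XML document, and that any namespace can simply be ignored. The names matching_jobs, local_name and FIELDS are made up for illustration.

```python
import gzip
import xml.etree.ElementTree as ET

# Child elements the question asks for.
FIELDS = ("pmJobId", "requestedJobState", "reportingPeriod", "jobPriority")


def local_name(tag):
    """Drop a '{namespace}' prefix, if any, from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def matching_jobs(gz_path, needle="NOOSS"):
    """Yield one dict per PmJob whose pmJobId text contains needle.

    Assumes gz_path is a gzip-compressed, well-formed XML document.
    """
    with gzip.open(gz_path, "rb") as stream:
        # iterparse reads the stream incrementally; an "end" event fires
        # once an element and all of its children have been parsed.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if local_name(elem.tag) != "PmJob":
                continue
            values = {
                local_name(child.tag): (child.text or "").strip()
                for child in elem
            }
            if needle in values.get("pmJobId", ""):
                yield {name: values.get(name, "") for name in FIELDS}
            # Release the children of the processed PmJob to limit memory use.
            elem.clear()
```

A call such as `for row in matching_jobs("file.xml.gz"): print(row)` would print one dictionary per matching job; the file name here is only a placeholder. If the archive really contains several XML documents concatenated one after another, the parser will stop with an error at the second root element, and the data would have to be split into separate documents first.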