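I'm trying to extract the text that sits between each opening <w:t> tag and its closing </w:t> tag, but I only get the text from the last element, not from the others. How can I get all of them?
This is the code I have been trying to use: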
grep '<w:t>' word/document.xml | sed 's/.*<w:t>\(.*\)<\/w:t>.*/\1/' | cat > brev.txt
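As you can see, I grep for the tags in document.xml in the word directory and write the result to a file named brev.txt, but it doesn't quite work. How do I get all of the text, not just the last tagged piece?
The document.xml file is a single-line text file, if that makes any difference.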
I also tried another command, which gave me everything from the first <w:t> tag to the last </w:t> tag, including a lot of extra text in between:
grep -o '<w:t>.*</w:t>' word/document.xml | sed 's/\(<w:t>\|<\/w:t>\)//g' > brev.txt
Example file (formatted for readability; the original is a single line):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" xmlns:cx6="http://schemas.microsoft.com/office/drawing/2016/5/12/chartex" xmlns:cx7="http://schemas.microsoft.com/office/drawing/2016/5/13/chartex" xmlns:cx8="http://schemas.microsoft.com/office/drawing/2016/5/14/chartex" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:aink="http://schemas.microsoft.com/office/drawing/2016/ink" xmlns:am3d="http://schemas.microsoft.com/office/drawing/2017/model3d" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w16cid="http://schemas.microsoft.com/office/word/2016/wordml/cid" xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" 
xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 w16se w16cid wp14">
<w:body>
<w:p w14:paraId="35B527D8" w14:textId="4CF0BDCB" w:rsidR="0068138C" w:rsidRDefault="00BF1E48">
<w:r>
<w:t>Here’s a Word document. It has several sentences.</w:t>
</w:r>
</w:p>
<w:p w14:paraId="4AADFADF" w14:textId="4F49E2CE" w:rsidR="00BF1E48" w:rsidRDefault="00BF1E48">
<w:r>
<w:t>Most are short.</w:t>
</w:r>
</w:p>
<w:p w14:paraId="608ED30C" w14:textId="2163C420" w:rsidR="00BF1E48" w:rsidRDefault="00BF1E48">
<w:r>
<w:t>All are in English.</w:t>
</w:r>
</w:p>
<w:p w14:paraId="0B67C683" w14:textId="77777777" w:rsidR="00BF1E48" w:rsidRDefault="00BF1E48">
<w:bookmarkStart w:id="0" w:name="_GoBack"/>
<w:bookmarkEnd w:id="0"/>
</w:p>
<w:sectPr w:rsidR="00BF1E48">
<w:pgSz w:w="11906" w:h="16838"/>
<w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/>
<w:cols w:space="708"/>
<w:docGrid w:linePitch="360"/>
</w:sectPr>
</w:body>
</w:document>
Answer 1
Use an XML parser to parse XML. With the sample document that I added to your question:
xmlstarlet sel -t -v '//w:t' -n word/document.xml >brev.txt
cat brev.txt
Here’s a Word document. It has several sentences.
Most are short.
All are in English.
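If you really can't get hold of an XML parser but you do have GNU grep, you can use the pattern below. It is still the wrong way to solve the problem, though: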
grep -oP '(?<=<w:t>).*?(?=</w:t>)' word/document.xml
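To see why the earlier attempts misbehave on a single-line file, here is a small sketch that compares greedy and lazy matching. The sample string is a made-up stand-in for word/document.xml, and it assumes a GNU grep built with PCRE support (the -P option). In the first sed attempt, the leading .* is greedy in the same way, so it swallows everything up to the last <w:t>, which is why only the last text came out.

```shell
# Hypothetical one-line sample that mimics the structure of word/document.xml.
sample='<w:p><w:r><w:t>Most are short.</w:t></w:r></w:p><w:p><w:r><w:t>All are in English.</w:t></w:r></w:p>'

# Greedy: .* runs from the first <w:t> to the last </w:t>, so there is one big match.
greedy=$(printf '%s\n' "$sample" | grep -o '<w:t>.*</w:t>')

# Lazy: .*? stops at the nearest </w:t>, so each element becomes its own match.
# -P (PCRE) is a GNU grep option; it is needed for .*? and the lookarounds.
lazy=$(printf '%s\n' "$sample" | grep -oP '(?<=<w:t>).*?(?=</w:t>)')

printf 'greedy: %s\n' "$greedy"
printf 'lazy:\n%s\n' "$lazy"
```

Note that Word writes some runs as <w:t xml:space="preserve">, which the fixed lookbehind (?<=<w:t>) does not match. That is one more reason the parser is the more robust choice.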