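There are similar questions, but none of them quite solves the problem I have.
In short, I need to print every block that contains any of the strings I'm looking for. Each block starts with a line containing: <entry version=
More details below:
I want to search a large file (hundreds of thousands of lines) and print each whole region (block) between patterns, but only if a certain string is found inside that region (block).
I know I can print the whole regions between patterns, where the start and end identifier of these blocks is "/<entry version=", using: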
awk '/<entry version=/{flag=1} flag; /<entry version=/{flag=0}'
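But how do I make it print a whole block only if certain strings are found between those patterns?
The shortest section of real data for a block region looks like this (although in reality each block is thousands of lines long). My thanks to Terdon for putting together a better example for me to use: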
<entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000003">
<name>TSPAN6</name>
<synonym>T245</synonym>
<synonym>TM4SF6</synonym>
<synonym>TSPAN-6</synonym>
<identifier id="ENSG00000000003" db="Ensembl" version="103.38" assembly="GRCh38.p13" gencodeVersion="37">
<xref id="O43657" db="Uniprot/SWISSPROT"/>
<xref id="7105" db="NCBI GeneID"/>
</identifier>
<proteinClasses>
<proteinClass source="MDM" id="Ma" parent_id="" name="Predicted membrane proteins"/>
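In the real format above, I would check the names and synonyms for specific strings, so if I were looking for "TSPAN6", that block would be printed. Each block is thousands of lines long, so what follows is just a made-up mini version to explain how I would like blocks printed based on string matches inside them.
Here is an example where my strings are "MEMSAT" and "TNMD".
Example input: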
<entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000003">
<name>TSPAN6</name>
<synonym>T245</synonym>
<synonym>TM4SF6</synonym>
<synonym>TSPAN-6</synonym>
<identifier id="ENSG00000000003" db="Ensembl" version="103.38" assembly="GRCh38.p13" gencodeVersion="37">
<xref id="O43657" db="Uniprot/SWISSPROT"/>
<xref id="7105" db="NCBI GeneID"/>
</identifier>
<proteinClasses>
<proteinClass source="MDM" id="Ma" parent_id="" name="Predicted membrane proteins"/>
<proteinClass source="Ezkurdia et al 2014" id="Eb" parent_id="" name="Protein evidence (Ezkurdia et al 2014)"/>
</proteinClasses>
<proteinEvidence evidence="Evidence at protein level">
<evidence source="HPA" evidence="Evidence at transcript level"/>
<evidence source="MS" evidence="Not available"/>
<evidence source="UniProt" evidence="Evidence at protein level"/>
</proteinEvidence>
</entry>
<entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000005">
<name>TNMD</name>
<synonym>BRICD4</synonym>
<synonym>ChM1L</synonym>
<synonym>myodulin</synonym>
<synonym>TEM</synonym>
<synonym>tendin</synonym>
<identifier id="ENSG00000000005" db="Ensembl" version="103.38" assembly="GRCh38.p13" gencodeVersion="37">
<xref id="Q9H2S6" db="Uniprot/SWISSPROT"/>
<xref id="64102" db="NCBI GeneID"/>
</identifier>
<proteinClasses>
<proteinClass source="MDM" id="Ma" parent_id="" name="Predicted membrane proteins"/>
<proteinClass source="MDM" id="Md" parent_id="" name="Membrane proteins predicted by MDM"/>
<proteinClass source="MEMSAT3" id="Me" parent_id="" name="MEMSAT3 predicted membrane proteins"/>
</proteinClasses>
<proteinEvidence evidence="Evidence at protein level">
<evidence source="HPA" evidence="Evidence at transcript level"/>
<evidence source="MS" evidence="Not available"/>
<evidence source="UniProt" evidence="Evidence at protein level"/>
</proteinEvidence>
</entry>
Example output:
<entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000005">
<name>TNMD</name>
<synonym>BRICD4</synonym>
<synonym>ChM1L</synonym>
<synonym>myodulin</synonym>
<synonym>TEM</synonym>
<synonym>tendin</synonym>
<identifier id="ENSG00000000005" db="Ensembl" version="103.38" assembly="GRCh38.p13" gencodeVersion="37">
<xref id="Q9H2S6" db="Uniprot/SWISSPROT"/>
<xref id="64102" db="NCBI GeneID"/>
</identifier>
<proteinClasses>
<proteinClass source="MDM" id="Ma" parent_id="" name="Predicted membrane proteins"/>
<proteinClass source="MDM" id="Md" parent_id="" name="Membrane proteins predicted by MDM"/>
<proteinClass source="MEMSAT3" id="Me" parent_id="" name="MEMSAT3 predicted membrane proteins"/>
</proteinClasses>
<proteinEvidence evidence="Evidence at protein level">
<evidence source="HPA" evidence="Evidence at transcript level"/>
<evidence source="MS" evidence="Not available"/>
<evidence source="UniProt" evidence="Evidence at protein level"/>
</proteinEvidence>
</entry>
Answer 1
Using Raku (formerly known as Perl_6)
~$ raku -MXML -e 'my $xml = open-xml($*ARGFILES.Str); \
.say for $xml.getElementsByTagName("entry").grep(/ TSPAN6 | TNMD /).pairs;' file.xml
#OR
~$ raku -MXML -e 'my @xml = open-xml($*ARGFILES.Str).getElementsByTagName("entry"); \
my @names = <TSPAN6 TNMD>; .say for @xml.grep(/@names/).pairs;' file.xml
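If you have a real XML file, you are better off using a real XML parser. In the code above, the Raku community module XML is loaded with the -MXML command-line flag (which, by the way, is the same way modules are loaded on the command line with Perl). Take some time to familiarize yourself with the current XML schema, then plan your code accordingly:
https://www.proteinatlas.org/download/proteinatlas.xsd
The first answer above opens your file as the XML document $xml. The XML document is then broken up into its (top-level) elements named entry using .getElementsByTagName(). Finally, each element is run through grep to keep the elements that contain the desired strings.
The second answer above searches for XML elements named entry while opening the file, and stores them in the Raku array @xml. Each element is then run through grep to keep the elements that contain the desired strings, which have been saved in the array @names.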
Example input (taken from @terdon's excellent answer):
<?xml version="1.0" encoding="UTF-8"?>
<proteinAtlas xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://v21.proteinatlas.org/download/proteinatlas.xsd" schemaVersion="2.6">
<entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000003">
<name>TSPAN6</name>
<synonym>T245</synonym>
<synonym>TM4SF6</synonym>
<synonym>TSPAN-6</synonym>
<identifier id="ENSG00000000003" db="Ensembl" version="103.38" assembly="GRCh38.p13" gencodeVersion="37">
<xref id="O43657" db="Uniprot/SWISSPROT"/>
<xref id="7105" db="NCBI GeneID"/>
</identifier>
<proteinClasses>
<proteinClass source="MDM" id="Ma" parent_id="" name="Predicted membrane proteins"/>
<proteinClass source="Ezkurdia et al 2014" id="Eb" parent_id="" name="Protein evidence (Ezkurdia et al 2014)"/>
</proteinClasses>
<proteinEvidence evidence="Evidence at protein level">
<evidence source="HPA" evidence="Evidence at transcript level"/>
<evidence source="MS" evidence="Not available"/>
<evidence source="UniProt" evidence="Evidence at protein level"/>
</proteinEvidence>
</entry>
<entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000005">
<name>TNMD</name>
<synonym>BRICD4</synonym>
<synonym>ChM1L</synonym>
<synonym>myodulin</synonym>
<synonym>TEM</synonym>
<synonym>tendin</synonym>
<identifier id="ENSG00000000005" db="Ensembl" version="103.38" assembly="GRCh38.p13" gencodeVersion="37">
<xref id="Q9H2S6" db="Uniprot/SWISSPROT"/>
<xref id="64102" db="NCBI GeneID"/>
</identifier>
<proteinClasses>
<proteinClass source="MDM" id="Ma" parent_id="" name="Predicted membrane proteins"/>
<proteinClass source="MDM" id="Md" parent_id="" name="Membrane proteins predicted by MDM"/>
<proteinClass source="MEMSAT3" id="Me" parent_id="" name="MEMSAT3 predicted membrane proteins"/>
</proteinClasses>
<proteinEvidence evidence="Evidence at protein level">
<evidence source="HPA" evidence="Evidence at transcript level"/>
<evidence source="MS" evidence="Not available"/>
<evidence source="UniProt" evidence="Evidence at protein level"/>
</proteinEvidence>
</entry>
</proteinAtlas>
Example output:
0 => <entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000003">
<name>TSPAN6</name>
<synonym>T245</synonym>
<synonym>TM4SF6</synonym>
<synonym>TSPAN-6</synonym>
<identifier version="103.38" gencodeVersion="37" assembly="GRCh38.p13" db="Ensembl" id="ENSG00000000003">
<xref id="O43657" db="Uniprot/SWISSPROT"/>
<xref db="NCBI GeneID" id="7105"/>
</identifier>
<proteinClasses>
<proteinClass id="Ma" source="MDM" parent_id="" name="Predicted membrane proteins"/>
<proteinClass name="Protein evidence (Ezkurdia et al 2014)" parent_id="" id="Eb" source="Ezkurdia et al 2014"/>
</proteinClasses>
<proteinEvidence evidence="Evidence at protein level">
<evidence source="HPA" evidence="Evidence at transcript level"/>
<evidence evidence="Not available" source="MS"/>
<evidence evidence="Evidence at protein level" source="UniProt"/>
</proteinEvidence>
</entry>
1 => <entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000005">
<name>TNMD</name>
<synonym>BRICD4</synonym>
<synonym>ChM1L</synonym>
<synonym>myodulin</synonym>
<synonym>TEM</synonym>
<synonym>tendin</synonym>
<identifier version="103.38" assembly="GRCh38.p13" db="Ensembl" gencodeVersion="37" id="ENSG00000000005">
<xref id="Q9H2S6" db="Uniprot/SWISSPROT"/>
<xref db="NCBI GeneID" id="64102"/>
</identifier>
<proteinClasses>
<proteinClass name="Predicted membrane proteins" source="MDM" id="Ma" parent_id=""/>
<proteinClass parent_id="" id="Md" name="Membrane proteins predicted by MDM" source="MDM"/>
<proteinClass id="Me" name="MEMSAT3 predicted membrane proteins" parent_id="" source="MEMSAT3"/>
</proteinClasses>
<proteinEvidence evidence="Evidence at protein level">
<evidence evidence="Evidence at transcript level" source="HPA"/>
<evidence evidence="Not available" source="MS"/>
<evidence evidence="Evidence at protein level" source="UniProt"/>
</proteinEvidence>
</entry>
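The call to pairs above simply numbers the output elements. Finally, @terdon's comment correctly points out that using grep on gene names may not be your safest option. If you search for Ids instead, the return value can be greatly simplified (provided the limited subset that is returned is useful to you):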
~$ raku -MXML -e 'my $xml=open-xml($*ARGFILES.Str); put $xml.getElementById("ENSG00000000003"|"ENSG00000000005").pairs;' file.xml
Returns:
0 <identifier version="103.38" gencodeVersion="37" id="ENSG00000000003" db="Ensembl" assembly="GRCh38.p13">
<xref id="O43657" db="Uniprot/SWISSPROT"/>
<xref id="7105" db="NCBI GeneID"/>
</identifier>
0 <identifier gencodeVersion="37" assembly="GRCh38.p13" version="103.38" db="Ensembl" id="ENSG00000000005">
<xref db="Uniprot/SWISSPROT" id="Q9H2S6"/>
<xref db="NCBI GeneID" id="64102"/>
</identifier>
https://github.com/raku-community-modules/XML
https://rakudo.org/
https://raku.org
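Answer 2
Assuming the input is a well-formed XML document (as in terdon's answer, but unlike what is shown in the question), you can use xmlstarlet to output a copy of each entry node that has a particular name and a proteinClass with a particular source attribute.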
xmlstarlet select --template \
--copy-of '//entry[name = "TNMD" and proteinClasses/proteinClass/@source = "MEMSAT3"]' \
-nl file
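This selects all entry nodes that have a particular value in their name child node and that also have a proteinClasses/proteinClass child node with a particular value for its source attribute. A copy of each matching entry node is output, followed by a trailing newline.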
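Since the question checks both names and synonyms, the same idea can be widened with an "or" in the XPath predicate. The following is only a sketch, assuming xmlstarlet is installed and the file is well-formed. The wrapper name entries_by_gene is hypothetical, and the gene name is pasted straight into the XPath expression, so it must not contain double quotes.

```shell
# entries_by_gene NAME FILE   (hypothetical helper, not part of xmlstarlet)
# Copies every <entry> whose <name> or any <synonym> equals NAME exactly.
entries_by_gene() {
    # In XPath 1.0, 'synonym = "X"' is true if ANY synonym child equals X.
    xmlstarlet select --template \
        --copy-of "//entry[name = \"$1\" or synonym = \"$1\"]" \
        --nl "$2"
}
```

Called as entries_by_gene TEM file, this should copy the TNMD entry from the sample, because TEM is one of its synonyms. The comparison is an exact match on the whole element text, so TSPAN6 would not match TSPAN-6.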
Answer 3
You can do this in GNU awk by using </entry[^>]*> as the record separator. For example, using this file as input:
<?xml version="1.0" encoding="UTF-8"?>
<proteinAtlas xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://v21.proteinatlas.org/download/proteinatlas.xsd" schemaVersion="2.6">
<entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000003">
<name>TSPAN6</name>
<synonym>T245</synonym>
<synonym>TM4SF6</synonym>
<synonym>TSPAN-6</synonym>
<identifier id="ENSG00000000003" db="Ensembl" version="103.38" assembly="GRCh38.p13" gencodeVersion="37">
<xref id="O43657" db="Uniprot/SWISSPROT"/>
<xref id="7105" db="NCBI GeneID"/>
</identifier>
<proteinClasses>
<proteinClass source="MDM" id="Ma" parent_id="" name="Predicted membrane proteins"/>
<proteinClass source="Ezkurdia et al 2014" id="Eb" parent_id="" name="Protein evidence (Ezkurdia et al 2014)"/>
</proteinClasses>
<proteinEvidence evidence="Evidence at protein level">
<evidence source="HPA" evidence="Evidence at transcript level"/>
<evidence source="MS" evidence="Not available"/>
<evidence source="UniProt" evidence="Evidence at protein level"/>
</proteinEvidence>
</entry>
<entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000005">
<name>TNMD</name>
<synonym>BRICD4</synonym>
<synonym>ChM1L</synonym>
<synonym>myodulin</synonym>
<synonym>TEM</synonym>
<synonym>tendin</synonym>
<identifier id="ENSG00000000005" db="Ensembl" version="103.38" assembly="GRCh38.p13" gencodeVersion="37">
<xref id="Q9H2S6" db="Uniprot/SWISSPROT"/>
<xref id="64102" db="NCBI GeneID"/>
</identifier>
<proteinClasses>
<proteinClass source="MDM" id="Ma" parent_id="" name="Predicted membrane proteins"/>
<proteinClass source="MDM" id="Md" parent_id="" name="Membrane proteins predicted by MDM"/>
<proteinClass source="MEMSAT3" id="Me" parent_id="" name="MEMSAT3 predicted membrane proteins"/>
</proteinClasses>
<proteinEvidence evidence="Evidence at protein level">
<evidence source="HPA" evidence="Evidence at transcript level"/>
<evidence source="MS" evidence="Not available"/>
<evidence source="UniProt" evidence="Evidence at protein level"/>
</proteinEvidence>
</entry>
</proteinAtlas>
You can get the data for TNMD with:
$ gawk 'BEGIN{ RS="</entry[^>]*>" } /TNMD/' a
<entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000005">
<name>TNMD</name>
<synonym>BRICD4</synonym>
<synonym>ChM1L</synonym>
<synonym>myodulin</synonym>
<synonym>TEM</synonym>
<synonym>tendin</synonym>
<identifier id="ENSG00000000005" db="Ensembl" version="103.38" assembly="GRCh38.p13" gencodeVersion="37">
<xref id="Q9H2S6" db="Uniprot/SWISSPROT"/>
<xref id="64102" db="NCBI GeneID"/>
</identifier>
<proteinClasses>
<proteinClass source="MDM" id="Ma" parent_id="" name="Predicted membrane proteins"/>
<proteinClass source="MDM" id="Md" parent_id="" name="Membrane proteins predicted by MDM"/>
<proteinClass source="MEMSAT3" id="Me" parent_id="" name="MEMSAT3 predicted membrane proteins"/>
</proteinClasses>
<proteinEvidence evidence="Evidence at protein level">
<evidence source="HPA" evidence="Evidence at transcript level"/>
<evidence source="MS" evidence="Not available"/>
<evidence source="UniProt" evidence="Evidence at protein level"/>
</proteinEvidence>
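This just means "print this record if it matches TNMD". Note that the closing </entry> tag is not printed, because it is the record separator. Of course, the record will also be printed if it contains something like "87% identity to TNMD", and this will certainly break in various edge cases since we aren't using a proper parser.
With a proper parser, you can specify exactly where the string should be.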
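If no XML parser is at hand, a middle ground is to restrict the match to the <name> and <synonym> lines that the question says it wants to check. Below is a minimal sketch in portable awk, not a real parser. It assumes that every tag sits on its own line as in the sample, that each block starts with a line containing <entry version= and ends with a line containing </entry>, and that the names contain no commas. The function name print_entries and its comma-separated first argument are made up for illustration.

```shell
# print_entries NAMES FILE   (hypothetical helper)
# NAMES is a comma-separated list such as "TSPAN6,TNMD".
# Prints each <entry>...</entry> block whose <name> or <synonym> text
# is exactly one of the names.
print_entries() {
    awk -v names="$1" '
        BEGIN {
            # Build a lookup table: want["TSPAN6"], want["TNMD"], ...
            n = split(names, list, ",")
            for (i = 1; i <= n; i++) want[list[i]] = 1
        }
        /<entry version=/ { inblock = 1; block = ""; hit = 0 }  # block starts
        inblock {
            block = block $0 "\n"              # buffer the current block
            if ($0 ~ /<(name|synonym)>/) {
                text = $0
                sub(/^[^>]*>/, "", text)       # drop everything up to the opening tag
                sub(/<.*$/, "", text)          # drop the closing tag
                if (text in want) hit = 1      # exact match, not a substring
            }
        }
        /<\/entry>/ {
            if (inblock && hit) printf "%s", block   # emit matching blocks only
            inblock = 0
        }
    ' "$2"
}
```

For example, print_entries TSPAN6,TNMD file prints both sample blocks, closing </entry> included. Because only <name> and <synonym> are compared, a line such as "87% identity to TNMD" elsewhere in a block no longer triggers a match. By the same token, MEMSAT would not match here, since it only occurs inside a proteinClass attribute.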