Converting LaTeX to XML with pdflatex

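I am writing LaTeX out to XML. When compiling with pdflatex, we generate the XML file by emitting the tags with \immediate\write{text} commands. But how can I write the ordinary text to the XML file? Can anyone explain with an example?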

Answer 1

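I am not aware of a solution using LaTeX with the pdftex engine, but ConTeXt MkIV (which uses the LuaTeX engine) supports an XML export backend, used for generating e-books and tagged PDF.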

To get XML output from a file, you need to add

\setupbackend[export=yes]

As an example, consider a simple file containing a figure, some math, and a list.

\setupbackend[export=yes]
\setuppapersize[A5]
\starttext

\startsection[title={Sample Section}]

  \startplacefigure
      [location=right, title={A sample figure}]
      \externalfigure[cow][width=2cm]
  \stopplacefigure

  \input knuth

  \placeformula[eq:1]
  \startformula
    E = mc^2 
  \stopformula

  Einstein gave the expression~(\in[eq:1]).

  \startitemize[n]
    \startitem
      First point
    \stopitem

    \startitem
      Second point
    \stopitem
  \stopitemize
\stopsection

\stoptext

This produces the following PDF output:

[image: PDF output of the example]

In addition, it generates the following XML file, \jobname.export (note that all structural information is preserved and the math is exported as MathML):

<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>

<!-- input filename   : test              -->
<!-- processing date  : Tue Dec  4 00:21:55 2012 -->
<!-- context version  : 2012.11.16 23:51  -->
<!-- exporter version : 0.30              -->


<document language="en" file="test" date="Tue Dec  4 00:21:55 2012" context="2012.11.16 23:51" version="0.30" xmlns:m="http://www.w3.org/1998/Math/MathML">
  <section detail="section" location='aut:1'>
    <sectionnumber>1</sectionnumber> 
    <sectiontitle>Sample Section</sectiontitle> 
    <sectioncontent>
      <float detail="figure" location='aut:2'>
        <floatcontent><image name="cow" id='image-1' width='2.000cm' height='1.455cm'></image></floatcontent>
        <floatcaption><floatlabel detail="figure">Figure </floatlabel><floatnumber detail="figure">1</floatnumber> <floattext>A sample figure</floattext></floatcaption>
      </float>
Thus, I came to the conclusion that the designer of a new system must not only be the implementer and first large--scale user; the designer should also write the first user manual.
      <break/>
The separation of any of these four components would have hurt TEX significantly. If I had not participated fully in all these activities, literally hundreds of improvements would never have been made, because I would never have thought of them or perceived why they were important.
      <break/>
But a system cannot be successful if it is too strongly influenced by a single person. Once the initial design is complete and fairly robust, the real test begins as people with many different viewpoints undertake their own experiments.
      <formula>
        <formulacontent>
          <m:math display="block">
            <m:mrow>
              <m:mi>
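
The listing above is cut off, but the structural part is enough to show how such an export can be consumed. The following is a minimal Python sketch, not part of the original answer: `EXPORT` is a hand-trimmed, well-formed excerpt of the listing, and `outline` is a hypothetical helper name. It uses only the standard library and assumes the element names shown above (`section`, `sectiontitle`, `floatcaption`).

```python
import xml.etree.ElementTree as ET

# Hand-trimmed, well-formed excerpt of the export listing above. The running
# text and the formula are left out because the listing is cut off there.
EXPORT = """<document language="en" file="test" xmlns:m="http://www.w3.org/1998/Math/MathML">
  <section detail="section" location='aut:1'>
    <sectionnumber>1</sectionnumber>
    <sectiontitle>Sample Section</sectiontitle>
    <sectioncontent>
      <float detail="figure" location='aut:2'>
        <floatcontent><image name="cow" id='image-1' width='2.000cm' height='1.455cm'></image></floatcontent>
        <floatcaption><floatlabel detail="figure">Figure </floatlabel><floatnumber detail="figure">1</floatnumber> <floattext>A sample figure</floattext></floatcaption>
      </float>
    </sectioncontent>
  </section>
</document>
"""


def outline(xml_text):
    """Return section headings and float captions in document order."""
    root = ET.fromstring(xml_text)
    items = []
    for section in root.iter("section"):
        number = section.findtext("sectionnumber")
        title = section.findtext("sectiontitle")
        items.append(f"{number} {title}")
        for caption in section.iter("floatcaption"):
            # itertext() joins label, number and caption text as exported.
            items.append("  " + "".join(caption.itertext()))
    return items


if __name__ == "__main__":
    for line in outline(EXPORT):
        print(line)
```

The exported elements carry no namespace of their own (only the MathML part uses the `m:` prefix), so plain tag names are sufficient for the lookups.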

Answer 2

I think there is another way: LaTeXML, a LaTeX to XML converter.

Once it is installed, you can proceed as follows.

Consider the following MWE, test_xml.tex:

\documentclass[a4paper,11pt]{article}
\usepackage{graphicx}

\begin{document}
Here is some text that precedes the image.
\begin{figure}
\includegraphics[scale=0.5]{ctan_lion} % http://www.ctan.org/lion.html
\end{figure}

Here is a formula:
\begin{equation}
e=mc^2
\end{equation}

\end{document}

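The document loads one external package, graphicx, for which a binding is available out of the box: see page 5 of the manual (loading bindings) and Appendix B. So all that is left is to run the following in a terminal: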

latexml --preload=graphicx.sty --preload=LaTeX.pool --destination=test_xml.xml test_xml

The result, test_xml.xml, is:

<?xml version="1.0" encoding="UTF-8"?>
<?latexml searchpaths=".,//home/claudio/Scrivania/prova/"?>
<?latexml package="graphicx"?>
<?latexml options="a4paper,11pt" class="article"?>
<?latexml package="graphicx"?>
<?latexml RelaxNGSchema="LaTeXML"?>
<document xmlns="http://dlmf.nist.gov/LaTeXML">
  <para xml:id="p1">
    <p>Here is some text that precedes the image.</p>
  </para>
  <figure refnum="1" xml:id="S0.F1">
    <graphics graphic="ctan_lion" options="scale=0.5"/>
    <!-- %http://www.ctan.org/lion.html -->
  </figure>
  <para xml:id="p2">
    <p>Here is a formula:</p>
    <equation refnum="1" xml:id="S0.E1">
      <Math mode="display" tex="e=mc^{2}" xml:id="S0.E1.m1" text="e = m * c ^ 2">
        <XMath>
          <XMApp>
            <XMTok meaning="equals" role="RELOP">=</XMTok>
            <XMTok role="UNKNOWN" font="italic">e</XMTok>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMTok role="UNKNOWN" font="italic">m</XMTok>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post2"/>
                <XMTok role="UNKNOWN" font="italic">c</XMTok>
                <XMTok meaning="2" role="NUMBER">2</XMTok>
              </XMApp>
            </XMApp>
          </XMApp>
        </XMath>
      </Math>
    </equation>
  </para>
</document>
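
To check that the plain text, the graphic, and the TeX source of the formula all made it into the XML, the file can be read back with any XML library. Below is a minimal Python sketch under stated assumptions, not something the LaTeXML manual prescribes: `SAMPLE` is a trimmed copy of the output above (XML declaration, processing instructions and the XMath subtree omitted), and `summarize` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the LaTeXML output shown above; the XML declaration,
# the processing instructions and the XMath subtree are omitted for brevity.
SAMPLE = """<document xmlns="http://dlmf.nist.gov/LaTeXML">
  <para xml:id="p1">
    <p>Here is some text that precedes the image.</p>
  </para>
  <figure refnum="1" xml:id="S0.F1">
    <graphics graphic="ctan_lion" options="scale=0.5"/>
  </figure>
  <para xml:id="p2">
    <p>Here is a formula:</p>
    <equation refnum="1" xml:id="S0.E1">
      <Math mode="display" tex="e=mc^{2}" xml:id="S0.E1.m1" text="e = m * c ^ 2"/>
    </equation>
  </para>
</document>
"""

# All elements live in the default namespace declared on <document>,
# so lookups need a prefix mapping ("ltx" is an arbitrary local choice).
NS = {"ltx": "http://dlmf.nist.gov/LaTeXML"}


def summarize(xml_text):
    """Collect paragraph text, formula TeX sources and graphic names."""
    root = ET.fromstring(xml_text)
    return {
        "paragraphs": [p.text for p in root.findall(".//ltx:p", NS)],
        "formulas": [m.get("tex") for m in root.findall(".//ltx:Math", NS)],
        "graphics": [g.get("graphic") for g in root.findall(".//ltx:graphics", NS)],
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Unlike a namespace-free XML file, this output puts every element in the `http://dlmf.nist.gov/LaTeXML` namespace, which is why the lookups go through the `NS` mapping; a bare `.//p` would match nothing.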

Now, some post-processing is also possible, for example to obtain .xhtml or .html files (and not only these; see the manual for reference).

For an .xhtml file:

latexmlpost --graphicimages --destination=test_xml.xhtml test_xml

For an .html file:

latexmlpost --format=html --graphicimages --destination=test_xml.html test_xml

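These commands convert the formulas and the images automatically (the images because of the --graphicimages option). The result will look similar to this:

[image: rendered output in a browser]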
