Reading full-text XML inside specific statement elements

Question

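It is not that simple. You cannot process an XML file with string patterns alone. You need an XML library such as luaxml-domobject to parse it, and then apply patterns only to the text content of the <statement> element.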

This is what the <statement> element from your example looks like when reformatted:

<statement content-type="theorem" id="stat1">
<label>Theorem 1.</label>
<p>Let 
<inline-formula><mml:math display="inline" overflow="scroll"><mml:mfenced open="(" close=")"><mml:mrow><mml:mi mathvariant="script">M</mml:mi><mml:mo>,</mml:mo><mml:mi>g</mml:mi></mml:mrow></mml:mfenced></mml:math>
<inline-graphic xlink:href="cqgab7bbaieqn7.gif"/>
</inline-formula> 
be a four-dimensional Riemannian spacetime obeying Einstein&#x2019;s field equations, 
<italic>R</italic><sub><italic>&#x3bc;&#x3bd;</italic></sub> &#x2212; (<italic>R</italic>/2)<italic>g</italic><sub><italic>&#x3bc;&#x3bd;</italic></sub> &#x2212; &#x39b;<italic>g</italic><sub><italic>&#x3bc;&#x3bd;</italic></sub> &#x3d; &#x3f0;<italic>T</italic><sub><italic>&#x3bc;&#x3bd;</italic></sub>. 
There [a,b] and [z;y] (<italic>c</italic>,<italic>d</italic>) are 134 rose in this [41] garden with 
<math><mi>x</mi><mo>=</mo><mn>2</mn></math> 
and some more text with number 1,2,3, etc. and some [45] etc.</p>
</statement>

As you can see, it is actually quite complex.

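Now, if I understand correctly, you want to add a <rom> element around each run of the characters [](),;: (the pattern used here also matches digits and periods). To do that, you need to process all child elements recursively, find the text nodes, and add the <rom> elements.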
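Before bringing in the DOM, the splitting step can be tried on a plain string. The following is only a minimal sketch in plain Lua: `split_runs` and `render` are hypothetical helper names used for illustration, not part of LuaXML, and the result is built as a string rather than as DOM nodes.

```lua
-- Character class used in this answer: brackets, parentheses, digits,
-- comma, colon, semicolon and period.
local special_pattern = "[%(%[%)%]0-9%,%:%;.]+"

-- split_runs is a hypothetical helper (not a LuaXML function).
-- It returns a list of chunks, each marked as special or normal text.
local function split_runs(text)
  local chunks = {}
  local prev = 1
  while true do
    local start, stop = text:find(special_pattern, prev)
    if not start then break end
    if start > prev then
      -- normal text between two runs of special characters
      chunks[#chunks + 1] = { text = text:sub(prev, start - 1), special = false }
    end
    chunks[#chunks + 1] = { text = text:sub(start, stop), special = true }
    prev = stop + 1
  end
  if prev <= #text then
    -- text after the last special run
    chunks[#chunks + 1] = { text = text:sub(prev), special = false }
  end
  return chunks
end

-- render is also hypothetical: it shows where the <rom> elements would go
local function render(text)
  local out = {}
  for _, chunk in ipairs(split_runs(text)) do
    if chunk.special then
      out[#out + 1] = "<rom>" .. chunk.text .. "</rom>"
    else
      out[#out + 1] = chunk.text
    end
  end
  return table.concat(out)
end

print(render("in this [41] garden"))  -- in this <rom>[41]</rom> garden
```

Adjacent special characters form a single run, which is why `[41]` ends up in one <rom> element instead of three.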

Here is a library, statement-theorem.lua. It exports a function that takes a DOM object and processes the statement elements:

local special_pattern = "[%(%[%)%]0-9%,%:%;.]+"


local function split_text(child, newchildren)
  local text = child:get_text()
  local parent = child:get_parent()
  -- add a text node to the new children, skipping empty strings
  local function make_text_node(text)
    if text ~= "" then
      table.insert(newchildren, parent:create_text_node(text))
    end
  end

  local function make_rom(text)
    -- make <rom> element 
    local rom = parent:create_element("rom")
    rom:add_child_node(rom:create_text_node(text))
    table.insert(newchildren, rom)
  end

  -- search position in the text (Lua strings are 1-indexed)
  local prev = 1

  local function read_next()
    -- find the next run of special characters, starting at prev
    local start, stop = text:find(special_pattern, prev)
    if start then
      -- part of the text between special characters
      local normal = text:sub(prev, start - 1)
      local special = text:sub(start, stop)
      make_text_node(normal)
      make_rom(special)
      prev = stop + 1
      return true
    else
      -- process text after the last special character
      make_text_node(text:sub(prev, text:len()))
      return false
    end
  end
  while read_next() do
  end
end



local function add_roman(element)
  -- process all child elements of statement, find text content and add <rom>
  -- elements to numbers and braces
  local newchildren = {}
  for _, child in ipairs(element:get_children()) do
    if child:is_text() then
      local text = child:get_text()
      -- detect if text contains special characters
      if text:match(special_pattern) then
        -- process only text that contains special characters
        split_text(child, newchildren)
      else
        table.insert(newchildren, child)
      end
    else
      if child:is_element() then
        -- recursively process child elements, but ignore MathML
        if not child:get_element_name():match(":?math$") then
          add_roman(child)
        end
      end
      table.insert(newchildren, child)
    end
  end
  -- replace the child list directly (this relies on the _children field of the DOM node)
  element._children = newchildren
end

local function process_theorems(dom)
  -- we want to process all <statement> elements
  for _, statement in ipairs(dom:query_selector "statement[content-type='theorem']") do
    add_roman(statement)
  end
end

-- return the processing function
return process_theorems

I assume you don't want to process MathML, so the code skips <math> elements, with or without a namespace prefix.
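The element-name check can be tried in isolation. This is a small sketch in plain Lua: `is_math` mirrors the pattern used in add_roman, while `is_math_strict` is a hypothetical stricter variant, offered only as an assumption about what you might want, that does not skip unrelated names that merely end in "math".

```lua
-- is_math mirrors the check in add_roman: the pattern is anchored only at
-- the end, so any element name ending in "math" matches.
local function is_math(name)
  return name:match(":?math$") ~= nil
end

-- is_math_strict is a hypothetical alternative: it accepts only "math"
-- itself or "prefix:math".
local function is_math_strict(name)
  return name == "math" or name:match("^[^:]+:math$") ~= nil
end

print(is_math("math"), is_math("mml:math"), is_math("italic"))  -- true true false
print(is_math("aftermath"), is_math_strict("aftermath"))        -- true false
```

For the sample document both variants behave the same, since the only element names ending in "math" are math and mml:math.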

The library can be used from a script like this:

kpse.set_program_name "luatex"
-- require LuaXML DOM library and load XML file from the standard input
local domobject = require "luaxml-domobject"
local process_theorems = require "statement-theorem"
local input = io.read("*all")
local dom = domobject.parse(input)

process_theorems(dom)

print(dom:serialize())

Save the script as addrom.lua and run it like this:

texlua addrom.lua < sample.xml

Note that the XML file must have a root element, so I added a dummy <root> element to make it work. This is the resulting XML:

<root>
<p>The investigations of cylindrically symmetric spacetimes can be traced back as far as to 1919 when Levi-Civita (LC) discovered a class of solutions of Einstein’s vacuum field equations, corresponding to static cylindrical spacetimes [1]. The extension of the LC spacetimes to stationary ones was obtained independently by Lanczos in 1924 [3] and Lewis in 1932 [9]. In 1925, Beck studied a class of exact solutions and interpreted them as representing the propagation of cylindrical gravitational waves (GWs) [4].</p>
<statement id='stat1' content-type='theorem'>
<label>Theorem <rom>1.</rom></label>
<p>Let <inline-formula><mml:math display='inline' overflow='scroll'><mml:mfenced close=')' open='('><mml:mrow><mml:mi mathvariant='script'>M</mml:mi><mml:mo>,</mml:mo><mml:mi>g</mml:mi></mml:mrow></mml:mfenced></mml:math><inline-graphic xlink:href='cqgab7bbaieqn7.gif'></inline-graphic></inline-formula> 
 be a four-dimensional Riemannian spacetime obeying Einstein’s field equations<rom>,</rom> 
 <italic>R</italic><sub><italic>μν</italic></sub> − 
 <rom>(</rom><italic>R</italic>/<rom>2)</rom><italic>g</italic><sub><italic>μν</italic></sub> − Λ<italic>g</italic><sub><italic>μν</italic></sub> = ϰ<italic>T</italic><sub><italic>μν</italic></sub><rom>.</rom> 
 There <rom>[</rom>a<rom>,</rom>b<rom>]</rom> and <rom>[</rom>z<rom>;</rom>y<rom>]</rom> <rom>(</rom><italic>c</italic><rom>,</rom><italic>d</italic><rom>)</rom> are <rom>134</rom> rose in this <rom>[41]</rom> garden with <math><mi>x</mi><mo>=</mo><mn>2</mn></math> and some more text with number <rom>1,2,3,</rom> etc<rom>.</rom> and some <rom>[45]</rom> etc<rom>.</rom></p></statement>
<p>He is supported in part by the National Natural Science Foundation of China (NNSCF) with the Grants Nos. 11675145 and 11975203.</p>
</root>

Answer 1