Reading the full text of specific statement elements in XML

My previous question was "Global regular expression in Lua script".

Based on the answer by michal-h21 (https://tex.stackexchange.com/users/2891/michal-h21), I created this simple Lua script, but I have not been able to define a new function.

I have the following XML file.

<p>The investigations of cylindrically symmetric spacetimes can be traced back as far as to 1919 when Levi-Civita (LC) discovered a class of solutions of Einstein&#x2019;s vacuum field equations, corresponding to static cylindrical spacetimes [1]. The extension of the LC spacetimes to stationary ones was obtained independently by Lanczos in 1924 [3] and Lewis in 1932 [9]. In 1925, Beck studied a class of exact solutions and interpreted them as representing the propagation of cylindrical gravitational waves (GWs) [4].</p>
<statement content-type="theorem" id="stat1"><label>Theorem 1.</label><p>Let <inline-formula><mml:math display="inline" overflow="scroll"><mml:mfenced open="(" close=")"><mml:mrow><mml:mi mathvariant="script">M</mml:mi><mml:mo>,</mml:mo><mml:mi>g</mml:mi></mml:mrow></mml:mfenced></mml:math><inline-graphic xlink:href="cqgab7bbaieqn7.gif"/></inline-formula> be a four-dimensional Riemannian spacetime obeying Einstein&#x2019;s field equations, <italic>R</italic><sub><italic>&#x3bc;&#x3bd;</italic></sub> &#x2212; (<italic>R</italic>/2)<italic>g</italic><sub><italic>&#x3bc;&#x3bd;</italic></sub> &#x2212; &#x39b;<italic>g</italic><sub><italic>&#x3bc;&#x3bd;</italic></sub> &#x3d; &#x3f0;<italic>T</italic><sub><italic>&#x3bc;&#x3bd;</italic></sub>. There [a,b] and [z;y] (<italic>c</italic>,<italic>d</italic>) are 134 rose in this [41] garden with <math><mi>x</mi><mo>=</mo><mn>2</mn></math> and some more text with number 1,2,3, etc. and some [45] etc.</p></statement>
<p>He is supported in part by the National Natural Science Foundation of China (NNSCF) with the Grants Nos. 11675145 and 11975203.</p>

My Lua script is:

local xml = "XML INPUT TEXT SHOULD BE HERE" --<p>The investigations of ... and 11975203.</p>
local rgx = ""
local reg = "([^%(%[%)%]0-9,:;]*)([%(%[%)%]0-9,:;]+)"
for w in string.gmatch(xml, "([^%(%[%)%]0-9,]*)([%(%[%)%]0-9,]+)") do
   rgx = rgx .. reg
end
local m = {string.match(xml, rgx)}

local n = {}
for i,v in ipairs(m) do
  j = i%2
  if j==0 then
     table.insert(n,"<rom>"..v.."</rom>")
  else
     table.insert(n,v)
  end
end
print(table.concat(n,""))

This script works fine with a fixed value of local xml. How can I read the whole content from the XML file? I only need the <statement content-type="theorem"> elements, not the <p> tags.

Answer 1

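It is not that simple. You cannot process an XML file with string patterns alone. You need an XML library such as luaxml-domobject to process it, and you should apply patterns only to the text content of the <statement> element.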

This is what the <statement> element from your example looks like when reformatted:

<statement content-type="theorem" id="stat1">
<label>Theorem 1.</label>
<p>Let 
<inline-formula><mml:math display="inline" overflow="scroll"><mml:mfenced open="(" close=")"><mml:mrow><mml:mi mathvariant="script">M</mml:mi><mml:mo>,</mml:mo><mml:mi>g</mml:mi></mml:mrow></mml:mfenced></mml:math>
<inline-graphic xlink:href="cqgab7bbaieqn7.gif"/>
</inline-formula> 
be a four-dimensional Riemannian spacetime obeying Einstein&#x2019;s field equations, 
<italic>R</italic><sub><italic>&#x3bc;&#x3bd;</italic></sub> &#x2212; (<italic>R</italic>/2)<italic>g</italic><sub><italic>&#x3bc;&#x3bd;</italic></sub> &#x2212; &#x39b;<italic>g</italic><sub><italic>&#x3bc;&#x3bd;</italic></sub> &#x3d; &#x3f0;<italic>T</italic><sub><italic>&#x3bc;&#x3bd;</italic></sub>. 
There [a,b] and [z;y] (<italic>c</italic>,<italic>d</italic>) are 134 rose in this [41] garden with 
<math><mi>x</mi><mo>=</mo><mn>2</mn></math> 
and some more text with number 1,2,3, etc. and some [45] etc.</p>
</statement>

You can see that it is actually quite complex.

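Now, if I understand correctly, you want to add a <rom> element around every run of [](),;: characters and digits. To do that, you need to process all child elements recursively, find the text nodes, and add the <rom> elements there.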
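On a single plain-text string, the wrapping itself is a one-line substitution. The sketch below assumes the string contains no markup. The helper name wrap_rom is hypothetical, and the character set is the one from the question's script with the colon and semicolon added.

```lua
-- Hypothetical helper: wrap each run of brackets, digits and ,:; in <rom>.
-- Assumes `text` is plain text with no markup in it.
local special = "[%(%[%)%]0-9,:;]+"

local function wrap_rom(text)
  -- %0 stands for the whole match; the outer parentheses drop gsub's
  -- second return value (the number of substitutions)
  return (text:gsub(special, "<rom>%0</rom>"))
end

print(wrap_rom("are 134 rose in this [41] garden"))
-- prints: are <rom>134</rom> rose in this <rom>[41]</rom> garden
```

Applied to the serialized XML as a whole, the same substitution would also hit attribute values and MathML content, which is why the text nodes have to be reached through the DOM first.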

Here is a library, statement-theorem.lua. It exports a function that takes a DOM object and processes the statement elements.

local special_pattern = "[%(%[%)%]0-9%,%:%;.]+"


local function split_text(child, newchildren)
  local text = child:get_text()
  local parent = child:get_parent()
  -- 
  local function make_text_node(text)
    if text ~= "" then
      table.insert(newchildren, parent:create_text_node(text))
    end
  end

  local function make_rom(text)
    -- make <rom> element 
    local rom = parent:create_element("rom")
    rom:add_child_node(rom:create_text_node(text))
    table.insert(newchildren, rom)
  end

  -- position in the text where the next search starts
  local prev = 1

  local function read_next()
    -- find the next run of special characters, starting at prev
    local start, stop = text:find(special_pattern, prev)
    if start then
      -- part of text between special characters
      local normal = text:sub(prev, start - 1)
      local special = text:sub(start, stop)
      make_text_node(normal)
      make_rom(special)
      prev = stop + 1
      return true
    else
      -- process text after the last special character
      make_text_node(text:sub(prev, text:len()))
      return false
    end
  end
  while read_next() do
  end
end



local function add_roman(element)
  -- process all child elements of statement, find text content and add <rom>
  -- elements to numbers and braces
  local newchildren = {}
  for _, child in ipairs(element:get_children()) do
    if child:is_text() then
      local text = child:get_text()
      -- detect if text contains special characters
      if text:match(special_pattern) then
        -- process only text that contain special characters
        split_text(child, newchildren)
      else
        table.insert(newchildren, child)
      end
    else
      if child:is_element() then
        -- recursively process child elements, but ignore MathML
        if not child:get_element_name():match(":?math$") then
          add_roman(child)
        end
      end
      table.insert(newchildren, child)
    end
  end
  element._children = newchildren
end

local function process_theorems(dom)
  -- we want to process all <statement> elements
  for _, statement in ipairs(dom:query_selector "statement[content-type='theorem']") do
    add_roman(statement)
  end
end

-- return the processing function
return process_theorems

I assume you don't want to process MathML, so the library does not process <math> elements.
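To see which elements the name test in add_roman skips, the pattern can be tried on a few element names. This is only an illustration; the helper name is_math is hypothetical and not part of the library.

```lua
-- The name test used in add_roman, applied to a few element names.
local function is_math(name)
  return name:match(":?math$") ~= nil
end

for _, name in ipairs({ "math", "mml:math", "italic", "p" }) do
  print(name, is_math(name))
end
```

Both math and mml:math are skipped, and because the recursion stops at them, their children such as mml:mi are never visited. The pattern is anchored only at the end, so any element whose name merely ends in "math" would be skipped as well. If that matters for your documents, a stricter test such as name == "math" or name:match(":math$") is one option.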

It can be used from a script like this:

kpse.set_program_name "luatex"
-- require LuaXML DOM library and load XML file from the standard input
local domobject = require "luaxml-domobject"
local process_theorems = require "statement-theorem"
local input = io.read("*all")
local dom = domobject.parse(input)

process_theorems(dom)

print(dom:serialize())

Run it like this:

texlua addrom.lua < sample.xml

Note that the XML file must have a root element, so I added a dummy <root> element to make it work. Here is the resulting XML:

<root>
<p>The investigations of cylindrically symmetric spacetimes can be traced back as far as to 1919 when Levi-Civita (LC) discovered a class of solutions of Einstein’s vacuum field equations, corresponding to static cylindrical spacetimes [1]. The extension of the LC spacetimes to stationary ones was obtained independently by Lanczos in 1924 [3] and Lewis in 1932 [9]. In 1925, Beck studied a class of exact solutions and interpreted them as representing the propagation of cylindrical gravitational waves (GWs) [4].</p>
<statement id='stat1' content-type='theorem'>
<label>Theorem <rom>1.</rom></label>
<p>Let <inline-formula><mml:math display='inline' overflow='scroll'><mml:mfenced close=')' open='('><mml:mrow><mml:mi mathvariant='script'>M</mml:mi><mml:mo>,</mml:mo><mml:mi>g</mml:mi></mml:mrow></mml:mfenced></mml:math><inline-graphic xlink:href='cqgab7bbaieqn7.gif'></inline-graphic></inline-formula> 
 be a four-dimensional Riemannian spacetime obeying Einstein’s field equations<rom>,</rom> 
 <italic>R</italic><sub><italic>μν</italic></sub> − 
 <rom>(</rom><italic>R</italic>/<rom>2)</rom><italic>g</italic><sub><italic>μν</italic></sub> − Λ<italic>g</italic><sub><italic>μν</italic></sub> = ϰ<italic>T</italic><sub><italic>μν</italic></sub><rom>.</rom> 
 There <rom>[</rom>a<rom>,</rom>b<rom>]</rom> and <rom>[</rom>z<rom>;</rom>y<rom>]</rom> <rom>(</rom><italic>c</italic><rom>,</rom><italic>d</italic><rom>)</rom> are <rom>134</rom> rose in this <rom>[41]</rom> garden with <math><mi>x</mi><mo>=</mo><mn>2</mn></math> and some more text with number <rom>1,2,3,</rom> etc<rom>.</rom> and some <rom>[45]</rom> etc<rom>.</rom></p></statement>
<p>He is supported in part by the National Natural Science Foundation of China (NNSCF) with the Grants Nos. 11675145 and 11975203.</p>
</root>
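If you would rather not edit the input file, the dummy <root> element could also be added in the script before parsing. This is a minimal sketch; the helper name wrap_in_root is hypothetical, and it assumes the input has no XML declaration or DOCTYPE of its own.

```lua
-- Hypothetical helper: wrap an XML fragment in a dummy <root> element so
-- that the parser sees a single top-level element.
-- Assumes the fragment has no XML declaration or DOCTYPE of its own.
local function wrap_in_root(fragment)
  return "<root>\n" .. fragment .. "\n</root>"
end

print(wrap_in_root("<p>one</p>\n<statement>two</statement>"))
```

In the driver script above, this would mean calling domobject.parse(wrap_in_root(input)) instead of domobject.parse(input).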
