Make4ht: output XML with entities formatted as hexadecimal values

How can I get all special characters, such as \nabla, written as XML entities (&#x2207;) instead of literal UTF-8 characters?

My MWE is:

\documentclass{acm-book}
\usepackage{balance}
\usepackage{amsmath}
\usepackage{booktabs,hyperref,listings,xcolor,colortbl}
\usepackage[inactive]{fancytooltips}
\usepackage{wrapfig}
\usepackage{afterpage}
\usepackage{makeidx}

\begin{document}

\chapter{What is {\textquotedblleft}Software{\textquotedblright}?}

Furthermore, legacy software systems are notoriously difficult to replace. As noted experienced by this author as a chief information officer, legacy systems take considerable effort and money to replace and tend to be built upon, rather than replaced. So, those working on systems for complex organizations are likely to have to deal with these existing software systems. US Social Security Administration still dependencies on legacy software further entrenches its use. Other systems used by the US government have software sub-systems

\begin{align}
   IR_{\text{max(sec)}} = \frac{B}{\beta} \{\max_x\{H(X) - H(X|Y)\}\}
\end{align}

\noindent where $C$ denotes the concentration of molecules, $D$ is the diffusion coefficient of medium given by $D = K_BT/6\pi\eta r_m$, $\nabla^2$ denotes the squared-differential operator given in Cartesian coordinates $\{x,y,z\}$, $\nabla^2 = i\frac{\partial^2}{\partial x^2} + j\frac{\partial^2}{\partial y^2} + k\frac{\partial^2}{\partial z^2}$, $T$ is the temperature of operation degree Kelvin, $\eta$ is the viscosity of the medium, $r_m$ is the radius of the information molecule and $K_B$ is the Boltzmann constant.

When reusing existing software, it is wise to evaluate the relevance of the techniques and assumptions that were used in building that original software. This book focuses on software as a technology and how it has evolved over time. We will look at the trends, important innovations, and events, as well as the ever-broadening world of software.

\end{document}

Answer 1

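The problem is that LuaXML, which is used for post-processing of the XML files, converts almost everything to Unicode text, even if you used XML entities. You need to use a modified DOM serialization function to convert these characters back to entities. To be honest, I don't see why you would want entities instead of UTF-8 characters in 2022. It seems like a strange requirement.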

Anyway, try this build file, build.lua:

-- local domobject = require "luaxml-domobject"
local filter = require "make4ht-filter"
local domobject = require "luaxml-domobject"


local codes = utf8.codes
local uchar = utf8.char
local escapes = {
  [">"] = "&gt;",
  ["<"] = "&lt;",
  ["&"] = "&amp;",
  ['"'] = "&quot;",
  ["'"] = "&#39;",
  ["`"] = "&#x60;"
}


local escape_element = function(text)
  local t = {}
  for _, codepoint in codes(text) do
    if codepoint > 128 then
      t[#t+1] = string.format("&#x%x;", codepoint) 
    else
      local char = uchar(codepoint)
      t[#t+1] =  char -- escapes[char] or char
    end
  end
  local result = table.concat(t)
  return result
end

-- this is a copy of serializing stuff from luaxml-domobject.lua
local void = {area = true, base = true, br = true, col = true, hr = true, img = true, input = true, link = true, meta = true, param = true}

local escapes = {
  [">"] = "&gt;",
  ["<"] = "&lt;",
  ["&"] = "&amp;",
  ['"'] = "&quot;",
  ["'"] = "&#39;",
  ["`"] = "&#x60;"
}

local function escape(search, text)
  return text:gsub(search, function(ch)
    return escapes[ch] or ""
  end)
end


local function escape_attr(text)
  return escape("([<>&\"'`])", text)
end

local actions = {
  TEXT = {text = "%s"},
  COMMENT = {start = "<!-- ", text = "%s", stop = " -->"},
  ELEMENT = {start = "<%s%s>", stop = "</%s>", void = "<%s%s />"},
  DECL = {start = "<?%s %s?>"},
  PI = {start = "<?%s %s?>"},
  DTD = {start = "<!DOCTYPE ", text = "%s" , stop=">"},
  CDATA = {start = "<![CDATA[", text = "%s", stop ="]]>"}
  
}

--- It serializes the DOM object back to the XML.
-- This function is mainly used for internal purposes, it is better to
-- use the `DOM_Object:serialize()`.
-- @param parser DOM object
-- @param current Element which should be serialized
-- @param level 
-- @param output
-- @return table Table with XML strings. It can be concatenated using table.concat() function to get XML string corresponding to the DOM_Object.
local function serialize_dom(parser, current,level, output)
  local output = output or {}
  local function get_action(typ, action)
    local ac = actions[typ] or {}
    local format = ac[action] or ""
    return format
  end
  local function insert(format, ...)
    table.insert(output, string.format(format, ...))
  end
  local function prepare_attributes(attr)
    local t = {}
    local attr = attr or {}
    for k, v in pairs(attr) do
      t[#t+1] = string.format("%s='%s'", k, escape_attr(v))
    end
    -- sort attributes alphabetically. this will ensure that
    -- their order will not change between several executions of dom:serialize()
    table.sort(t)
    if #t == 0 then return "" end
    -- add space before attributes
    return " " .. table.concat(t, " ")
  end
  local function start(typ, el, attr)
    local format = get_action(typ, "start")
    insert(format, el, prepare_attributes(attr))
  end
  local function text(typ, text)
    local format = get_action(typ, "text")
    insert(format, escape_element(text))
  end
  local function stop(typ, el)
    local format = get_action(typ, "stop")
    insert(format,el)
  end
  local level = level or 0
  local spaces = string.rep(" ",level)
  local root= current or parser._handler.root
  local name = root._name or "unnamed"
  local xtype = root._type or "untyped"
  local text_content = root._text or ""
  local attributes = root._attr or {}
  -- if xtype == "TEXT" then
  --   print(spaces .."TEXT : " .. root._text)
  -- elseif xtype == "COMMENT" then
  --   print(spaces .. "Comment : ".. root._text)
  -- else
  --   print(spaces .. xtype .. " : " .. name)
  -- end
  -- for k, v in pairs(attributes) do
  --   print(spaces .. " ".. k.."="..v)
  -- end
  if xtype == "DTD" then
    text_content = string.format('%s %s "%s" "%s"', name, attributes["_type"] or "",  attributes._name, attributes._uri )
    -- remove unused fields
    text_content = text_content:gsub('"nil"','')
    text_content = text_content:gsub('%s*$','')
    attributes = {}
  elseif xtype == "ELEMENT" and void[name] and #current._children < 1 then
    local format = get_action(xtype, "void")
    insert(format, name, prepare_attributes(attributes))
    return output
  elseif xtype == "PI" then
    -- it contains spurious _text attribute
    attributes["_text"] = nil
  elseif xtype == "DECL" and name =="xml" then
    -- the xml declaration attributes must be in a correct order
    local encoding = attributes.encoding or "utf-8"
    insert("<?xml version='%s' encoding='%s' ?>", attributes.version, encoding)
    return output
  elseif xtype == "CDATA" then
    -- return content unescaped
    insert("<![CDATA[%s]]>", text_content)
    return output
  end

  start(xtype, name, attributes)
  text(xtype,text_content) 
  local children = root._children or {}
  for _, child in ipairs(children) do
    output = serialize_dom(parser,child, level + 1, output)
  end
  stop(xtype, name)
  return output
end



local process = filter {
  function(text)
    local dom = domobject.parse(text)
    return table.concat(serialize_dom(dom))
  end
}

-- trick to insert this filter to the end
Make:match("xml$", function()
  Make:match("xml$", process)
end)

Most of its code is a copy of the serialization functions from luaxml-domobject.lua. We changed only one function, escape_element, which converts characters to entities:

local escape_element = function(text)
  local t = {}
  for _, codepoint in codes(text) do
    if codepoint > 128 then
      t[#t+1] = string.format("&#x%x;", codepoint) 
    else
      local char = uchar(codepoint)
      t[#t+1] =  char -- escapes[char] or char
    end
  end
  local result = table.concat(t)
  return result
end
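
To see what this function produces without running a full make4ht build, the same logic can be tried on its own. The following is a minimal sketch, assuming a Lua 5.3 or newer interpreter (or texlua) that provides the utf8 library; the sample string is made up for illustration. It tests codepoint > 127 so that U+0080 is escaped too, which differs from the > 128 test above only for that single code point.

```lua
-- Standalone sketch of the escaping logic, not part of build.lua.
-- Assumes Lua 5.3+ (utf8 library and \u{...} string escapes).
local function escape_element(text)
  local t = {}
  for _, codepoint in utf8.codes(text) do
    if codepoint > 127 then
      -- non-ASCII code point: write a hexadecimal character reference
      t[#t+1] = string.format("&#x%x;", codepoint)
    else
      -- ASCII is copied unchanged, as in the function above
      t[#t+1] = utf8.char(codepoint)
    end
  end
  return table.concat(t)
end

-- hypothetical sample text: curly quotes and the nabla sign
local sample = "What is \u{201C}Software\u{201D}? \u{2207}"
print(escape_element(sample))
-- prints: What is &#x201c;Software&#x201d;? &#x2207;
```

Note that ASCII characters such as < and & pass through untouched, so the function relies on the text already being escaped where that is needed.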

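One more trick is necessary. This filter must be executed as the last one, because other DOM filters would convert the entities back to characters. This can be achieved with the following code: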

-- trick to insert this filter to the end
Make:match("xml$", function()
  Make:match("xml$", process)
end)

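There is one more issue. The current JATS configuration in TeX4ht can fail on some MathML code. I have fixed this in the sources. Until the update reaches TeX Live, you can use this tex4ht.usr file to ensure a correct conversion: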

\Configure{jats}{%
   \Hinclude[*]{html4.4ht}% we will build upon HTML
   \Hinclude[*]{jats.4ht}%
   \Hinclude[*]{mathml.4ht}%
   \Hinclude[*]{html-mml.4ht}%
   \Hinclude[*]{unicode.4ht}%
}

This is the result:

<label>Chapter&#xa0;1</label><title id='x1-10001'> What is &#x201c;Software&#x201d;?</title>
