Writing a TeX-to-UTF-8 converter using the TeX compiler

Question

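First, if you like having ĄąĆćĘę£łŃńÓóŚś-źŻż in your .tex file, you can simply type (or paste) those characters into the file directly. All you need is \usepackage[utf8]{inputenc} with pdfTeX, or a Unicode-aware engine (XeTeX or LuaTeX). For example, the following works (when compiled with xelatex):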

\documentclass{article}
\begin{document}
ĄąĆćĘę£łŃńÓóŚś-źŻż
\end{document}

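If the problem is that you have no convenient (or easy-to-remember) keyboard layout for typing these characters, so you would rather type them as TeX macros (while still wanting the file to contain the characters above), then this is just a matter of setting up your editor or input system. For example (as suggested in a comment by user Loop Space), Emacs can do this with M-x set-input-method RET TeX: when you type \=o on the keyboard, what ends up in the file is ō. You don't have to use Emacs; the same kind of functionality is available in input methods such as UIM.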

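So if you are the one creating the .tex file, I see no reason to use TeX itself for this conversion: it is better to find a way of inserting the characters you prefer in the first place.

However, if you are working with a .tex file created by someone else (and you are allowed to change it), or one that you created yourself before you had this preference, then the question may make sense.

The main benefit of using TeX (rather than a simple search-and-replace in an editor) is being able to know when the definition of a macro such as \L or \O has changed. This is also the problem illustrated in the question.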

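So, to solve this problem, I use the introspective (a.k.a. reflective) features that LuaTeX has: specifically, token.get_macro, which lets us see the definition of a macro, and the process_input_buffer callback, which lets us inspect each line of input (and change it if needed). The idea is:

Before the text starts, record the "original" definitions of all the known character-replacement macros (\L, \", \c, etc.). This lets us know when they have been redefined.
For each line of input, look for the macros that occur in that line, check that their definitions have not changed, and (if they have not) replace them and their arguments with the appropriate Unicode equivalents.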
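The two steps above can be illustrated outside of TeX with a minimal plain-Lua sketch. Here the table current_defs and the function get_def are hypothetical stand-ins for TeX's macro table and for token.get_macro; they are not part of LuaTeX and only serve to show the snapshot-and-compare logic.

```lua
-- Hypothetical stand-in for TeX's macro table; inside LuaTeX one would
-- call token.get_macro(name) instead of reading this table.
local current_defs = { L = "orig-L", c = "orig-c" }
local function get_def(name) return current_defs[name] end

-- Step 1: snapshot the definitions before the text starts.
local watched = { "L", "c" }
local snapshot = {}
for _, name in ipairs(watched) do
  snapshot[name] = get_def(name)
end

-- Step 2: before rewriting a line, check that a macro still has the
-- meaning recorded in the snapshot.
local function is_unchanged(name)
  return get_def(name) == snapshot[name]
end

print(is_unchanged("L"))   -- true
current_defs.L = "LLL"     -- simulates \renewcommand\L{LLL}
print(is_unchanged("L"))   -- false: \L must no longer be rewritten
print(is_unchanged("c"))   -- true: \c is still safe to rewrite
```

Comparing against a snapshot, rather than against a hard-coded definition, means the check does not depend on what the format or loaded packages happen to define these macros as.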

So, using the example from the question, in a file called, say, mwe.tex:

\documentclass{article}
\directlua{dofile('rewrite.lua')}

\newcommand\zzz{hello}

\begin{document}

\L\"{o}\"{o}\c{k} \zzz

\renewcommand\L{LLL}
\renewcommand\"[1]{#1#1}
\renewcommand\c{c}

\L\"{o}\"{o}\c{k} \zzz

\end{document}

(note the added \directlua{dofile(...)} line), you can run lualatex mwe.tex (some lines snipped):

9:41:29:~/tmp% lualatex mwe.tex
This is LuaTeX, Version 1.0.4 (TeX Live 2017) 
...
The original definition of #\L# is \TU-cmd \L \TU\L 
The original definition of #\c# is \TU-cmd \c \TU\c 
The original definition of #\"# is \TU-cmd \"\TU\" 
...
Processing line: \begin{document}
 --> Rewrote line to \begin{document}
...
Processing line: \L\"{o}\"{o}\c{k} \zzz
 --> Rewrote line to Łööķ \zzz
Processing line: 
 --> Rewrote line to 
Processing line: \renewcommand\L{LLL}
 ^ This line contains a \def or \newcommand or \renewcommand. Not rewriting.
...
Processing line: \L\"{o}\"{o}\c{k} \zzz
 --> Rewrote line to \L\"{o}\"{o}\c{k} \zzz

You will find a file mwe.rewritten.tex with the following contents:

\newcommand\zzz{hello}

\begin{document}
\relax

Łööķ \zzz

\renewcommand\L{LLL}
\renewcommand\"[1]{#1#1}
\renewcommand\c{c}

\L\"{o}\"{o}\c{k} \zzz

\end{document}
\relax

As you can see, only the replacements that should have happened did happen. The Lua file that implements this (called rewrite.lua above) is:

print('')
rewritten_file = io.open(tex.jobname .. '.rewritten.tex', 'w')

funny_noarg = {
   ["\\L"] = "Ł",
   -- Define similarly for \oe \OE \ae \AE \aa \AA \o \O \l \i \j
}
funny_nonletter = {
   ['\\"'] = function(c) return c .. "̈" end,
   -- Define similarly for \` \' \^ \~ \= \.
}
funny_letter = {
   ["\\c"] = function(c) return c .. "̧" end,
   -- Define similarly for \u \v \H \d \b \t
}

orig_defs = {}
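function populate_orig_defs()
   -- Record the current meaning of macro s (s includes the leading backslash).
   local function set_def(s)
      -- token.get_macro takes the name without the backslash.
      local definition = token.get_macro(s:sub(2))
      orig_defs[s] = definition
      -- tostring guards against a nil result when printing.
      print('The original definition of #' .. s .. '# is ' .. tostring(definition))
   end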
   for s, v in pairs(funny_noarg) do set_def(s) end
   for s, v in pairs(funny_letter) do set_def(s) end
   for s, v in pairs(funny_nonletter) do set_def(s) end
end
populate_orig_defs()

function literalize(s)
   -- The string s, with special characters escaped, in a format safe for using inside gsub.
   -- https://stackoverflow.com/questions/1745448/lua-plain-string-gsub#comment18401212_1746473
   return s:gsub("[%(%)%.%%%+%-%*%?%[%]%^%$]", "%%%0")
end
function replace(s)
   print('Processing line: ' .. s)
   if s:find([[\def]]) ~= nil or s:find([[\newcommand]]) ~= nil or s:find([[\renewcommand]]) ~= nil then
      print(' ^ This line contains a \\def or \\newcommand or \\renewcommand. Not rewriting.')
      -- Copy the line to the output unchanged; returning nil leaves TeX's input untouched.
      rewritten_file:write(s .. '\n')
      return nil
   end
   for k, v in pairs(funny_noarg) do
      -- followed by a nonletter. TODO: Can use the catcode tables.
      if token.get_macro(k:sub(2)) == orig_defs[k] then
         s = s:gsub(literalize(k) .. '([^a-zA-Z])', function(capture) return v .. capture end)
      end
   end
   for k, v in pairs(funny_letter) do
      -- followed by a letter inside {}. TODO: Can use the catcode tables, also can support \c c, for example.
      if token.get_macro(k:sub(2)) == orig_defs[k] then
         s = s:gsub(literalize(k) .. '{(.)}', v)
      end
   end
   for k, v in pairs(funny_nonletter) do
      -- followed by a letter inside {}. TODO: We could also support \"o for example.
      if token.get_macro(k:sub(2)) == orig_defs[k] then
         s = s:gsub(literalize(k) .. '{(.)}', v)
      end
   end
   print(' --> Rewrote line to ' .. s)
   rewritten_file:write(s .. '\n')
   return nil
end

luatexbase.add_to_callback('process_input_buffer', replace, 'Replace some macros with UTF-8 equivalents')
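One thing to be aware of: the accent functions above append a combining character to the base letter, so the rewritten file contains decomposed sequences (o followed by U+0308) rather than precomposed characters (U+00F6). If precomposed output is preferred, one option is a small lookup table with the combining form as a fallback. The following is only a sketch; the names COMBINING, PRECOMPOSED and accent are mine, not part of rewrite.lua, and the table covers just two letters for illustration.

```lua
-- UTF-8 byte sequences are written as decimal escapes so that the
-- source does not depend on how an editor normalizes the characters.
local COMBINING = {
  ['"'] = "\204\136",  -- U+0308 COMBINING DIAERESIS
}
local PRECOMPOSED = {
  ['"'] = {
    o = "\195\182",    -- U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
    u = "\195\188",    -- U+00FC LATIN SMALL LETTER U WITH DIAERESIS
  },
}

-- Return the precomposed character when the table has one,
-- otherwise the base letter followed by the combining accent.
local function accent(acc, letter)
  local row = PRECOMPOSED[acc]
  local pre = row and row[letter]
  return pre or (letter .. COMBINING[acc])
end

print(accent('"', 'o'))  -- precomposed o with diaeresis
print(accent('"', 'x'))  -- no precomposed form in the table: x + U+0308
```

A complete solution would apply Unicode normalization (NFC) to the whole output instead of maintaining such a table by hand.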

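Since this is only a proof of concept rather than a production-quality system, I have taken some shortcuts, which you can fill in if you are interested in adopting this approach: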

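Unicode equivalents are listed for only a few of TeX's accent and special-character macros.
You need to re-insert the \documentclass{article} line (in fact, everything that comes before the \directlua{dofile(…)} line). (For fun, you can try moving that line before \documentclass and see what happens.)
You may want to put this line after all the \usepackage lines, perhaps at the start of \begin{document}. (If you tried the above, you will know why.)
You need to remove the \relax lines, such as the one at the end (we could probably arrange for them not to appear…)
It assumes that the input file uses the LaTeX convention \={o} rather than \=o; with a few more lines we could support the latter too. The same goes for \c k or \c {k}, etc., in place of \c{k}.
It completely ignores (replaces nothing on) lines containing \def or \newcommand; instead, if we wanted to (if the input file is badly written!), we could skip just to the end of the \def or \newcommand and then process the rest of the line.
It assumes (in order to know where a control sequence like \o ends) that "letters" are a-zA-Z; you may want to add @ to that list, and in fact we could use the exact definition of "letter" under the catcode regime active at that point, which LuaTeX also makes available.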
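As an illustration of the "few more lines" needed for the unbraced forms, here is a plain-Lua sketch that does not need LuaTeX to run. The function names rewrite_diaeresis and rewrite_cedilla are hypothetical, the letter set is still hard-coded as A-Za-z, and the check against the original macro definitions is left out for brevity.

```lua
local DIAERESIS = "\204\136"  -- UTF-8 bytes of U+0308 COMBINING DIAERESIS
local CEDILLA   = "\204\167"  -- UTF-8 bytes of U+0327 COMBINING CEDILLA

-- \" is a control symbol: its argument may follow immediately (\"o)
-- or be given in braces (\"{o}).
local function rewrite_diaeresis(line)
  line = line:gsub('\\"{([A-Za-z])}', "%1" .. DIAERESIS)
  line = line:gsub('\\"([A-Za-z])', "%1" .. DIAERESIS)
  return line
end

-- \c is a control word: spaces after it are skipped, so \c{k}, \c {k}
-- and \c k are equivalent, while \ck or \cite are different control
-- sequences and must be left alone.
local function rewrite_cedilla(line)
  line = line:gsub("\\c%s*{([A-Za-z])}", "%1" .. CEDILLA)
  line = line:gsub("\\c%s+([A-Za-z])", "%1" .. CEDILLA)
  return line
end

print(rewrite_diaeresis('\\"o\\"{u}'))
print(rewrite_cedilla("\\c k \\c {c} \\cite{x}"))
```

The braced form is handled first in each function so that the unbraced pattern never sees an opening brace as its argument. In the second example \cite{x} is left untouched because \c is followed directly by a letter.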

Note that even if you normally compile your file with pdfTeX or XeTeX, you can use LuaTeX for this conversion and then carry on using pdfTeX/XeTeX on the converted file.

Answer 1