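I am using this answer to generate a plain-text version of a fairly complex document so that I can spell-check it. This is my first attempt at using lualatex, so there is probably plenty wrong with it, but for the most part it does what I want: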
\documentclass{article}
\usepackage{luatexbase}
\usepackage{lipsum}
\usepackage{filecontents}
\usepackage{ifluatex}
\begin{filecontents*}{luaFunctions.lua}
-- clear the file
file = io.open("output.txt", "w")
file:write()
exportParagraph = false
function exportText (head)
  if exportParagraph == false then
    --if you return nil no pdf will be created
    -- return nil
    return head
  end
  -- open the file in append mode
  local out = io.open("output.txt", "a")
  local wordCounter = 0
  -- loop over all hboxes in the current paragraph
  for line in node.traverse_id (node.id("hlist"), head) do
    -- loop over each element in the line
    for item in node.traverse (line.list) do
      -- check if the element is a char
      if item.id == node.id("glyph") then
        out:write(string.char(item.char))
      -- check if the element is a 'space'
      elseif item.id == node.id("glue") then
        wordCounter = wordCounter + 1
        out:write(" ")
      end
    end
    -- a newline in the file after each (tex)line
    out:write("\n")
  end
  wordCounter = wordCounter - 1
  out:write("Words: "..wordCounter.."\n")
  -- a newline in the file after each paragraph
  out:write("\n")
  assert(out:close())
  exportParagraph = false
  --if you return nil no pdf will be created
  -- return nil
  return head
end
function disableLigatures(head)
  -- disable ligatures
end
function SetExportParagraph(export)
  exportParagraph = export
end
luatexbase.add_to_callback("ligaturing", disableLigatures, "disableLigatures")
luatexbase.add_to_callback("post_linebreak_filter", exportText, "exportText")
\end{filecontents*}
\ifluatex
\directlua{dofile("luaFunctions.lua")}
\fi
\def\exportParagraph{%
\ifluatex
\directlua{SetExportParagraph(true)}
\fi
}
\begin{document}
\exportParagraph
ff fi Lorem ipsum dolor sit amet, \textbf{consectetuer adipiscing elit. Ut purus elit,
vestibulum ut, placerat ac, adipiscing vitae, felis.} Curabitur dictum gravida
mauris. Nam arcu libero, nonummy eget, consectetuer id, vulputate a, magna.
Donec vehicula augue eu neque. Pellentesque habitant morbi tristique senectus
et netus et malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra
metus rhoncus sem. Nulla et lectus vestibulum urna fringilla ultrices. Phasellus
eu tellus sit amet tortor gravida placerat. Integer sapien est, iaculis in, pretium
quis, viverra ac, nunc. Praesent eget sem vel leo ultrices bibendum. Aenean
faucibus. Morbi dolor nulla, malesuada eu, pulvinar at, mollis ac, nulla. Cur-
abitur auctor semper nulla. Donec varius orci eget risus. Duis nibh mi, congue
eu, accumsan eleifend, sagittis quis, diam. Duis eget orci sit amet orci dignissim
rutrumrutrumrutrumrutrumrutrumrutrumrutrumrutrumrutrumrutrumrutrumrutrumrutrumrutrum.
Nam dui ligula, fringilla a, euismod sodales, sollicitudin vel, wisi. Morbi
auctor lorem non justo. Nam lacus libero, pretium at, lobortis vitae, ultricies et,
tellus. Donec aliquet, tortor sed accumsan bibendum, erat ligula aliquet magna,
vitae ornare odio metus a mi. Morbi ac orci et nisl hendrerit mollis. Suspendisse
ut massa. Cras nec ante. Pellentesque a nulla. Cum sociis natoque penatibus et
magnis dis parturient montes, nascetur ridiculus mus. Aliquam tincidunt urna.
Nulla ullamcorper vestibulum turpis. Pellentesque cursus luctus mauris.
\exportParagraph
Nulla malesuada porttitor diam. Donec felis erat, congue non, volutpat at,
tincidunt tristique, libero. Vivamus viverra fermentum felis. Donec nonummy
pellentesque ante. Phasellus adipiscing semper elit. Proin fermentum massa
ac quam. Sed diam turpis, molestie vitae, placerat a, molestie nec, leo. Mae-
cenas lacinia.
Nam ipsum ligula, eleifend at, accumsan nec, suscipit a, ipsum.
Morbi blandit ligula feugiat magna. Nunc eleifend consequat lorem. Sed lacinia
nulla vitae enim. Pellentesque tincidunt purus vel magna. Integer non enim.
Praesent euismod nunc eu purus. Donec bibendum quam in tellus. Nullam cur-
sus pulvinar lectus. Donec et mi. Nam vulputate metus eu enim. Vestibulum
pellentesque felis eu massa.
\end{document}
In the generated output, the nonsense word rutrumrutrumrutrumrutrumrutrumrutrumrutrumrutrumrutrumrutrumrutrumrutrumrutrumrutrum at the end of the first paragraph gets hyphenated:
[...]
ac, nulla. Cur- abitur auctor semper nulla. Donec varius orci eget risus. Duis
nibh mi, congue eu, accumsan eleifend, sagittis quis, diam. Duis eget orci sit
amet orci dignissim rutrumrutrumrutrumrutrumrutrumrutrumrutrumrutrum-
rutrumrutrumrutrumrutrumrutrumrutrum.
Words: 134
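This happens throughout my text and makes spell-checking it quite difficult. Is there a way to disable hyphenation completely for this hack (I hesitate to call it a solution)?
Answer 1
There are several node-processing callbacks in luatex, but post_linebreak_filter is not well suited to your purpose, because there you have to deal with a node list that has already been broken into lines. A better fit is pre_linebreak_filter, which is called before line breaking.
I also found some errors in your code, which showed up when I tried it with the fontspec package and some non-ASCII characters. First, here is the modified file: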
\documentclass{article}
\usepackage{luatexbase}
\usepackage{fontspec}
%\setmainfont{TeX Gyre Schola}
\usepackage{lipsum}
\usepackage{filecontents}
\usepackage{ifluatex}
\begin{filecontents*}{luaFunctions.lua}
-- clear the file
local file = io.open("output.txt", "w")
file:write()
file:close()
local char = unicode.utf8.char
exportParagraph = false
function exportText (head, listtype)
  --[[
  -- it is better to solve this using attributes
  if exportParagraph == false then
    --if you return nil no pdf will be created
    -- return nil
    return head
  end --]]
  -- open the file in append mode
  local out = io.open("output.txt", "a")
  local wordCounter = 0
  local charcount = 0
  local function traverse(h)
    local word = false
    for item in node.traverse (h) do
      local skip = node.has_attribute(item,
        luatexbase.attributes.wordcounton)
      if skip == 2 then
        -- check if the element is a char
        if item.id == node.id("glyph") then
          if node.is_node(item.components) then
            traverse(item.components)
          else
            out:write(char(item.char))
            charcount = charcount + 1
            word = true
          end
        elseif
          item.id == node.id("hlist")
          or item.id == node.id("vlist")
          or item.id == node.id("insert")
          or item.id == node.id("adjust")
        then
          -- out:write(item.id..","..item.subtype.."[")
          traverse(item.head)
          -- out:write "]"
        -- check if the element is a 'glue'. this means not only space
        elseif item.id == node.id("glue") and item.subtype == 0 then
          -- glue nodes don't have to be spaces, count only after word
          if word then
            wordCounter = wordCounter + 1
            charcount = charcount + 1
          end
          word = false
          out:write(" ")
        end
      end
    end
    -- if word then wordCounter = wordCounter + 1 end
  end
  -- loop over all hboxes in the current paragraph
  --for line in node.traverse_id (node.id("hlist"), head) do
  -- loop over each element in the line
  traverse(head)
  -- a newline in the file after each (tex)line
  out:write("\n")
  --end
  -- wordCounter = wordCounter - 1
  out:write("Words: "..wordCounter)
  out:write(", characters: "..charcount)
  out:write(", list type: "..listtype.."\n")
  -- a newline in the file after each paragraph
  out:write("\n")
  assert(out:close())
  --exportParagraph = false
  --if you return nil no pdf will be created
  -- return nil
  return head
end
function disableLigatures(head)
  -- disable ligatures
end
function SetExportParagraph(export)
  exportParagraph = export
end
luatexbase.add_to_callback("ligaturing", disableLigatures, "disableLigatures")
luatexbase.add_to_callback("pre_linebreak_filter", exportText, "exportText")
\end{filecontents*}
\ifluatex
\newluatexattribute\wordcounton
\directlua{dofile("luaFunctions.lua")}
\fi
\def\startExportParagraph{%
\ifluatex
\wordcounton = 2
%\directlua{SetExportParagraph(true)}
\fi
}
\def\stopExportParagraph{%
\ifluatex
\wordcounton = 1
\fi
}
\begin{document}
\startExportParagraph
\noindent
ff fi Lorem ipsum dolor sit amet, příliš žluťoučký text s diakritikou
dash\footnote{you should test some options} -- \hbox{how does that work?}
\textbf{consectetuer adipiscing elit. Ut purus elit,
vestibulum ut, placerat ac, adipiscing vitae, felis.} Curabitur dictum gravida
mauris. Nam arcu libero, nonummy eget, consectetuer id, vulputate a, magna.
Donec vehicula augue eu neque. Pellentesque habitant morbi tristique senectus
et netus et malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra
metus rhoncus sem. Nulla et lectus vestibulum urna fringilla ultrices. Phasellus
eu tellus sit amet tortor gravida placerat. Integer sapien est, iaculis in, pretium
quis, viverra ac, nunc. Praesent eget sem vel leo ultrices bibendum. Aenean
faucibus. Morbi dolor nulla, malesuada eu, pulvinar at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci eget risus. Duis nibh mi, congue
eu, accumsan eleifend, sagittis quis, diam. Duis eget orci sit amet orci dignissim
rutrumrutrumrutrumrutrumrutrumrutrumrutrumrutrumrutrumrutrumrutrumrutrumrutrumrutrum.
\begin{tabular}{ll}
what & about\\
tables&?
\end{tabular}
\begin{itemize}
\item you also want to save itemize
\item items
\end{itemize}
You can \stopExportParagraph stop word counting in the middle of \startExportParagraph the paragraph.
Nam dui ligula, fringilla a, euismod sodales, sollicitudin vel, wisi. Morbi
auctor lorem non justo. Nam lacus libero, pretium at, lobortis vitae, ultricies et,
tellus. Donec aliquet, tortor sed accumsan bibendum, erat ligula aliquet magna,
vitae ornare odio metus a mi. Morbi ac orci et nisl hendrerit mollis. Suspendisse
ut massa. Cras nec ante. Pellentesque a nulla. Cum sociis natoque penatibus et
magnis dis parturient montes, nascetur ridiculus mus. Aliquam tincidunt urna.
Nulla ullamcorper vestibulum turpis. Pellentesque cursus luctus mauris.
\stopExportParagraph
Nulla malesuada porttitor diam. Donec felis erat, congue non, volutpat at,
tincidunt tristique, libero. Vivamus viverra fermentum felis. Donec nonummy
pellentesque ante. Phasellus adipiscing semper elit. Proin fermentum massa
ac quam. Sed diam turpis, molestie vitae, placerat a, molestie nec, leo. Mae-
cenas lacinia.
Nam ipsum ligula, eleifend at, accumsan nec, suscipit a, ipsum.
Morbi blandit ligula feugiat magna. Nunc eleifend consequat lorem. Sed lacinia
nulla vitae enim. Pellentesque tincidunt purus vel magna. Integer non enim.
Praesent euismod nunc eu purus. Donec bibendum quam in tellus. Nullam cur-
sus pulvinar lectus. Donec et mi. Nam vulputate metus eu enim. Vestibulum
pellentesque felis eu massa.
\end{document}
A global variable named file was used, and it interfered with some variable in fontspec. All private variables should be local! The file was also never closed.
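As a minimal sketch of that point, the file handling can be wrapped in small functions that keep the handle local and always close it. The helper names clear_file and append_line are made up for this illustration and are not part of the code above; the snippet uses only the standard Lua io library, so it also runs outside LuaTeX.

```lua
-- Hypothetical helpers (names invented for this sketch).
-- Each one keeps its file handle local and closes it before returning.

local function clear_file(path)
  -- Opening in "w" mode truncates the file; nothing has to be written.
  local handle = assert(io.open(path, "w"))
  handle:close()
end

local function append_line(path, text)
  -- "a" mode appends, so earlier paragraphs stay in the file.
  local handle = assert(io.open(path, "a"))
  handle:write(text, "\n")
  assert(handle:close())
end

clear_file("output.txt")
append_line("output.txt", "Words: 0")
```

Because nothing here assigns to a global, loading the file cannot overwrite a name that another package relies on.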
When dealing with Unicode characters we cannot use the string.char function; we have to use unicode.utf8.char instead.
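The difference can be seen in a short sketch. It assumes either LuaTeX (where unicode.utf8.char is available) or a standalone Lua 5.3 or later (where the standard utf8.char does the same job); the fallback line is my addition so that the snippet runs in both.

```lua
-- Pick a UTF-8 encoder: the one LuaTeX provides, else the Lua 5.3+ standard one.
-- (The fallback is an assumption made for this sketch.)
local char = (unicode and unicode.utf8 and unicode.utf8.char) or utf8.char

local codepoint = 0x17E  -- U+017E, the "ž" from the test sentence

-- string.char only accepts byte values 0..255, so this call raises an error;
-- pcall catches it and reports failure.
local ok = pcall(string.char, codepoint)

-- The UTF-8 encoder returns the two-byte sequence C5 BE instead.
local encoded = char(codepoint)

print(ok, #encoded, encoded)
```

For pure ASCII text both functions give the same result, which is probably why the problem only showed up once accented characters were added.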
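Then I rewrote the node traversal loop as a recursive function, because sub-lists can occur inside the node list and we have to process them as well. See the traverse function.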
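The idea of the recursion can be shown without LuaTeX at all. In the sketch below, plain Lua tables stand in for nodes: the field names id, char, head and next mirror the real ones, but the constructors and the collect function are invented for this illustration, and ids are strings rather than the numbers that node.id returns.

```lua
-- Mock node constructors (illustration only, not the LuaTeX node API).
local function glyph(c, nxt) return { id = "glyph", char = c, next = nxt } end
local function glue(nxt) return { id = "glue", next = nxt } end
local function hlist(head, nxt) return { id = "hlist", head = head, next = nxt } end

-- Collect text depth-first: emit glyphs, turn glue into spaces,
-- and descend into boxes instead of skipping them.
local function collect(head, out)
  out = out or {}
  local item = head
  while item do
    if item.id == "glyph" then
      out[#out + 1] = item.char
    elseif item.id == "hlist" then
      collect(item.head, out)  -- recursion handles any nesting depth
    elseif item.id == "glue" then
      out[#out + 1] = " "
    end
    item = item.next
  end
  return out
end

-- "a", a space, then a box that holds "bc"
local list = glyph("a", glue(hlist(glyph("b", glyph("c")))))
local text = table.concat(collect(list))
print(text)
```

A flat loop over the same list would stop at the box and lose "bc", which is what happens to the contents of an \hbox or a tabular when only the top-level list is walked.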
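The document interface was changed: there are now two macros, \startExportParagraph and \stopExportParagraph. They use luatex's node attribute mechanism, which makes it possible to switch counting on and off more flexibly, even in the middle of a paragraph. A character count was added as well.
I added some test cases: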
ff fi Lorem ipsum dolor sit amet, příliš žluťoučký text s diakritikou
dash\footnote{you should test some options} -- \hbox{how does that work?}
\textbf{consectetuer adipiscing elit. Ut purus elit,...
\begin{tabular}{ll}
what & about\\
tables&?
\end{tabular}
\begin{itemize}
\item you also want to save itemize
\item items
\end{itemize}
You can \stopExportParagraph stop word counting in the middle of \startExportParagraph the paragraph.
This is saved to output.txt:
1you should test some options
Words: 4, characters: 29, list type: insert
ff fi Lorem ipsum dolor sit amet, příliš žluťoučký text s diakritikou dash1 -- how does that work? consectetuer adipiscing elit.
what about tables ?
Words: 4, characters: 20, list type:
• you also want to save itemize
Words: 7, characters: 30, list type:
• items
Words: 2, characters: 6, list type:
You can the paragraph.
Words: 4, characters: 22, list type:
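As you can see, footnotes produce a paragraph of their own, which appears before the paragraph they occur in. Ligatures are split into their parts, so ffi and fi are counted correctly. But this also causes dashes to be split into --. The bullets in the itemize environment are counted as words; I still have to investigate how to solve this. In addition, the character count is wrong.
Answer 2
If you are not looking for a Lua solution (which is no doubt possible), you can use the classic TeX approach:
\begin{document}\language-1
This switches hyphenation off.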