How can I insert PDF literals before and after a paragraph (local_par) node? My code does not work. Thanks for any help.
\documentclass{article}
\usepackage{luacode}
\usepackage{polyglossia}
\setmainlanguage[babelshorthands=true]{russian}
\setmainfont{Times New Roman}
\pagestyle{empty}
\thispagestyle{empty}
\begin{luacode}
local pdf_node=node.new("whatsit","pdf_literal")
local pdf_node0=node.new("whatsit","pdf_literal")
pdf_node.data='/Span <</ActualText<FEFF041F04400438043204350442>>> BDC'
pdf_node0.data='EMC'
pdf_node.mode=2
pdf_node0.mode=2
local function insert_local_par(par,b)
  par=node.insert_before(par,par,pdf_node)
  par=node.insert_after(par,node.tail(par),pdf_node0)
end
luatexbase.add_to_callback("insert_local_par",insert_local_par,"insert_local_par")
\end{luacode}
\begin{document}
Test
hello
\newpage
par
new par
\end{document}
Answer 1
You do not need to do this in a callback. Instead, you can inspect the shipout box and search it for local_par nodes.
\documentclass{article}
\usepackage{atbegshi}
\usepackage{luacode}
\begin{luacode}
local BDC = node.new("whatsit","pdf_literal")
local EMC = node.new("whatsit","pdf_literal")
BDC.data = '/Span <</ActualText<FEFF041F04400438043204350442>>> BDC'
BDC.mode = 2
EMC.data = 'EMC'
EMC.mode = 2
function tag_local_par(parent, level)
  local head = parent.list
  while head do
    -- texio.write_nl(string.rep(" ", level) .. tostring(head))
    if head.id == node.id"hlist" or head.id == node.id"vlist" then
      if head.list and head.list.id == node.id"local_par" then
        local par_list = head.list and head.list.next -- local_par should always be followed by a list
        head.list = node.insert_before(head.list, head.list, node.copy(BDC))
        par_list = node.insert_after(par_list, node.tail(par_list), node.copy(EMC))
      end
      tag_local_par(head, level + 1)
    end
    head = head.next
  end
end
\end{luacode}
\AtBeginShipout{\directlua{tag_local_par(tex.box["AtBeginShipoutBox"], 0)}}%
\begin{document}
Test
hello
\newpage
par
new par
\end{document}
Running pdftotext on the result, I then get:
Привет
Привет
1
Привет
Привет
2
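The hex string in the BDC literal above, FEFF041F04400438043204350442, is the UTF-16BE encoding of "Привет" preceded by the byte order mark FEFF, which is why pdftotext prints that word for every tagged line. As a minimal sketch, assuming Lua 5.3 or later (for utf8.codes and the // operator), such a string could be generated rather than written by hand. The helper name to_pdf_hex is hypothetical and not part of the answer's code.

```lua
-- Hypothetical helper (not from the original answer): encode a UTF-8 string
-- as a PDF hex string in UTF-16BE with a leading byte order mark.
local function to_pdf_hex(text)
  local parts = { "FEFF" }                      -- byte order mark
  for _, cp in utf8.codes(text) do              -- iterate over code points
    if cp < 0x10000 then
      parts[#parts + 1] = string.format("%04X", cp)
    else
      -- code points outside the BMP are written as a surrogate pair
      local v = cp - 0x10000
      parts[#parts + 1] = string.format("%04X%04X",
        0xD800 + v // 0x400, 0xDC00 + v % 0x400)
    end
  end
  return "<" .. table.concat(parts) .. ">"
end

print(to_pdf_hex("Привет"))  -- <FEFF041F04400438043204350442>
```

With such a helper, the literal could be assembled as '/Span <</ActualText' .. to_pdf_hex("Привет") .. '>> BDC'.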
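Tagging the paragraph and its contents can be done better by processing the node list recursively, as shown in https://tex.stackexchange.com/a/495230. In the code below this currently happens in the post_linebreak_filter callback, which is the correct callback for inserting the whatsits. However, at that point hyphenation, ligaturing, kerning, and line breaking have already been completed, so the node list carries various by-products of these processes. That is why the output contains stray hyphens and ligatures that are not supposed to be there.
The correct way would be to convert the scanned paragraph in the hyphenate callback, do some bookkeeping using node attributes or similar, and then collect all this information in post_linebreak_filter and place the /ActualText.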
\documentclass{article}
\usepackage{luacode}
\begin{luacode}
local converters = {}
local function convert(n)
  local id = n.id
  local type = node.type(id)
  local typeconv = converters[type]
  if typeconv then
    return typeconv(n) or ""
  else
    texio.write_nl("tag_par warning: no conversion available for " .. type)
    return ""
  end
end
function converters.hlist(n)
  local text = {}
  for n in node.traverse(n.list) do
    text[#text + 1] = convert(n)
  end
  return table.concat(text)
end
function converters.glyph(n)
  return utf.char(n.char)
end
function converters.glue(n)
  -- FIXME: any glue is treated like space
  return " "
end
function converters.kern(n)
  -- FIXME: any kern is just dropped
  return ""
end
function converters.disc(n)
  -- FIXME: does anybody care about discretionaries? Can we even distinguish
  -- user and hyphenation ones?
  local subtype = node.subtypes(n.id)[n.subtype]
  if subtype == "automatic" then
    return convert(n.replace)
  end
  return ""
end
local function tag_par(head, groupcode)
  local text = {}
  for n in node.traverse(head) do
    text[#text + 1] = convert(n)
  end
  local actual_text = table.concat(text)
  actual_text = string.gsub(actual_text, " +", " ") -- collapse consecutive spaces
  actual_text = string.gsub(actual_text, "^%s*(.-)%s*$", "%1") -- trim surrounding spaces
  local BDC = node.new("whatsit", "pdf_literal")
  BDC.data = "/Span <</ActualText(<p>" .. actual_text .. "</p>)>> BDC"
  BDC.mode = 2
  head = node.insert_before(head, head, BDC)
  local EMC = node.new("whatsit", "pdf_literal")
  EMC.data = "EMC"
  EMC.mode = 2
  head = node.insert_after(head, node.tail(head), EMC)
  return head
end
luatexbase.add_to_callback("post_linebreak_filter", tag_par, "tag_par")
\end{luacode}
\begin{document}
\input{lorem.tex}
\input{knuth.tex}
\newpage
\input{ward.tex}
\input{zapf.tex}
\end{document}
For illustration, I have marked the paragraphs with HTML-like tags <p>...</p>, as can be seen in the pdftotext output:
$ pdftotext test.pdf -
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur massa turpis, semper quis fringilla ut, viverra nec risus. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Donec nunc lorem, sollicitudin vel sodales eget, vehicula nec mi. Proin ullamcorper rutrum nibh, at porttitor nunc euismod et. Donec faucibus nisi faucibus ipsum porttitor pharetra. Sed elementum, lectus nec congue imperdiet, ipsum leo viverra nisi, sit amet commodo odio odio id nisl. Fusce sagittis lobortis nisi sed consectetur. Nam egestas, sem ut fermentum convallis, ipsum tellus venenatis augue, eget condimentum risus quam id erat. Sed metus dui, sollicitudin pharetra pellen- tesque sed, placerat eget augue. Mauris sodales pretium tortor vitae rutrum. Proin quam sem, lobortis tincidunt pretium vitae, feugiat eu lacus.</p>
<p>Thus, I came to the conclusion that the designer of a new system must not only be the implementer and strst large||scale user; the designer should also write the strst user manual.</p>
<p>The separation of any of these four components would have hurt TEX signif- icantly. If I had not participated fully in all these activities, literally hundreds of improvements would never have been made, because I would never have thought of them or perceived why they were important.</p>
<p>But a system cannot be successful if it is too strongly in﬇uenced by a single person. Once the initial design is complete and fairly robust, the real test begins as people with many dierent viewpoints undertake their own experiments.</p>
1
<p>The Earth, as a habitat for animal life, is in old age and has a fatal illness. Several, in fact. It would be happening whether humans had ever evolved or not. But our presence is like the eect of an old|-|age patient who smokes many packs of cigarettes per day |=| and we humans are the cigarettes.</p>
<p>Coming back to the use of typefaces in electronic publishing: many of the new typographers receive their knowledge and information about the rules of typography from books, from computer magazines or the instruction manuals which they get with the purchase of a PC or software. There is not so much basic instruction, as of now, as there was in the old days, showing the dierences between good and bad typographic design. Many people are just fascinated by their PCâ•Žs tricks, and think that a widely||praised program, called up on the screen, will make everything automatic from now on.</p>
2
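One caveat about the literal-string form (...) used for /ActualText in tag_par: inside a PDF literal string the backslash is an escape character, and parentheses must be balanced or escaped. A paragraph containing an unmatched parenthesis or a backslash could therefore produce a malformed string. As a minimal sketch, the text could be passed through an escaping step before it is concatenated into BDC.data. The function name escape_pdf_string is hypothetical and not part of the answer's code.

```lua
-- Hypothetical helper (not from the original answer): escape the characters
-- that are special inside a PDF literal string, i.e. backslash and parentheses.
local function escape_pdf_string(text)
  -- "[\\()]" matches one backslash or parenthesis; "\\%0" puts a backslash
  -- in front of whatever was matched. The outer parentheses drop gsub's
  -- second return value (the substitution count).
  return (text:gsub("[\\()]", "\\%0"))
end

print(escape_pdf_string("f(x) and a \\ backslash"))
```

If this is pasted into a TeX file, the luacode* environment or a separate .lua file keeps TeX from interpreting the backslashes. Text outside ASCII may still be better served by a hex string in UTF-16BE, as in the first example of this answer.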