使用 fontspec 的 Node 渲染器时,梵文 UTF-8 文本提取代码不会产生 lua 错误,但由于 Node 渲染器中可能存在错误,因此仍然无法产生正确的结果。我已将其作为单独的问题提出:LuaTeX:天城体字形顺序在 tex 的内部节点列表中被反转,如何在遍历字形节点时恢复正确顺序?
在尝试使用不同的技术从 TeX 框中提取 UTF-8 文本时,我发现使用 fontspec 的 Node 渲染器提取天城文文本时不会产生任何 lua 错误的两种技术,在使用 fontspec 的 HarfBuzz 渲染器(Renderer = Harfbuzz,Renderer=OpenType)时都会产生 lua 错误。
这两种技术的详细信息如下:技术-1(使用micahl-h21 是函数get_unicode
)以及这里:技术-2(仅应用于unicode.utf8.char
复杂字形的组成部分)。我尝试了多种梵文字体,结果都是一样的。
这两种技术的完整测试代码及其各自的错误签名在下面的块中逐一列出。对于我的示例,我使用了可在此处免费获得的 Noto Sans Devanagari(常规粗细):链接到 Google 字体 GitHub 上的 Noto Sans Devanagari
技术-1使用 Devanagari 和 HarfBuzz (如果使用 Node 渲染器编译则没有 lua 错误):
\documentclass{article}
\usepackage[lmargin=0.5in,tmargin=0.5in,rmargin=0.5in,bmargin=0.5in]{geometry}
\usepackage{fontspec}
\usepackage{microtype}
%\newfontscript{Devanagari}{deva,dev2}
\newfontfamily{\devanagarifam}{Noto Sans Devanagari}[Script=Devanagari, Scale=1, Renderer=HarfBuzz]
\begin{document}
% Devanagari text is at the right end of following line
% of code, you might have to scroll right to read it
\setbox0=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence. \devanagarifam एक गांव -- में मोहन नाम का लड़का रहता था। उसके पिताजी एक मामूली मजदूर थे।}
\directlua{
% local fontstyles = require "l4fontstyles"
local char = unicode.utf8.char
local glyph_id = node.id("glyph")
local glue_id = node.id("glue")
local hlist_id = node.id("hlist")
local vlist_id = node.id("vlist")
local disc_id = node.id("disc")
local minglue = tex.sp("0.2em")
local usedcharacters = {}
local identifiers = fonts.hashes.identifiers
local function get_unicode(xchar,font_id)
local current = {}
local uchar = identifiers[font_id].characters[xchar].tounicode
for i= 1, string.len(uchar), 4 do
local cchar = string.sub(uchar, i, i + 3)
print(xchar,uchar,cchar, font_id, i)
table.insert(current,char(tonumber(cchar,16)))
end
return current
end
local function nodeText(n)
local t = {}
for x in node.traverse(n) do
% glyph node
if x.id == glyph_id then
% local currentchar = fonts.hashes.identifiers[x.font].characters[x.char].tounicode
local chars = get_unicode(x.char,x.font)
for _, current_char in ipairs(chars) do
table.insert(t,current_char)
end
% glue node
elseif x.id == glue_id and node.getglue(x) > minglue then
table.insert(t," ")
% discretionaries
elseif x.id == disc_id then
table.insert(t, nodeText(x.replace))
% recursivelly process hlist and vlist nodes
elseif x.id == hlist_id or x.id == vlist_id then
table.insert(t,nodeText(x.head))
end
end
return table.concat(t)
end
local n = tex.getbox(0)
print(nodeText(n.head))
local f = io.open("hello.txt","w")
f:write(nodeText(n.head))
f:close()
}
\box0
\end{document}
技术 1 (HarfBuzz 渲染器) 的错误签名:
[\directlua]:1: bad argument #1 to 'len' (string expected, got nil)
stack traceback:
[C]: in function 'string.len'
[\directlua]:1: in upvalue 'get_unicode'
[\directlua]:1: in local 'nodeText'
[\directlua]:1: in main chunk.
l.62 }
技术-2使用 Devanagari 和 HarfBuzz (如果使用 Node 渲染器编译则没有 lua 错误):
\documentclass{article}
\usepackage[lmargin=0.5in,tmargin=0.5in,rmargin=0.5in,bmargin=0.5in]{geometry}
\usepackage{fontspec}
\usepackage{microtype}
%\newfontscript{Devanagari}{deva,dev2}
\newfontfamily{\devanagarifam}{Noto Sans Devanagari}[Script=Devanagari, Scale=1, Renderer=HarfBuzz]
\begin{document}
% Devanagari text is at the right end of following line
% of code, you might have to scroll right to read it
\setbox0=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence. \devanagarifam एक गांव -- में मोहन नाम का लड़का रहता था। उसके पिताजी एक मामूली मजदूर थे।}
\directlua{
local glyph_id = node.id("glyph")
local disc_id = node.id("disc")
local glue_id = node.id("glue")
local hlist_id = node.id("hlist")
local vlist_id = node.id("vlist")
local minglue = tex.sp("0.2em")
local function nodeText(n)
local t = {}
for x in node.traverse(n) do
% glyph node
if x.id == glyph_id then
if bit32.band(x.subtype,2) \csstring~=0 and unicode.utf8.char(x.char) \csstring~="“" and unicode.utf8.char(x.char) \csstring~="”" then %
for g in node.traverse_id(glyph_id,x.components) do
if bit32.band(g.subtype, 2) \csstring~=0 then
for gc in node.traverse_id(glyph_id,g.components) do
table.insert(t,unicode.utf8.char(gc.char))
end
else
table.insert(t,unicode.utf8.char(g.char))
end
end
else
table.insert(t,unicode.utf8.char(x.char))
end
% disc node
elseif x.id == disc_id then
for g in node.traverse_id(glyph_id,x.replace) do
if bit32.band(g.subtype, 2) \csstring~=0 then
for gc in node.traverse_id(glyph_id,g.components) do
table.insert(t,unicode.utf8.char(gc.char))
end
else
table.insert(t,unicode.utf8.char(g.char))
end
end
% glue node
elseif x.id == glue_id and node.getglue(x) > minglue then
table.insert(t," ")
elseif x.id == hlist_id or x.id == vlist_id then
table.insert(t,nodeText(x.head))
end
end
return table.concat(t)
end
local n = tex.getbox(0)
print(nodeText(n.head))
local f = io.open("hello.txt","w")
f:write(nodeText(n.head))
f:close()
}
\box0
\end{document}
技术 2 (HarfBuzz 渲染器) 的错误签名:
[\directlua]:1: bad argument #1 to 'char' (invalid value)
stack traceback:
[C]: in field 'char'
[\directlua]:1: in local 'nodeText'
[\directlua]:1: in main chunk.
l.64 }
答案1
在节点模式下,通常无法恢复完整文本,因为您得到的是成形输出,而成形字形无法唯一地映射回输入文本。您只能使用 tounicode 值来近似它。这些映射到实际的 PDF 文件 ToUnicode CMap 条目,因此遵循其字形到 Unicode 映射的受限模型:每个字形都相当于一个固定的 unicode 代码点序列。这些映射按渲染顺序连接起来。如您所见,此模型不足以将 Devanagari 字形映射到输入文本。
您可以使用harf
模式来避免此问题:harf
模式不受此有限模型的影响,因为它不仅为您提供了形状的字形列表,而且还创建了 PDF 标记内容 ActualText 条目,这些条目覆盖了无法通过 ToUnicode 正确建模的序列中的 ToUnicode 映射。可以使用属性从 Lua 代码中查询此映射所需的数据glyph_data
。(这是一个未记录的实现细节,将来可能会发生变化)
如果您想从任何文本中提取尽可能多的内容,您可以在 Lua 代码中结合此基于属性的方法和基于 ToUnicode 的方法:
extracttext.lua
使用以下方式创建文件
local type = type
local char = utf8.char
local unpack = table.unpack
local getproperty = node.getproperty
local getfont = font.getfont
local is_glyph = node.is_glyph
-- tounicode id UTF-16 in hex, so we need to handle surrogate pairs...
local utf16hex_to_utf8 do -- Untested, but should more or less work
local l = lpeg
local tonumber = tonumber
local hex = l.R('09', 'af', 'AF')
local byte = hex * hex
local simple = byte * byte / function(s) return char(tonumber(s, 16)) end
local surrogate = l.S'Dd' * l.C(l.R('89', 'AB', 'ab') * byte)
* l.S'Dd' * l.C(l.R('CF', 'cf') * byte) / function(high, low)
return char(0x10000 + ((tonumber(high, 16) & 0x3FF) << 10 | (tonumber(low, 16) & 0x3FF)))
end
utf16hex_to_utf8 = l.Cs((surrogate + simple)^0)
end
-- First the non-harf case
-- Standard caching setup
local identity_table = setmetatable({}, {__index = function(_, id) return char(id) end})
local cached_text = setmetatable({}, {__index = function(t, fid)
local fontdir = getfont(fid)
local characters = fontdir and fontdir.tounicode == 1 and fontdir.characters
local font_cache = characters and setmetatable({}, {__index = function(tt, slot)
local character = characters[slot]
local text = character and character.tounicode or slot
-- At this point we have the tounicode value in text. This can have different forms.
-- The order in the if ... elseif chain is based on how likely it is to encounter them.
-- This is a small performance optimization.
local t = type(text)
if t == 'string' then
text = utf16hex_to_utf8:match(text)
elseif t == 'number' then
text = char(text)
elseif t == 'table' then
text = char(unpack(text)) -- I haven't tested this case, but it should work
end
tt[slot] = text
return text
end}) or identity_table
t[fid] = font_cache
return font_cache
end})
-- Now the tounicode case just has to look up the value
local function from_tounicode(n)
local slot, fid = is_glyph(n)
return cached_text[fid][slot]
end
-- Now the traversing stuff. Nothing interesting to see here except for the
-- glyph case
local traverse = node.traverse
local glyph, glue, disc, hlist, vlist = node.id'glyph', node.id'glue', node.id'disc', node.id'hlist', node.id'vlist'
local extract_text_vlist
-- We could replace i by #t+1 but this should be slightly faster
local function real_extract_text(head, t, i)
for n, id in traverse(head) do
if id == glyph then
-- First handle harf mode: Look for a glyph_info property. If that does not exists
-- use from_tounicode. glyph_info will sometimes/often be an empty string. That's
-- intentional and it should *not* trigger a fallback. The actual mapping will be
-- contained in surrounding chars.
local props = getproperty(n)
t[i] = props and props.glyph_info or from_tounicode(n)
i = i + 1
elseif id == glue then
if n.width > 1001 then -- 1001 is arbitrary but sufficiently high to be bigger than most weird glue uses
t[i] = ' '
i = i + 1
end
elseif id == disc then
i = real_extract_text(n.replace, t, i)
elseif id == hlist then
i = real_extract_text(n.head, t, i)
elseif id == vlist then
i = extract_text_vlist(n.head, t, i)
end
end
return i
end
function extract_text_vlist(head, t, i) -- glue should not become a space here
for n, id in traverse(head) do
if id == hlist then
i = real_extract_text(n.head, t, i)
elseif id == vlist then
i = extract_text_vlist(n.head, t, i)
end
end
return i
end
return function(list)
local t = {}
real_extract_text(list.head, t, 1)
return table.concat(t)
end
这可以用作普通的 Lua 模块:
\documentclass{article}
\usepackage{fontspec}
\newfontfamily{\devharf}{Noto Sans Devanagari}[Script=Devanagari, Renderer=HarfBuzz]
\newfontfamily{\devnode}{Noto Sans Devanagari}[Script=Devanagari, Renderer=Node]
\begin{document}
% Devanagari text is at the right end of following line
% of code, you might have to scroll right to read it
\setbox0=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence. \devharf एक गांव -- में मोहन नाम का लड़का रहता था। उसके पिताजी एक मामूली मजदूर थे।}
\setbox1=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence. \devnode एक गांव -- में मोहन नाम का लड़का रहता था। उसके पिताजी एक मामूली मजदूर थे।}
\directlua{
local extracttext = require'extracttext'
local f = io.open("hello.harf.txt","w") % Can reproduce the full input text
f:write(extracttext(tex.getbox(0)))
f:close()
f = io.open("hello.node.txt","w") % In node mode, we only get an approximation
f:write(extracttext(tex.getbox(1)))
f:close()
}
\box0
\box1
\end{document}
更一般的说明:如您所见,从形状列表中获取文本时需要做一些工作,尤其是在 ToUnicode 情况下,我们必须映射代理对等。这主要是因为形状文本是不是旨在用于此类用途。一旦字形节点受到保护(又称 subtype(n) >= 256 或not is_char(n)
为true
),.char
条目就不再包含 Unicode 值,而是内部标识符,.font
条目可能不再是您期望的值,并且某些字形可能根本不表示为字形。在大多数情况下,如果您想要实际访问框后面的文本而不仅仅是文本的视觉显示,您确实希望拦截列表前它首先被成形。
答案2
我不太了解 Luaotfload 如何处理 HarfBuzz 字体,但tounicode
多亏了 ,我找到了获取字段的方法table.serialize
。所以我为 Harfbuzz 改编的原始代码如下所示:
\documentclass{article}
\usepackage[lmargin=0.5in,tmargin=0.5in,rmargin=0.5in,bmargin=0.5in]{geometry}
\usepackage{fontspec}
\usepackage{microtype}
\usepackage{luacode}
%\newfontscript{Devanagari}{deva,dev2}
\newfontfamily{\devanagarifam}{Noto Sans Devanagari}[Script=Devanagari, Scale=1, Renderer=HarfBuzz]
\newfontfamily{\arabicfam}{Amiri}[Script=Arabic, Scale=1, Renderer=HarfBuzz]
\begin{document}
\setbox0=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence. \devanagarifam एक गांव -- में मोहन नाम का लड़का रहता था। उसके पिताजी एक मामूली मजदूर थे।}
\setbox1=\hbox{\arabicfam \textdir TRT هذه المقالة عن براغ. لتصفح عناوين مشابهة، انظر براغ (توضيح).}
\begin{luacode*}
-- local fontstyles = require "l4fontstyles"
local char = unicode.utf8.char
local glyph_id = node.id("glyph")
local glue_id = node.id("glue")
local hlist_id = node.id("hlist")
local vlist_id = node.id("vlist")
local disc_id = node.id("disc")
local minglue = tex.sp("0.2em")
local usedcharacters = {}
local identifiers = fonts.hashes.identifiers
local fontcache = {}
local function to_unicode_chars(uchar)
local uchar = uchar or ""
-- put characters into a table
local current = {}
-- each codepoint is 4 bytes long, we loop over tounicode entry and cut it into 4 bytes chunks
for i= 1, string.len(uchar), 4 do
local cchar = string.sub(uchar, i, i + 3)
-- codepoint is hex string, we need to convert it to number ad then to UTF8 char
table.insert(current,char(tonumber(cchar,16)))
end
return current
end
-- cache character lookup, to speed up things
local function get_character_from_cache(xchar, font_id)
local current_font = fontcache[font_id] or {characters = {}}
fontcache[font_id] = current_font -- initialize font cache for the current font if it doesn't exist
return current_font.characters[xchar]
end
-- save characters to cache for faster lookup
local function save_character_to_cache(xchar, font_id, replace)
fontcache[font_id][xchar] = replace
-- return value
return replace
end
local function initialize_harfbuzz_cache(font_id, hb)
-- save some harfbuzz tables for faster lookup
local current_font = fontcache[font_id]
-- the unicode data can be in two places
-- 1. hb.shared.glyphs[glyphid].backmap
current_font.glyphs = current_font.glyphs or hb.shared.glyphs
-- 2. hb.shared.unicodes
-- it contains mapping between Unicode and glyph id
-- we must create new table that contains reverse mapping
if not current_font.backmap then
current_font.backmap = {}
for k,v in pairs(hb.shared.unicodes) do
current_font.backmap[v] = k
end
end
-- save it back to the font cache
fontcache[font_id] = current_font
return current_font.glyphs, current_font.backmap
end
local function get_unicode(xchar,font_id)
-- try to load character from cache first
local current_char = get_character_from_cache(xchar, font_id)
if current_char then return current_char end
-- get tounicode for non HarfBuzz fonts
local characters = identifiers[font_id].characters
local uchar = characters[xchar].tounicode
-- stop processing if tounicode exists
if uchar then return save_character_to_cache(xchar, font_id, to_unicode_chars(uchar)) end
-- detect if font is processed by Harfbuzz
local hb = identifiers[font_id].hb
-- try HarfBuzz data
if not uchar and hb then
-- get glyph index of the character
local index = characters[xchar].index
-- load HarfBuzz tables from cache
local glyphs, backmap = initialize_harfbuzz_cache(font_id, hb)
-- get tounicode field from HarfBuzz glyph info
local tounicode = glyphs[index].tounicode
if tounicode then
return save_character_to_cache(xchar, font_id, to_unicode_chars(tounicode))
end
-- if this fails, try backmap, which contains mapping between glyph index and Unicode
local backuni = backmap[index]
if backuni then
return save_character_to_cache(xchar, font_id, {char(backuni)})
end
-- if this fails too, discard this character
return save_character_to_cache(xchar, font_id, {})
end
-- return just the original char if everything else fails
return save_character_to_cache(xchar, font_id, {char(xchar)})
end
local function nodeText(n)
-- output buffer
local t = {}
for x in node.traverse(n) do
-- glyph node
if x.id == glyph_id then
-- get table with characters for current node.char
local chars = get_unicode(x.char,x.font)
for _, current_char in ipairs(chars) do
-- save characters to the output buffer
table.insert(t,current_char)
end
-- glue node
elseif x.id == glue_id and node.getglue(x) > minglue then
table.insert(t," ")
-- discretionaries
elseif x.id == disc_id then
table.insert(t, nodeText(x.replace))
-- recursivelly process hlist and vlist nodes
elseif x.id == hlist_id or x.id == vlist_id then
table.insert(t,nodeText(x.head))
end
end
return table.concat(t)
end
local n = tex.getbox(0)
local n1 = tex.getbox(1)
print(nodeText(n.head))
local f = io.open("hello.txt","w")
f:write(nodeText(n.head))
f:write(nodeText(n1.head))
f:close()
\end{luacode*}
\box0
\box1
\end{document}
我还添加了来自维基百科。以下是 的内容hello.txt
:
Příliš žluťoučký kůň úpěl ďábelské ódy difference diffierence. एक गांव -- में मोहन नाम का लड़का रहता था। उसके पताजी एक मामूली मजदूर थे।هذه المقالة عن براغ. لتصفح عناوين مشابهة، انظر براغ (توضيح).
这两个重要功能是
local function to_unicode_chars(uchar)
local uchar = uchar or ""
local current = {}
for i= 1, string.len(uchar), 4 do
local cchar = string.sub(uchar, i, i + 3)
table.insert(current,char(tonumber(cchar,16)))
end
return current
end
to_unicode_chars
函数将 to_unicode 条目拆分为四字节块,然后将其转换为 UTF 8 字符。它还可以处理没有条目的字形tounicode
,在这种情况下它只返回空字符串。
local function get_unicode(xchar,font_id)
-- try to load character from cache first
local current_char = get_character_from_cache(xchar, font_id)
if current_char then return current_char end
-- get tounicode for non HarfBuzz fonts
local characters = identifiers[font_id].characters
local uchar = characters[xchar].tounicode
-- stop processing if tounicode exists
if uchar then return save_character_to_cache(xchar, font_id, to_unicode_chars(uchar)) end
-- detect if font is processed by Harfbuzz
local hb = identifiers[font_id].hb
-- try HarfBuzz data
if not uchar and hb then
-- get glyph index of the character
local index = characters[xchar].index
-- load HarfBuzz tables from cache
local glyphs, backmap = initialize_harfbuzz_cache(font_id, hb)
-- get tounicode field from HarfBuzz glyph info
local tounicode = glyphs[index].tounicode
if tounicode then
return save_character_to_cache(xchar, font_id, to_unicode_chars(tounicode))
end
-- if this fails, try backmap, which contains mapping between glyph index and Unicode
local backuni = backmap[index]
if backuni then
return save_character_to_cache(xchar, font_id, {char(backuni)})
end
-- if this fails too, discard this character
return save_character_to_cache(xchar, font_id, {})
end
-- return just the original char if everything else fails
return save_character_to_cache(xchar, font_id, {char(xchar)})
end
此函数首先尝试从当前字体信息加载 Uniocode 数据。如果失败,它会尝试在 Harfbuzz 表中查找。大多数字符在表tounicode
中都有映射glyphs
。如果不可用,它会尝试unicodes
包含字形索引和 Unicode 之间映射的表。如果这都失败了,那么我们会丢弃这个字符。