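Background
Every chapter title contains several words, but only the first two words should be used elsewhere (for example, in the page header).
Problem
I am looking for a general tokenization solution that splits text into word tokens and then selects a contiguous subset of those tokens. For example: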
\define\ChapterQuote{Genius is one percent inspiration and ninety-nine percent perspiration.}
\starttext
% Output: one percent inspiration
\splittext[3,5]{\ChapterQuote}
% Output: Genius is
\splittext[2]{\ChapterQuote}
% Output: Genius is one percent
\splittext[n-5]{\ChapterQuote}
\stoptext
Or perhaps something like this:
% 2 is number of words to keep (n-1 keeps all but last)
% boundary means to use the language's natural word break
% strip trims punctuation characters from each word
% striplast trims punctuation from only the last word
\splittext[2][
  boundary=en,
  strip={,},
  striplast=\punctuation,
]{\namedstructurevariable{chapter}{title}}
Code
The following code is a working example that keeps the first two words of a phrase, but it is not robust:
% Counts the number of words processed.
\definenumber[TextWordCount][]
\setnumber[TextWordCount][0]
% Process only the first two words within some text.
%
% #1 - A word in the text being processed.
\def\processword#1{%
  % Output only two words.
  \ifnum\rawcountervalue[TextWordCount]<3#1\fi%
  \incrementnumber[TextWordCount]%
  \nospace
}
% Resets the word count when processing some text.
%
% #1 - Text to process.
\define[1]\TextProcessWords{%
  \setnumber[TextWordCount][0]%
  {\bf\processwords{#1}}%
}
\starttext
\chapter{Mr. Hyde (before the transformation)}
\input knuth
\section{section a}
\input knuth
\subsection{subsection a}
\input knuth
\TextProcessWords{\namedstructurevariable{chapter}{title}}
\chapter{Dr. Jekyll (after the transformation)}
\input knuth
\section{section b}
\input knuth
\subsection{subsection b}
\TextProcessWords{\namedstructurevariable{chapter}{title}}
\stoptext
The output of the example document shows that the chapter titles were truncated successfully.
Question
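What is a cleaner way to extract the first N words from a text? (Note that \limitatetext and similar macros work on the text width rather than on word tokens.)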
Answer 1
I would do this in Lua. Splitting a string into words is relatively easy:
\startluacode
local split_word = lpeg.tsplitat(lpeg.patterns.space)
local str = "Genius is one percent inspiration and ninety-nine percent perspiration."
local words = lpeg.match(split_word, str)
table.print(words)
\stopluacode
which prints
t={
"Genius",
"is",
"one",
"percent",
"inspiration",
"and",
"ninety-nine",
"percent",
"perspiration.",
}
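The rest is just a matter of creating an interface. I find the interface you proposed too confusing, so I would simplify it. You have to pass two arguments, specifying the first and the last word. If the last one is negative, the words are counted from the end: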
\startluacode
local split_word = lpeg.tsplitat(lpeg.patterns.space)
local lpegmatch  = lpeg.match

local splittext = function(first, last, str)
    local words  = lpegmatch(split_word, str)
    local length = #words
    if first < 1 then first = 1 end
    if last  < 0 then last = length + last end
    local t = { }
    for i = first, last do
        t[i - first + 1] = words[i]
    end
    return t
end

local str = "Genius is one percent inspiration and ninety-nine percent perspiration."
table.print(splittext(3,5, str))
table.print(splittext(1,3, str))
table.print(splittext(1,-5, str))
\stopluacode
This gives
t={
"one",
"percent",
"inspiration",
}
t={
"Genius",
"is",
"one",
}
t={
"Genius",
"is",
"one",
"percent",
}
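The question also asked about trimming punctuation from the last word, which the function above does not do. As a rough sketch only, here is the same range selection written in plain Lua (no ConTeXt helpers), with a small trimming step added. The names split_words, select_words and strip_last are hypothetical and not part of ConTeXt. Unlike lpeg.tsplitat, the "%S+" pattern collapses runs of whitespace, and %p only covers ASCII punctuation, so treat this as a starting point rather than a finished solution.

```lua
-- Plain-Lua sketch. split_words, select_words and strip_last are
-- hypothetical helper names, not ConTeXt functions.

-- Split a string into words on runs of whitespace.
local function split_words(str)
    local words = { }
    for word in string.gmatch(str, "%S+") do
        words[#words + 1] = word
    end
    return words
end

-- Return words first..last joined by single spaces. A negative last
-- counts from the end, as in the splittext function above; a last
-- beyond the final word is clamped to the word count.
local function select_words(first, last, str)
    local words = split_words(str)
    if first < 1 then first = 1 end
    if last < 0 then last = #words + last end
    if last > #words then last = #words end
    return table.concat(words, " ", first, last)
end

-- Remove trailing ASCII punctuation from the end of the text,
-- that is, from the last selected word only.
local function strip_last(text)
    return (string.gsub(text, "%p+$", ""))
end

local str = "Genius is one percent inspiration and ninety-nine percent perspiration."
print(select_words(3, 5, str))              -- one percent inspiration
print(select_words(1, -5, str))             -- Genius is one percent
print(strip_last(select_words(8, 20, str))) -- percent perspiration
```

Clamping last to the number of words means that an out-of-range request returns whatever words are available instead of failing, and an empty range yields an empty string.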
Now we add an interface to TeX:
\startluacode
local split_word = lpeg.tsplitat(lpeg.patterns.space)
local lpegmatch  = lpeg.match

local splittext = function(first, last, str)
    local words  = lpegmatch(split_word, str)
    local length = #words
    if first < 1 then first = 1 end
    if last  < 0 then last = length + last end
    local t = { }
    for i = first, last do
        t[i - first + 1] = words[i]
    end
    return table.concat(t, " ")
end

interfaces.implement {
    name      = "splittext",
    actions   = { splittext, context },
    arguments = { "integer", "integer", "argument" },
}
\stopluacode
\unprotect
\permanent\tolerant\protected\def\splittext[#1,#2]#3%
{\clf_splittext #1 #2 {#3}\relax}
\protect
\starttext
\defineexpandable\ChapterQuote{Genius is one percent inspiration and ninety-nine percent perspiration.}
\splittext[3,5]{\ChapterQuote}
\splittext[1,3]{\ChapterQuote}
\splittext[1,-5]{\ChapterQuote}
\stoptext
This gives: