Background

Every chapter title contains several words, but only the first two should be reused elsewhere (for example, in the page header).

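Problem

I'm looking for a general tokenization solution that splits text into word tokens and then selects a contiguous subset of those tokens. For example: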

\define\ChapterQuote{Genius is one percent inspiration and ninety-nine percent perspiration.}

\starttext
  % Output: one percent inspiration
  \splittext[3,5]{\ChapterQuote}

  % Output: Genius is
  \splittext[2]{\ChapterQuote}

  % Output: Genius is one percent 
  \splittext[n-5]{\ChapterQuote}
\stoptext

Or perhaps something like this:

% 2 is number of words to keep (n-1 keeps all but last)
% boundary means to use the language's natural word break
% strip trims punctuation characters from each word
% striplast trims punctuation from only the last word
\splittext[2][
  boundary=en,
  strip={,},
  striplast=\punctuation,
]{\namedstructurevariable{chapter}{title}}

Code

The following code is a working example of keeping the first two words of a phrase, but it is not robust:

% Counts the number of words processed.
\definenumber[TextWordCount][]
\setnumber[TextWordCount][0]

% Process only the first two words within some text.
%
% #1 - A word in the text being processed.
\def\processword#1{%
  % Output only two words.
  \ifnum\rawcountervalue[TextWordCount]<3#1\fi%
  \incrementnumber[TextWordCount]%
  \nospace
}

% Resets the word count when processing some text.
%
% #1 - Text to process.
\define[1]\TextProcessWords{%
  \setnumber[TextWordCount][0]%
  {\bf\processwords{#1}}%
}

\starttext
  \chapter{Mr. Hyde (before the transformation)}
  \input knuth
  \section{section a}
  \input knuth
  \subsection{subsection a}
  \input knuth

  \TextProcessWords{\namedstructurevariable{chapter}{title}}

  \chapter{Dr. Jekyll (after the transformation)}
  \input knuth
  \section{section b}
  \input knuth
  \subsection{subsection b}

  \TextProcessWords{\namedstructurevariable{chapter}{title}}
\stoptext

The example output document shows that the chapter titles are truncated successfully:

(image: example output)

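Question

What is a cleaner way to extract the first N words from some text? (Note that \limitatetext and similar commands work on the width of the text rather than on word tokens.)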

Answer 1

I would do this in Lua. Splitting a string into words is relatively easy:

\startluacode
  local split_word = lpeg.tsplitat(lpeg.patterns.space)
  local str = "Genius is one percent inspiration and ninety-nine percent perspiration."
  local words = lpeg.match(split_word, str)
  table.print(words)
\stopluacode

This prints:

t={
 "Genius",
 "is",
 "one",
 "percent",
 "inspiration",
 "and",
 "ninety-nine",
 "percent",
 "perspiration.",
}

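The rest is just a matter of creating an interface. I find your proposed interface too confusing, so I will simplify it. You pass two arguments that specify the first and the last word. If the last one is negative, the words are counted from the end: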

\startluacode
  -- Pattern that splits a string at spaces and collects the pieces in a table.
  local split_word = lpeg.tsplitat(lpeg.patterns.space)
  local lpegmatch  = lpeg.match

  -- Return words first..last of str as a table; a negative last counts from the end.
  local splittext = function(first, last, str)
      local words  = lpegmatch(split_word, str)
      local length = #words
      if first < 1 then first = 1 end
      if last < 0  then last = length + last end
      local t = { }
      for i = first, last do
          t[i - first + 1] = words[i]
      end
      return t
  end

  local str = "Genius is one percent inspiration and ninety-nine percent perspiration."
  table.print(splittext(3,5, str))
  table.print(splittext(1,3, str))
  table.print(splittext(1,-5, str))
\stopluacode

This gives:

t={
 "one",
 "percent",
 "inspiration",
}
t={
 "Genius",
 "is",
 "one",
}
t={
 "Genius",
 "is",
 "one",
 "percent",
}
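
To try the selection logic outside ConTeXt, the same idea can be sketched in plain Lua without the lpeg helpers. This is only an illustration under a few assumptions: the name select_words is hypothetical, runs of whitespace are treated as a single separator, and an upper bound is added so that a last index beyond the end of the text is clamped.

```lua
-- Hypothetical stand-alone helper (not part of ConTeXt): return words
-- first..last of str joined by single spaces. A negative last counts
-- from the end, as in the splittext function above.
local function select_words(str, first, last)
  -- Collect every run of non-whitespace characters as one word.
  local words = {}
  for word in str:gmatch("%S+") do
    words[#words + 1] = word
  end

  local length = #words
  if first < 1 then first = 1 end
  if last < 0 then last = length + last end
  if last > length then last = length end  -- assumption: clamp instead of failing

  local selected = {}
  for i = first, last do
    selected[#selected + 1] = words[i]
  end
  return table.concat(selected, " ")
end

local quote = "Genius is one percent inspiration and ninety-nine percent perspiration."
print(select_words(quote, 3, 5))   --> one percent inspiration
print(select_words(quote, 1, -5))  --> Genius is one percent
print(select_words(quote, 8, 20))  --> percent perspiration.
```

An empty range, such as a first index larger than the last one, simply yields an empty string.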

Now we add an interface to TeX:

\startluacode
  local split_word = lpeg.tsplitat(lpeg.patterns.space)
  local lpegmatch  = lpeg.match

  -- Select words first..last of str and return them as one string,
  -- so that the result can be handed to context().
  local splittext = function(first, last, str)
      local words  = lpegmatch(split_word, str)
      local length = #words
      if first < 1 then first = 1 end
      if last < 0  then last = length + last end
      local t = { }
      for i = first, last do
          t[i - first + 1] = words[i]
      end
      return table.concat(t, " ")
  end

  -- Make the function available on the TeX side as \clf_splittext.
  interfaces.implement {
     name      = "splittext",
     actions   = { splittext, context },
     arguments = { "integer", "integer", "argument" },
  }
\stopluacode

\unprotect
\permanent\tolerant\protected\def\splittext[#1,#2]#3%
   {\clf_splittext #1 #2 {#3}\relax}
\protect

\starttext
\defineexpandable\ChapterQuote{Genius is one percent inspiration and ninety-nine percent perspiration.}
\splittext[3,5]{\ChapterQuote}

\splittext[1,3]{\ChapterQuote}

\splittext[1,-5]{\ChapterQuote}
\stoptext

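This typesets the same three selections as the Lua test above, one per paragraph: "one percent inspiration", "Genius is one", and "Genius is one percent".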
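
The simplified interface leaves out the striplast idea from the question, that is, trimming punctuation from the last selected word. A minimal sketch of how that could be done in plain Lua follows; strip_last is a hypothetical name, and the %p character class only covers what the C library considers punctuation, which for typical setups means ASCII punctuation.

```lua
-- Hypothetical helper (not part of the answer's interface): remove
-- trailing punctuation from the end of an already selected phrase.
local function strip_last(text)
  -- %p+ matches one or more punctuation characters; $ anchors at the end.
  -- gsub returns two values, so keep only the resulting string.
  local stripped = text:gsub("%p+$", "")
  return stripped
end

print(strip_last("percent perspiration."))  --> percent perspiration
print(strip_last("Genius is"))              --> Genius is
```

Applied to the string returned by splittext, before it is passed on to context, this would trim only the final word; punctuation inside the phrase, such as the period in "Dr. Jekyll", is left alone.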
