More efficient string extraction

Question

You can try this code:

\let\ea=\expandafter
\def\scanmacro#1{%
   \bgroup \settocomma { };:."?!@+=\{\}\relax
   \lowercase\ea{\ea\gdef\ea#1\ea{#1}}%
   \edef#1{\detokenize\ea{#1}}%
%  \message{\string#1: \meaning#1} % prints the modified format of the scanned macro
   \ea\egroup
   \ea\wordscan#1,\relax,%
}
\def\settocomma #1{\ifx\relax#1\else \lccode`#1=`, \ea\settocomma\fi}
\def\wordscan#1,{\ifx\relax#1\empty\else 
%  \message{{#1}}  % prints each scanned "word"
   \ifcsname doword:#1\endcsname \csname doword:#1\endcsname \fi
   \ea\wordscan\fi
}
\def\regword#1#2{\ea\gdef\csname doword:\string#1\endcsname{#2}}

\regword {grapes}  {\message{I like grapes.}}
\regword {find}    {\message{We are searching somewhat.}}

\def\mystring{In the {market} you can find some grapes, apples? and grapefruits.}

\scanmacro\mystring % runs \message{We are searching somewhat.}
                    % and \message{I like grapes.}

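We use \lowercase to replace every non-letter character listed after \settocomma with a comma (it also lowercases the letters), and we use \detokenize to reset the catcodes of these commas to that of a "normal comma". So the macro

In the {market} you can find some grapes, apples? and grapefruits.

is modified as follows:

in,the,,market,,you,can,find,some,grapes,,apples,,and,grapefruits,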
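The remapping trick can be seen in isolation in the following minimal sketch. It assumes plain TeX with its default \lccode table; the macro name \demo and the choice of "?" as the only remapped character are made up for illustration.

```tex
% Minimal sketch (plain TeX): remap one character via \lccode, then \lowercase.
\begingroup
  \lccode`\?=`\, % assumption for this demo: only "?" is mapped to a comma
  \lowercase{\endgroup \def\demo{a?B?c}}
\message{\meaning\demo}% should print: macro:->a,b,c
\bye
```

The \endgroup inside the \lowercase argument restores the \lccode table right after the conversion, so the remapping does not leak into the rest of the document.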

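Then we scan this macro with \wordscan, whose parameter #1 is delimited by a comma, and process each scanned word separately. Note that there are several "empty words". This is not a problem because the empty word is not registered. Removing the occurrences of ,, before running \wordscan would only add more useless computing time.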
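To see how the empty words pass through the scanner, here is a small sketch that runs a simplified \wordscan on a literal comma list. The lookup of registered words is replaced by a \message that prints each word in brackets; that substitution is an assumption made for the demonstration, not part of the code above.

```tex
\let\ea=\expandafter
% Simplified scanner: print each comma-delimited word instead of looking it up.
\def\wordscan#1,{\ifx\relax#1\empty\else
   \message{[#1]}%
   \ea\wordscan\fi
}
\wordscan in,the,,market,\relax,
% should print [in] [the] [] [market]; the empty word between the two
% adjacent commas is scanned like any other word
\bye
```

The scan stops at the \relax sentinel; an empty #1 simply fails the \ifx test and, in the real code, fails the \ifcsname test as well.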

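You must write all non-letter characters that you expect to occur in the scanned macro into the character list after \settocomma, terminated by \relax. Note that the first { } means the space and the final \{\} mean { and }, so these are replaced by commas too.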

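Only control sequences inside the scanned macro are left unresolved by this code. We assume that there are none. If that is not the case, then you must add a second

\edef#1{\detokenize\ea{#1}}%

just before \lowercase. You can decide whether \word should be interpreted as word (add \\ to the "to comma" character list) or should be ignored (do not add \\ to the "to comma" list). In the second case, you can register \word as something different from word.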
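As a hedged sketch of the second case (backslash not in the "to comma" list), the variant below adds the extra \edef before \lowercase and registers a control sequence as its own word. The name \fruit and the message texts are hypothetical. A lowercase name is chosen on purpose, because \lowercase would also turn the characters of a name like \TeX into \tex. An e-TeX-enabled engine such as pdftex is assumed, since \detokenize and \ifcsname are e-TeX primitives.

```tex
\let\ea=\expandafter
\def\scanmacro#1{%
   \bgroup \settocomma { };:."?!@+=\{\}\relax
   \edef#1{\detokenize\ea{#1}}%  added: control sequences become plain characters
   \lowercase\ea{\ea\gdef\ea#1\ea{#1}}%
   \edef#1{\detokenize\ea{#1}}%
   \ea\egroup
   \ea\wordscan#1,\relax,%
}
\def\settocomma #1{\ifx\relax#1\else \lccode`#1=`, \ea\settocomma\fi}
\def\wordscan#1,{\ifx\relax#1\empty\else
   \ifcsname doword:#1\endcsname \csname doword:#1\endcsname \fi
   \ea\wordscan\fi
}
\def\regword#1#2{\ea\gdef\csname doword:\string#1\endcsname{#2}}

\regword {grapes} {\message{I like grapes.}}
\regword {\fruit} {\message{Found the control sequence.}}% \fruit is hypothetical

\def\mystring{We sell \fruit{} and grapes.}
\scanmacro\mystring % should run both registered messages
\bye
```

Here \fruit never needs a definition: it is detokenized into the characters \fruit before anything tries to execute it, and \string in \regword produces the same characters for the lookup name.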

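Edit

Because of your comment about preserving uppercase letters, I created another approach that does not use \lowercase but runs a macro over each token in order to replace non-letter characters with commas. The advantage of this approach is that you neither need to run a macro over a list of "other characters" (which can be very large) nor over a list of all uppercase letters (which can also be very large in the Unicode set). The disadvantage is that processing each token with a macro is probably less efficient than \lowercase.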

\let\ea=\expandafter
\def\scanmacro#1{%
   \bgroup 
   \edef#1{\detokenize\ea{#1}}%
   \edef#1{\ea\replspace#1 \relax}%  replaces spaces with commas
   \edef#1{\ea\replothers#1\relax}%  replaces other non-letter characters with commas
%   \message{\string#1: \meaning#1} % prints the modified format
   \ea\egroup \ea\wordscan#1,\relax,%
}
\def\replspace#1 #2{#1\ifx#2\relax \else ,#2\ea\replspace\fi}
\def\replothers#1{\ifx#1\relax\else \ifnum\lccode`#1=0 ,\else #1\fi \ea\replothers\fi}
\def\wordscan#1,{\ifx\relax#1\empty\else 
%  \message{{#1}}  % prints each scanned "word"
   \ifcsname doword:#1\endcsname \csname doword:#1\endcsname \fi
   \ea\wordscan\fi
}
\def\regword#1#2{\ea\gdef\csname doword:\string#1\endcsname{#2}}

\regword {grapes}  {\message{I like grapes}}
\regword {find}    {\message{We are searching somewhat}}

\def\mystring{In the {market} you can find some grapes, apples? and grapefruits.}

\scanmacro\mystring % runs \message{We are searching somewhat}
                    % and \message{I like grapes}

\bye

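The main concept is the same: we replace spaces and non-letter characters with commas and run \wordscan. We recognize non-letter characters as those whose \lccode is equal to zero.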
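The \lccode test can be tried on its own with the following minimal sketch. It assumes plain TeX's default \lccode table, in which only letters have a nonzero \lccode; the macro name \demo and the sample string are made up for illustration.

```tex
\let\ea=\expandafter
% Same \replothers as above: a token whose \lccode is zero becomes a comma.
\def\replothers#1{\ifx#1\relax\else \ifnum\lccode`#1=0 ,\else #1\fi \ea\replothers\fi}
\edef\demo{\replothers Ab?c1.D\relax}
\message{\meaning\demo}% should print: macro:->Ab,c,,D
\bye
```

Uppercase letters survive unchanged because their \lccode is nonzero, while the digit and the punctuation characters are turned into commas.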

Answer 1