More efficient string extraction

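I want to execute some code when certain words are present in a LaTeX3 string. I came up with my own implementation, which basically loops over the string with \str_map_inline and accumulates the current word with \str_put_right, but it turns out to be 500 times slower than I expected (compared with \str_if_in:NnTF, which performs roughly the same number of operations). This makes my whole library 20% slower because of one tiny operation. Any idea what I am doing wrong?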

MWE:

\documentclass{article}
\usepackage{l3benchmark}
\begin{document}

Test

\ExplSyntaxOn

%%%%%%%%%%%%%% Library to make more efficient

% \__robExt_auto_forward_words:NN \commandToRunOnEachWord \stringToSearchOn
\cs_set:Nn \__robExt_auto_forward_words:NN {
  % \l_tmpa_str will contain the current word read so far
  \str_set:Nn \l_tmpa_str {}%
  \str_map_inline:Nn #2 {
    % \token_case_charcode:NnTF ##1 {} {} {}
    \__robExt_if_letter:nTF {##1} {
      \str_put_right:Nn \l_tmpa_str {##1}
    }{
      \str_if_empty:NTF \l_tmpa_str { } {
        % if the string is not empty, we run the command on the string
        #1 \l_tmpa_str%
        \str_set:Nn \l_tmpa_str {}% we reset its value
      }
    }
  }
}

%% \__robExt_if_letter:nTF {char} {true} {false} tests if an element is a letter
%% https://tex.stackexchange.com/a/700864/116348
\prg_new_conditional:Npnn \__robExt_if_letter:n #1 { TF }
{
  \bool_lazy_or:nnTF
  {
    \bool_lazy_and_p:nn
    { \int_compare_p:nNn { `#1 } > { `a - 1 } }
    { \int_compare_p:nNn { `#1 } < { `z + 1 } }
  }
  {
    \bool_lazy_and_p:nn
    { \int_compare_p:nNn { `#1 } > { `A - 1 } }
    { \int_compare_p:nNn { `#1 } < { `Z + 1 } }
  }
  \prg_return_true:
  \prg_return_false:
}

% \robExt_register_match_word {namespace that defaults to empty} {word} {code to run if word is present} 
\cs_set:Nn \robExt_register_match_word:nnn {
  \cs_set:cn {l__robExt_execute_if_word_present_#1_#2:} {#3}
}

% \robExt_try_to_execute_if_match_word:nn {namespace} {word}
\cs_set:Nn \robExt_try_to_execute_if_match_word:nn {
  \cs_if_exist:cTF {l__robExt_execute_if_word_present_#1_#2:} {%
    \cs_if_exist:cTF {l__robExt_execute_if_word_present_#1_#2__already_forwarded:}{\message{Already forwarded}}{
      \use:c {l__robExt_execute_if_word_present_#1_#2:}%
      % define it so that we do not import twice next time
      \cs_set:cx {l__robExt_execute_if_word_present_#1_#2__already_forwarded:} {}
    }
  } { }
}
\cs_generate_variant:Nn \robExt_try_to_execute_if_match_word:nn { nV }

%%%%%%%%%%%%%% Usage

\robExt_register_match_word:nnn {} {grapes} {I~like~grapes.\\}
\robExt_register_match_word:nnn {} {grapefruits} {I~hate~grapefruits.\\}

%% This string is already created for other reasons, so you can safely assume it exists
\str_new:N \l_my_str
\str_set:Nn \l_my_str {In~the~market~you~can~find~some~grapes~and~grapefruits.}

My~string~is~''\l_my_str''.\newline

\NewDocumentCommand{\testAutoForward}{}{
  \cs_set:Nn \__robExt_tmp_fct:N {
    \message{I will try to run ##1}
    \robExt_try_to_execute_if_match_word:nV {} ##1
  }
  \__robExt_auto_forward_words:NN \__robExt_tmp_fct:N \l_my_str
}

\cs_new:Nn \robExt_benchmark_me:n {
  \benchmark:n {#1}
  Number~of~operations~taken~by:\par\texttt{\detokenize{#1}}\par~is~\fp_to_scientific:N\g_benchmark_ops_fp.
  Time~taken~by:\par\texttt{\detokenize{#1}}\par is~\fp_to_scientific:N\g_benchmark_time_fp.
}
\fp_new:N \l_robExt_fp
\fp_set_eq:NN \l_robExt_fp \g_benchmark_time_fp
\robExt_benchmark_me:n {\testAutoForward}

\par Second test (reference time I'd like to reach):\par
\robExt_benchmark_me:n {
  \str_if_in:NnTF \l_my_str {grapes}{%
    % Not sure why I cannot print this without getting "TeX capacity exceeded", I guess because it repeats it a lot?
    % I~like~grapes.
  }{}
  \str_if_in:NnTF \l_my_str {grapefruits}{}{}
}

% Not sure why this prints "ERROR: Use of \??? doesn't match its definition."
% The~reference~implementation~is~\fp_eval:n{(\g_benchmark_time_fp) / (\l_robExt_fp)}~times~faster.
\ExplSyntaxOff

\end{document}

Edit

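To answer the comments more precisely: I have a string \mystring (a LaTeX3 string, so I think every character should have catcode other or space), and I want to extract all words ([a-zA-Z]+) in order to run the code that may have been registered for the corresponding word via \registerWord{myWord}{some code}. So if \mystring contains: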

In the market you can find some grapes, apples, and grapefruits.

and I run \registerWord{grapes}{\message{I like grapes}}, then running \extractAndExecuteWords \mystring should run \message{I like grapes}.

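My first attempt used plain LaTeX, but it has several problems: the spaces in the string are removed, and I could not find out how to insert braces into the macro, so I inserted \bgroup and \egroup instead, which is not equivalent (and the approach from "How can I add a single curly brace to a macro?" gave me strange errors):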

\documentclass{article}
\begin{document}

\ExplSyntaxOn
\str_new:N \l_my_str
\str_set:Nn \l_my_str {In~the~market~you~can~find~some~grapes, apples,~and~grapefruits.}
\let\myString\l_my_str

\ExplSyntaxOff
\makeatletter
% \autoForwardWords \commandToRunOnEachWord \stringToSearchOn
\def\autoForwardWords#1#2{%
  \def\robExt@tmp@word{}%
  \let\robExt@cmd@to@run#1%
  \message{AAAAAAAAA #2}%
  \edef\robExt@list@of@commands{%
    \noexpand\robExt@cmd@to@run\noexpand\bgroup%
    \expandafter\autoForwardWords@aux#2\robExt@end@of@string% \robExt@end@of@string marks the end of the string
  }%
  %% This shows the command to run, with two issues:
  %% 1) it removed spaces in the string
  %% 2) I can't find how to add braces instead of bgroups.
  %%    I tried https://tex.stackexchange.com/questions/506613/how-can-i-add-a-single-curly-brace-to-a-macro
  %%    but I was getting errors.
  %%\show\robExt@list@of@commands
  \robExt@list@of@commands
}

\def\autoForwardWords@aux#1{%
  \ifx#1\robExt@end@of@string% We arrived at the end of the string
    \noexpand\bgroup%
  \else%
    \ifnum`#1>\numexpr `a-1\relax%
      \ifnum`#1<\numexpr `z+1\relax%
        #1%
      \else%
        \noexpand\egroup\noexpand\robExt@cmd@to@run\noexpand\bgroup%
      \fi%
    \else%
      \ifnum`#1>\numexpr `A-1\relax%
        \ifnum`#1<\numexpr `Z+1\relax%
          #1%
        \else%
          \noexpand\egroup\noexpand\robExt@cmd@to@run\noexpand\bgroup%
        \fi%
      \else%
        \noexpand\egroup\noexpand\robExt@cmd@to@run\noexpand\bgroup%
      \fi%      
    \fi%
    \expandafter\autoForwardWords@aux% let it grab the next character
  \fi%
}
\def\robExt@end@of@string{}

\def\printWord#1{I saw --((#1))--.}
\autoForwardWords\printWord\myString
\makeatother

\end{document}

Answer 1

You can try this code:

\let\ea=\expandafter
\def\scanmacro#1{%
   \bgroup \settocomma { };:."?!@+=\{\}\relax
   \lowercase\ea{\ea\gdef\ea#1\ea{#1}}%
   \edef#1{\detokenize\ea{#1}}%
%  \message{\string#1: \meaning#1} % prints the modified format of the scanned macro
   \ea\egroup
   \ea\wordscan#1,\relax,%
}
\def\settocomma #1{\ifx\relax#1\else \lccode`#1=`, \ea\settocomma\fi}
\def\wordscan#1,{\ifx\relax#1\empty\else 
%  \message{{#1}}  % prints each scanned "word"
   \ifcsname doword:#1\endcsname \csname doword:#1\endcsname \fi
   \ea\wordscan\fi
}
\def\regword#1#2{\ea\gdef\csname doword:\string#1\endcsname{#2}}

\regword {grapes}  {\message{I like grapes.}}
\regword {find}    {\message{We are searching somewhat.}}

\def\mystring{In the {market} you can find some grapes, apples? and grapefruits.}

\scanmacro\mystring % runs \message{We are searching somewhat.}
                    % and \message{I like grapes.}

We use \lowercase to replace every occurrence of a non-letter character by a comma, and \detokenize to reset the catcode of these commas to that of a normal comma. So the macro

In the {market} you can find some grapes, apples? and grapefruits.

is modified to look like this:

in,the,,market,,you,can,find,some,grapes,,apples,,and,grapefruits,

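Then we scan this macro with \wordscan, whose parameter #1 is delimited by a comma, and process each scanned word individually. Note that there are several "empty words". This is not a problem, because the empty word is not registered. Removing the occurrences of ,, before running \wordscan would only add more useless computing time.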
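The comma-delimited scanning described above can be sketched in isolation as follows. This is only an illustration in plain TeX: the helper name \showwords is hypothetical, and it prints each item instead of looking up a registered word.

```tex
% \showwords is a hypothetical stand-in for \wordscan: it reads one
% comma-delimited item at a time and stops at the \relax sentinel.
\def\showwords#1,{%
  \ifx\relax#1\empty          % true only when the item is the \relax sentinel
  \else
    \message{[#1]}%           % an empty item is reported as []
    \expandafter\showwords    % remove \fi, then read the next item
  \fi}
\showwords in,the,,market,\relax,
\bye
```

The trailing \empty keeps the \ifx test well-formed when an item is empty: the test then compares \relax with \empty, which is false, so the empty item is simply processed like any other word.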

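You must write every non-letter character that you expect to appear in the scanned macro into the list of characters after \settocomma, which is terminated by \relax. Note that the initial { } stands for the space and the final \{\} stands for { and }, so these are replaced by commas too.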
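A minimal sketch of the underlying \lccode and \lowercase trick, assuming plain TeX; the macro name \demo is hypothetical and only two characters are remapped:

```tex
% Map ? and ! to a comma for the duration of the group only.
\begingroup
\lccode`\?=`\,
\lccode`\!=`\,
\lowercase{\endgroup \def\demo{a?b!c}}
\message{\meaning\demo}  % macro:->a,b,c
\bye
```

\lowercase changes only the character code of a token and keeps its category code. Here ? and ! already have catcode 12, so the resulting commas are ordinary ones. A space or a brace converted in the same way keeps its original catcode, which is why the code in this answer applies \detokenize afterwards.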

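The only thing this code does not handle is control sequences inside the scanned macro. We assume that none are present there. If that is not the case, you must add a second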

\edef#1{\detokenize\ea{#1}}%

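just before \lowercase. You can then decide whether \word should be interpreted as word (add \\ to the list of "to comma" characters) or should be ignored (do not add \\ to the "to comma" list). In the second case, you can register \word as something different from word.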
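To illustrate what such an extra \detokenize step does to a control sequence, here is a small sketch. It assumes an e-TeX enabled engine such as pdftex; \demo and \word are hypothetical names, and \word is deliberately left undefined.

```tex
% Requires an e-TeX enabled engine (for \detokenize), e.g. pdftex.
\def\demo{say \word now}            % \word is deliberately undefined
\edef\demo{\detokenize\expandafter{\demo}}
\message{\meaning\demo}
\bye
```

After the \edef, the macro contains only character tokens: a backslash, the letters of word, and a space that \detokenize appends after a control word. Whether the backslash is then mapped to a comma decides whether word or \word is the name looked up by \wordscan.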

Edit

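Following your comment about keeping uppercase letters, I created another approach that does not use \lowercase but runs a macro on every token in order to replace non-letter characters by commas. The advantage of this approach is that you need not run a macro over the list of "other characters" (which may be very large) nor over the list of all uppercase letters (which may also be very large in the Unicode set). The disadvantage is that processing each token with a macro is probably less efficient than \lowercase.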

\let\ea=\expandafter
\def\scanmacro#1{%
   \bgroup 
   \edef#1{\detokenize\ea{#1}}%
   \edef#1{\ea\replspace#1 \relax}%  replaces spaces to comma
   \edef#1{\ea\replothers#1\relax}%  replaces other characters to comma
%   \message{\string#1: \meaning#1} % prints the modified format
   \ea\egroup \ea\wordscan#1,\relax,%
}
\def\replspace#1 #2{#1\ifx#2\relax \else ,#2\ea\replspace\fi}
\def\replothers#1{\ifx#1\relax\else \ifnum\lccode`#1=0 ,\else #1\fi \ea\replothers\fi}
\def\wordscan#1,{\ifx\relax#1\empty\else 
%  \message{{#1}}  % prints each scanned "word"
   \ifcsname doword:#1\endcsname \csname doword:#1\endcsname \fi
   \ea\wordscan\fi
}
\def\regword#1#2{\ea\gdef\csname doword:\string#1\endcsname{#2}}

\regword {grapes}  {\message{I like grapes}}
\regword {find}    {\message{We are searching somewhat}}

\def\mystring{In the {market} you can find some grapes, apples? and grapefruits.}

\scanmacro\mystring % runs \message{We are searching somewhat}
                    % and \message{I like grapes}

\bye

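The main concept is the same: we replace spaces and non-letter characters by commas and run \wordscan. We identify non-letter characters as those whose \lccode is equal to zero.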
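The \lccode test on its own can be tried with the following sketch. It assumes plain TeX with its default \lccode table, where only letters have a nonzero \lccode; the helper name \isletter is hypothetical.

```tex
% \isletter is a hypothetical helper: it expands to "yes" when the
% character has a nonzero \lccode and to "no" otherwise.
\def\isletter#1{\ifnum\lccode`#1=0 no\else yes\fi}
\message{\isletter a \isletter ? \isletter Z}  % yes no yes
\bye
```

With that default table, digits also have \lccode zero, so they act as word separators in the same way as punctuation. Formats and engines that load a larger \lccode table (for example Unicode engines) may classify more characters as letters.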
