Fully expandable sanitizer

Answer 1

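I'm pleased to be able to teach Martin Scharrer something he didn't know :)

Fully expandable sanitizer

What follows is an implementation of the \Sanitize command that:

Completely removes all control sequences and balanced braces from its argument.
Does not choke on nested braces.
Preserves spaces where they are requested through either " " or "\ " (for use after macros).
Is fully expandable (i.e. it can be put in \edef or \csname).

Edit: This is a revised version. My initial code had several minor bugs that were a huge pain to fix, and this version is a significant rewrite. I think it's clearer, too.

How it works

There are three states: sanitizing spaces, sanitizing groups, and sanitizing tokens. We scan for one "word" at a time, and then within each "word" we look for groups that may be hiding spaces (TeX's macro scanner only absorbs delimited arguments with balanced braces, so a space inside a group is never seen as a delimiter). Finally, once we are confident that we are looking at truly contiguous tokens, we scan them one at a time and discard the control sequences, leaving only explicitly specified spaces (" " or "\ ").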
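
To illustrate how a group can hide a space from a space-delimited argument, here is a minimal sketch. \FirstWord and \StopHere are hypothetical names made up for this illustration; they are not part of the code in this answer.

```latex
\documentclass{article}
% \FirstWord is a hypothetical helper: #1 is delimited by a space token,
% #2 by the marker \StopHere (used only as a delimiter, never expanded).
\def\FirstWord#1 #2\StopHere{[\detokenize{#1}]}
\begin{document}
\ttfamily
% The space inside {bar baz} is not at the top level, so it cannot end #1.
% The first top-level space comes after the closing brace.
\FirstWord foo{bar baz} qux \StopHere
% prints [foo{bar baz}]
\end{document}
```

This is why finding the next "word" is not enough on its own: the word may still contain groups whose contents need the same space treatment.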

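From the inside out, here is how it operates:

\SanitizeTokens is a big nested conditional that tests its argument against various special cases. During the scan for spaces, all space characters were converted into \SanitizedSpace tokens; now they are converted into \RealSpace tokens. Both \SanitizedSpace and \SanitizeStop are macros that expand to themselves, which, since they are private, means that testing against them via \ifx is a reliable way to detect that exact control sequence (in the first version these were \countdef tokens, which have the same property but are less private).
\SanitizeGroups uses the tricky construction \def\SanitizeGroups#1#{ discussed in this question: macros with # as the last parameter. It is about the most legitimate use of it that I can imagine: it is intended to detect groups, which you can't do in any other way using ordinary macro expansion. Its #1 is guaranteed to be group-free, and since this follows space elimination, there are no spaces in it either, so we can run \SanitizeTokens straight off. Then we "enter" the group and go back to eliminating spaces.
\SanitizeSpaces uses pattern matching to grab the first chunk of text up to a space, of course excluding spaces inside groups. There is a technical trick here: every time this macro is used, there is a {} before the text. The point of that is so that the argument scanner doesn't remove the braces of a group that constitutes the entire "word" between spaces. If that happened, we would mistakenly treat its contents as already clear of spaces when in fact they are not. (Any uncleaned spaces would be eaten by \SanitizeTokens, because argument scanning ignores spaces.)
Of course there are a few utility macros. My favorite is \IfNoGapToStop, which is invoked as \IfNoGapToStop.X. \SanitizeStop, where X is the quantity that may contain a gap. If there is no gap, then the first gap is the visible space after the period; if there is a gap, then the two periods are in different components and both arguments of \IfNoGapToStop are nonempty.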
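
The #{ construction can be seen in isolation in the following minimal sketch. \UpToBrace and \ShowGroup are hypothetical names used only for this illustration; they play roughly the roles that \SanitizeGroups and \EnterGroup play in the actual code.

```latex
\documentclass{article}
% \UpToBrace is a hypothetical name for this illustration.
% The trailing #{ makes #1 run up to (but not including) the next opening
% brace; the brace group itself stays in the input stream.
\def\UpToBrace#1#{[\detokenize{#1}]\ShowGroup}
% \ShowGroup (also hypothetical) picks up the group left behind.
\def\ShowGroup#1{(\detokenize{#1})}
\begin{document}
\ttfamily
\UpToBrace abc{def ghi}
% prints [abc](def ghi)
\end{document}
```

Everything before the brace arrives as #1, guaranteed to contain no group, and the following macro receives the group as an ordinary undelimited argument.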

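In addition to the structural changes from the previous version, this one also correctly preserves spaces at group boundaries. (The previous version did not explicitly scan for groups but eliminated them as a side effect of absorbing tokens. This worked, but it also made it impossible to be sure whether you were looking at a group that might contain spaces rather than a single token.)

Oh, of course: the algorithm is no longer stupid. The previous version would repeatedly rescan the entire initial portion of the text as it looked for words (the point being not to "lose" those tokens before sanitizing them). Now I grab one word at a time, so it's no problem to give up each one as I look for the next. This turns a quadratic algorithm into a linear one.

This is no longer my preferred way of writing TeX (for that, you should read this answer: How to write readable commands), but pgfkeys is really not the tool for this sort of text parsing.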

\documentclass{article}

\makeatletter
\newcommand\Sanitize[1]{%
 \SanitizeSpaces{}#1 \SanitizeStop
}

% This loops through and replaces all spaces (outside brace groups) with \SanitizedSpace's.
% Then it goes for the control sequences.
% All calls to this should put a {} right before the content, to inhibit the gobbling of braces
% if there is a group right at the beginning.
\def\SanitizeSpaces#1 #2\SanitizeStop{%
 \IfEmpty{#2}% Last word
  {\IfEmpty{#1}% No content at all
   {}% Nothing to do
   {\SanitizeGroups#1{\SanitizeStop}}%
  }%
  % No need for a trailing space anymore: there's already one from the initial call
  {\SanitizeGroups#1\SanitizedSpace{\SanitizeStop}\SanitizeSpaces{}#2\SanitizeStop}%
}

% Sanitize tokens up to the next group, then go back to doing spaces.
\def\SanitizeGroups#1#{%
 \SanitizeTokens#1\SanitizeStop
 \EnterGroup
}

% Sanitize the next group from the top.
\newcommand\EnterGroup[1]{%
 \ifx\SanitizeStop#1%
  \expandafter\@gobble
 \else
  \expandafter\@firstofone
 \fi
 {\SanitizeSpaces{}#1 \SanitizeStop\SanitizeGroups}%
}

\newcommand\SanitizeTokens[1]{%
 \ifx\SanitizeStop#1%
 \else
  \ifx\SanitizedSpace#1%
   \RealSpace
  \else
   \ifx\ #1%
    \RealSpace
   \else
    \if\relax\noexpand#1%
    \else
     #1%
    \fi
   \fi
  \fi
  \expandafter\SanitizeTokens
 \fi
}

% We use TeX's proclivity to eat braces even for delimited arguments to eat the braces if #1 
% happens to be just {}, which we put in.
% Even if we didn't put it in, {} is going to get thrown out when \SanitizeSpaces gets to it.
\newcommand\IfEmpty[1]{%
 \IfOneTokenToStop.#1\SanitizeStop
  {% #1 has at most space tokens
   % and thus is nonempty if and only if there is a gap:
   \IfNoGapToStop.#1. \SanitizeStop
  }
  {% #1 has non-space tokens
   \@secondoftwo
  }%
}

% Checks for a gap in #1, meaning #2 is nonempty
% This should only be used with \IfEmpty
\def\IfNoGapToStop#1 #2\SanitizeStop{%
 % It's enough to check for one token, since #2 is never just spaces
 \IfOneTokenToStop.#2\SanitizeStop
}

\def\IfOneTokenToStop#1#2{% From \IfEmpty, #1 is always a .
 \ifx\SanitizeStop#2%
  % If #2 is multi-token, the rest of it will fall in the one-token case and be passed over.
  % If not, well, that's what we asked for.
  \expandafter\@firstoftwo
 \else
  \expandafter\GobbleToStopAndSecond
 \fi
}

\def\GobbleToStopAndSecond#1\SanitizeStop{%
 \@secondoftwo
}
\makeatother

\def\SanitizeStop{\SanitizeStop}
\def\SanitizedSpace{\SanitizedSpace}
\def\RealSpace{ }

\begin{document}
\setlength\parindent{0pt}\tt

% Torture test
\edef\a{%
 \Sanitize{ Word1 \macro{Word2 Word3}{\macro\ Word4}{ Word5} {Word6 }{}Word7{ }{{Word8}} }
}\meaning\a

\a
\medskip

% Examples
\edef\a{%
 \Sanitize{\emph{This} sentence has \TeX\ macros and {grouping}. }
}\meaning\a

\a
\medskip

\edef\a{%
 \Sanitize{{A}{ gratuitously {nested} sentence {}{{with many} layers}}.}
}\meaning\a

\a
\medskip

\end{document}

Answer 2

A simpler approach is to use \detokenize:

\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\makeatletter
\newcommand\removecs[1]{\expandafter\if@cs\detokenize{#1}\@nil{\expandafter\removecs@i\detokenize{#1}\@nil}{\detokenize{#1}}}
\begingroup
    \catcode`\|0 |catcode`|\12
    |catcode`|<1 |catcode`|>2
    |catcode`|{12 |catcode`|}12
    |gdef|if@firstisbrace#1<|if@firstisbrace@i#1{|@nil>
    |gdef|if@firstisbrace@i#1{#2|@nil<|csname @|ifx|@empty#1|@empty first|else second|fi oftwo|endcsname>
    |gdef|if@cs#1|@nil<|if@cs@i#1\|@nil>
    |gdef|if@cs@i#1\#2|@nil<|csname @|ifx|@empty#2|@empty second|else first|fi oftwo|endcsname>
    |gdef|remove@braces{#1}#2|@nil<|if@cs#1#2|@nil<|removecs@i#1#2|@nil><#1#2>>
    |gdef|removecs@i#1\#2 #3|@nil<%
        #1%
        |if@firstisbrace<#3>
            <|remove@braces#3|@nil>
            <|if@cs#3|@nil<|removecs@i#3|@nil><#3>>%
        >
|endgroup

\makeatother
\begin{document}
\removecs{abcd \textit{f\textbf{oo}bar} ijkl\foo{1}2\bar3{4}!}

\edef\foobar{\removecs{abcd \textit{f\textbf{oo}bar} ijkl\foo{1}2\bar3{4}!}}
\meaning\foobar
\end{document}
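
The catcode changes in the code above are needed because \detokenize turns its argument into plain character tokens: the backslash of every control sequence becomes an ordinary character, and a space is appended after each control word. The helper macros therefore have to be delimited by a backslash of category "other" and by that following space (the #1\#2 #3 pattern in \removecs@i). A minimal sketch to inspect this behaviour (the macro name \demo is just an illustration):

```latex
\documentclass{article}
\begin{document}
\ttfamily
% \detokenize is expandable, so it can be used inside \edef.
\edef\demo{\detokenize{abcd \textit{foo}}}
% Shows the stored characters; note the space inserted after \textit.
\meaning\demo
\end{document}
```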

Answer 3

If your text contains only "safe" text (no accented characters) and a limited number of control sequences, known a priori, then

\makeatletter
\def\StripControlSequences#1{%
  \begingroup
  \let\textit\@firstofone
  \let\textbf\@firstofone
  \edef\x{\endgroup#1}\x}
\makeatother

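Add the other desired control sequences, if needed. If the possible control sequences are all of the "one argument" type, this should be OK. Otherwise you need to examine the tokens one by one.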
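
As a usage sketch, here is the same macro in a complete document with one more one-argument command added. Treating \emph this way is an assumption made for this illustration; which commands to neutralize depends on your input. Note that this macro performs assignments, so unlike the sanitizer in the first answer it is not expandable.

```latex
\documentclass{article}
\makeatletter
\def\StripControlSequences#1{%
  \begingroup
  % Inside the group, each listed command just returns its argument.
  \let\textit\@firstofone
  \let\textbf\@firstofone
  \let\emph\@firstofone % added for this illustration
  % \edef expands #1 with the redefinitions active; \endgroup then
  % restores the original meanings before the text is typeset.
  \edef\x{\endgroup#1}\x}
\makeatother
\begin{document}
\StripControlSequences{John Q. Author, \textit{Book \emph{Title}}}
% typesets: John Q. Author, Book Title
\end{document}
```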

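However, if your case is not about printing the argument but about defining something based on it via \csname, then Martin's suggestion of \detokenize is good.

More context should be provided in order to suggest the best strategy. Some kind of key is probably more suitable, as a user probably shouldn't have to type

\whatever{John Q. Author, \textit{Book Title}}

in order to get

 <full citation string>

in their printout: using it will fail if the argument to \whatever isn't exactly as shown.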

Answer 4

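You can define one, but not in an expandable way (as far as I know). You need to read the tokens one by one (which needs non-expandable assignments), then test whether each one is a control sequence and strip it. If you can place the stripping macro in front of the \csname and have it define the result as an expandable macro, then this would work even in your case. However, the whole thing isn't that simple.

An alternative is to use the e-TeX primitive \detokenize{..} to turn the control sequences into normal text that can be used inside \csname: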

\csname\detokenize{#1}\endcsname

Here #1 can contain macros, which, however, will be taken as part of the name. If this isn't an issue, I would go with this.
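
A minimal sketch of how this could be used to store and retrieve something under such a name. The commands \storevalue and \usevalue and the stored: prefix are hypothetical, invented for this illustration.

```latex
\documentclass{article}
% \storevalue{<key>}{<text>} defines a macro whose name is built from the
% detokenized key; \usevalue{<key>} retrieves it. Both names are made up.
\newcommand\storevalue[2]{%
  \expandafter\def\csname stored:\detokenize{#1}\endcsname{#2}}
\newcommand\usevalue[1]{\csname stored:\detokenize{#1}\endcsname}
\begin{document}
\storevalue{John Q. Author, \textit{Book Title}}{full citation string}
% The key must be typed exactly as it was stored.
\usevalue{John Q. Author, \textit{Book Title}}
\end{document}
```

The control sequences in the key are never executed; they only contribute their characters to the name, so any difference in how the key is typed gives a different name.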

There is also a non-e-TeX way:

\def\@tempa{#1}%
\@onelevel@sanitize\@tempa
\csname\@tempa\endcsname
