Parsing text letter by letter

The PGF package has a module named parser that can parse a piece of text letter by letter, moving from an initial state to a final state.

For example, the following MWE parses the given text and counts the number of Z's, ignoring the b's and highlighting the a's:

\documentclass{article}
\usepackage{pgf}
\usepackage{soul}
\usepgfmodule{parser}
\begin{document}
\newcount\mycount

\pgfparserdef{myparser}{initial}{the letter Z}{\advance\mycount by 1\relax Z}
\pgfparserdef{myparser}{initial}{the letter a}{\hl{a}}
\pgfparserdef{myparser}{initial}{the letter b}{} % do nothing
\pgfparserdef{myparser}{initial}{the letter c}{\pgfparserswitch{final}}% done!

\pgfparserparse{myparser}ZZZZaabaabaZbabbbbbabaabcccc%

There are \the\mycount\ Z's.

\end{document} 

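If the sample text contains an unknown letter, the code raises an error (I would like the letter to be ignored instead). Is there an easy way to define such an action, or do I need to define every letter?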

Answer 1

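Some time ago I wrote a similar parser function with LuaLaTeX. I use it to read a text file, count and replace some characters, and insert some LaTeX commands into the output.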

\documentclass{book}
\usepackage{filecontents}

%create a test text file
\begin{filecontents*}{lorem.txt}
Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore 
magna aliquyam erat, sed diam voluptua. At vero eos et accusam 
et justo duo dolores et ea rebum.
\end{filecontents*}

%create a lua script file
\begin{filecontents*}{luaFunctions.lua}
function createReplaceTable()    
    replaceTable = {}

    -- create a table with all ASCII chars
    -- the name and(!) the value of each table item is the ASCII char
    -- this is important if the char shouldn't be replaced
    -- the table have 128 items each filled with the corresponding char
    for i = 1, 128, 1 do    
       replaceTable[string.char(i-1)] = string.char(i-1)
    end
end

function parseString(input)
    outputString = ""

    -- for each char in the given string we replace
    -- the char with the content of the table item;
    -- because the table items are named after the chars,
    -- the char itself is the key into the table
    -- (bytes outside the ASCII table are passed through unchanged)
    for i = 1, string.len(input) do
        char = input:sub(i, i)
        outputString = outputString..(replaceTable[char] or char)
    end

    tex.print(outputString)
end

function parseFile(fileName)
    -- open the file named by the argument
    local input = io.open(fileName, 'r')

    -- parse each line
    for line in input:lines() do
        parseString(line)
    end

    input:close()
end

function fillReplaceTable()
    -- here we fill/override the replacements for each ASCII char
    replaceTable["L"] = "\\textbf{\\large L}\\marginpar{\\tiny 'L'(\\stepcounter{counterForL}\\#\\thecounterForL)}"
    replaceTable["o"] = "\\underline{o}"
    replaceTable["e"] = ""
end
\end{filecontents*}    

% load the external Lua file to define the functions;
% none of them is called yet
\directlua{dofile("luaFunctions.lua")}

%create and fill the tables
\directlua{createReplaceTable()}
\directlua{fillReplaceTable()}

% latex commands to execute the lua functions
\def\parseString#1{\directlua{parseString("#1")}}
\def\parseFile#1{\directlua{parseFile("#1")}}

%counter for the letter 'L'
\newcounter{counterForL}

\begin{document}
\parseString{%
 Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.%
}

\parseFile{lorem.txt}
\end{document}

Answer 2

Do you want something like this?

\documentclass{article}
\usepackage{xparse}
\usepackage{xcolor} % provides \textcolor, used for the space action below
\usepackage{soul}
\ExplSyntaxOn
\NewDocumentCommand{\xparserdef}{mmmm}
  {
   \cs_new:cpn { xparser_name_#1_state_#2_#3: } { #4 }
  }
\NewDocumentCommand{\xparserparse}{mm}
  {
   \tl_set:Nn \l_xparser_state_tl { initial }
   \tl_set:Nx \l_tmpa_tl { \tl_to_str:n {#2} } 
   \tl_replace_all:NnV \l_tmpa_tl { ~ } \c_catcode_other_space_tl
   \tl_map_inline:Nn\l_tmpa_tl
     {
      \str_if_eq:VnF \l_xparser_state_tl { final }
        { \use:c { xparser_name_#1_state_ \l_xparser_state_tl _##1: } }
     }
  }
\tl_new:N \l_xparser_state_tl
\cs_generate_variant:Nn \tl_replace_all:Nnn {NnV}
\NewDocumentCommand{\xparserswitch}{m}
  {
   \tl_set:Nn \l_xparser_state_tl { #1 }
  }
\ExplSyntaxOff

\begin{document}
\newcount\mycount

\xparserdef{myparser}{initial}{Z}{\advance\mycount by 1\relax Z}
\xparserdef{myparser}{initial}{a}{\hl{a}}
\xparserdef{myparser}{initial}{b}{} % do nothing
\xparserdef{myparser}{initial}{ }{\textcolor{red}{S}}
\xparserdef{myparser}{initial}{c}{\xparserswitch{final}}% done!
\xparserdef{myparser}{initial}{|}{\xparserswitch{bar}}
\xparserdef{myparser}{bar}{|}{\xparserswitch{initial}}

\xparserparse{myparser}{ZZZZa ab|aabaZ|baZbbbbbabaabccccZ}

There are \the\mycount\ Z's.

\end{document} 

The definition of \xparserdef should probably check that its second argument is not final.
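A minimal sketch of such a check, assuming the same internal naming scheme as in the answer. The message name final-state and its wording are made up for illustration; the definition below is meant to stand in for the unguarded \xparserdef, not to be loaded alongside it.

```latex
\documentclass{article}
\usepackage{xparse}
\ExplSyntaxOn
% Error message used by the guard below (message name and text are illustrative).
\msg_new:nnn { xparser } { final-state }
  { Parser~'#1':~no~action~can~be~attached~to~the~state~'final'. }
% Same definition as in the answer, wrapped in a test on the state name (#2).
\NewDocumentCommand{\xparserdef}{mmmm}
  {
    \str_if_eq:nnTF { #2 } { final }
      { \msg_error:nnn { xparser } { final-state } { #1 } }
      { \cs_new:cpn { xparser_name_#1_state_#2_#3: } { #4 } }
  }
\ExplSyntaxOff
\begin{document}
\xparserdef{myparser}{initial}{Z}{Z}   % accepted
%\xparserdef{myparser}{final}{Z}{Z}    % would raise the error defined above
Definitions loaded.
\end{document}
```

Raising an error rather than silently dropping the definition keeps a mistyped state name visible; a warning would work just as well if a hard stop is not wanted.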

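Note that the Z between the two | characters is hidden, while the last Z is ignored because we are already in the final state. The macro also allows an action to be defined for a space (thanks to Bruno Le Floch for the suggestion).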

Answer 3

So far I have managed to keep the parser quiet and to save some typing by looping over the alphabet with \@tfor and creating the macros.

I also tried PGF's \foreach, but without success; I would appreciate some pointers on that.

% Letter definitions
\@tfor\next:=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890[].;:-=\  \do{%
  \def\command@factory#1{%
  \pgfparserdef{myparser}{initial}{\meaning #1}{\textcolor{purple}{#1}}%
  }
 \expandafter\command@factory\next
}
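The third argument of \pgfparserdef in the loop above is built with \meaning, which is why the hand-written definitions use strings such as "the letter Z". A small sketch showing what \meaning produces for a few character tokens with their default category codes:

```latex
\documentclass{article}
\begin{document}
\ttfamily
\meaning a\par   % typesets: the letter a
\meaning Z\par   % typesets: the letter Z
\meaning !\par   % typesets: the character !
\meaning 1\par   % typesets: the character 1
\end{document}
```

Letters (category code 11) are reported as "the letter ..." and other characters such as digits and punctuation (category code 12) as "the character ...", so the loop yields the same state keys one would otherwise type by hand.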

For the space, it seems to work if it is entered as \ (a control space), although it would be better if the parser worked without such manual markup.

Interestingly, if \lipsum is added to the letters above, i.e.

 abcdef\lipsum g...

it is parsed as a single token and expanded in the MWE below (its whole output is printed in purple).

\documentclass{article}
\usepackage{lipsum}
\usepackage{pgf}
\usepackage{soul}
\usepgfmodule{parser}
\usepackage{pgffor}
\begin{document}
\makeatletter
\newcount\mycount

% Letter definitions
\@tfor\next:=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890[].;:-=\lipsum\space\ \do{%
  \def\command@factory#1{%
  \pgfparserdef{myparser}{initial}{\meaning #1}{\textcolor{purple}{#1}}%
  }
 \expandafter\command@factory\next
}

%\foreach \x in {a,...,z} {\command@factory\x}

\pgfparserdef{myparser}{initial}{the letter Z}{\advance\mycount by 1\relax Z}
\pgfparserdef{myparser}{initial}{the letter a}{\hl{a}}
\pgfparserdef{myparser}{initial}{the letter b}{} % do nothing
\pgfparserdef{myparser}{initial}{the letter c}{c} % print c unchanged
\pgfparserdef{myparser}{initial}{the letter G}{\textcolor{blue}{George}}
\pgfparserdef{myparser}{initial}{the character !}{\pgfparserswitch{final}}% done!

\pgfparserparse{myparser}ZZZZaabaabaZebabbdbQG012\ 345booopsbabaabggg[g][1=;].cccc\lipsum!! 

\end{document}
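On the \foreach attempt that is commented out in the MWE above: one likely reason it fails, assuming \pgfparserdef makes ordinary local definitions, is that \foreach runs every iteration inside a TeX group, so anything defined locally in the loop body is gone once the iteration ends. The sketch below demonstrates only this grouping effect; the control sequence names local@... and global@... are made up for the demonstration.

```latex
\documentclass{article}
\usepackage{pgffor}
\begin{document}
% Local definitions made inside the loop body do not survive the iteration.
\foreach \x in {a,b,c} {\expandafter\def\csname local@\x\endcsname{\x}}
\ifcsname local@a\endcsname defined\else lost\fi   % typesets: lost

% Global definitions escape the group that \foreach opens.
\foreach \x in {a,b,c} {\expandafter\xdef\csname global@\x\endcsname{\x}}
\ifcsname global@a\endcsname defined\else lost\fi  % typesets: defined
\end{document}
```

\@tfor does not open a group, which would explain why the loop in this answer works. Note also that the commented-out line passes \x itself rather than its expansion, so \meaning would report a macro (something like macro:->a) instead of "the letter a"; an \expandafter before \command@factory, as in the \@tfor loop, would be needed as well.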

Answer 4

\documentclass{article}
\usepackage[dvipsnames]{xcolor}
\usepackage{soul}
\usepackage{ltxkeys}
\makeatletter

\new@def*\cptifcmdeqTF#1{\expandafter\ifcseqTF\cpt@car#1\cpt@quark\car@nil}
\cptswap{ }{\let\cptblankspace= }
\new@def*\cptendparse{\@gobble\cptendparse}
\newletcs\cptstopparse\cptendparse
\ltxkeys@declarekeys*[CPT]{parserparse}[cpt@parser@]{%
  cmd/id/currparser;
  cmd/state/initial;
}
% \cptparserdef{<parserid>}{<state>}{\meaning<token>}{<defn>}
\robust@def\cptparserdef#1#2#3#4{%
  \long\csn@edef{cpt@parser@#1@#2@#3}{\unexpanded{#4}}%
}
% \ParserParseDef{<keyval>}{<tokenlist>}{<defn>}
% Use '#1' in <defn> to access the current token of <tokenlist>.
\robust@def*\ParserParseDef{\cpt@teststopt\cpt@ParserParseDef{}}
\robust@def\cpt@ParserParseDef[#1]#2#3{%
  \let\ifcpt@parser@st\ifcpt@st
  \ltxkeys@launchkeys[CPT]{parserparse}{#1}%
  \ifcpt@parser@st\expandafter\expandafter\fi
  \cpttfor#2\dofor{%
    \cptifcmdeqTF{##1}\cptblankspace{%
      % Current system definition for space token:
      \cptparserdef{\cpt@parser@id}{\cpt@parser@state}
        {blank space\@space\@space}{\@space}%
    }{%
      \edef\parser@tempa{\cpttrimspace{##1}}%
      \edef\parser@tempa{\expandafter\meaning\parser@tempa}%
      \cptparserdef{\cpt@parser@id}{\cpt@parser@state}{\parser@tempa}{#3}%
    }%
  }%
}
% \ParserParseSelectDef{<keyval>}{<tokenlist>}
% <tokenlist> -> {<token>}{<defn>}
% You can use '#1' in <defn> to access the first token of the current
% pair of <tokenlist>.
\robust@def*\ParserParseSelectDef{\cpt@teststopt\cpt@ParserParseSelectDef{}}
\robust@def\cpt@ParserParseSelectDef[#1]#2{%
  \let\ifcpt@parser@st\ifcpt@st
  \ltxkeys@launchkeys[CPT]{parserparse}{#1}%
  \begingroup
  \@tempcnta\z@pt
  \def\parser@do##1{%
    \cptifcmdeqTF{##1}\parser@do{}{%
      \advance\@tempcnta\@ne\parser@do
    }%
  }%
  \ifcpt@parser@st\expandafter\expandafter\fi
  \parser@do#2\parser@do
  \ifodd\@tempcnta
    \cpt@err{User list items not pairwise balanced}
      {List items for \noexpand\ParserParseSelectDef
      must be even in number}%
  \fi
  \endgroup
  \def\parser@do##1##2{%
    \cptifcmdeqTF{##1}\parser@do{}{%
      \cptifcmdeqTF{##1}\cptblankspace{%
        % Current user definition for space token:
        \cptparserdef{\cpt@parser@id}{\cpt@parser@state}
          {blank space\@space\@space}{##2}%
      }{%
        \edef\parser@tempa{\cpttrimspace{##1}}%
        \edef\parser@tempa{\expandafter\meaning\parser@tempa}%
        % This trick is to enable '#1' to be used in <defn> to access the
        % first token of the current pair of <tokenlist>.
        \def\reserved@a####1{\@temptokena{##2}}%
        \reserved@a{##1}%
        \def\reserved@a####1{%
          \cptparserdef{\cpt@parser@id}{\cpt@parser@state}{\parser@tempa}{####1}%
        }%
        \expandafter\reserved@a\expandafter{\the\@temptokena}%
      }%
      \parser@do
    }%
  }%
  \ifcpt@parser@st\expandafter\expandafter\fi
  \parser@do#2\parser@do\parser@do
}
\robust@def*\cptparserparse{\cpt@testopt\cpt@parserparse@a{}}
\robust@def*\cpt@parserparse@a[#1]{%
  \ltxkeys@launchkeys[CPT]{parserparse}{#1}%
  % Gobble any space after ']'. If the user needs any space after ']',
  % he has to insert an explicit space token:
  \expandafter\cpt@parserparse@b\romannumeral-`\q\noexpand
}
\robust@def*\cpt@parserparse@b{%
  \futurelet\cpt@parsersymbol\cpt@parserparse@c
}
\robust@def*\cpt@parserparse@c{%
  \ifx\cpt@parsersymbol\cptendparse
    \let\cpt@parseraction\relax
    \def\cpt@parserparse@d{\let\cpt@parserignore=}%
  \else
    \def\cpt@parserparse@d{%
      \afterassignment\cpt@parserparse@b\let\cpt@parserignore= %
    }%
    \letcstocsn\cpt@parseraction{cpt@parser@\cpt@parser@id
      @\cpt@parser@state @\meaning\cpt@parsersymbol}%
    \ifdefTF\cpt@parseraction{}{%
      \cpt@err{Unexpected character '\meaning\cpt@parsersymbol'
        in parser '\cpt@parser@id' of state '\cpt@parser@state'}\@ehc
    }%
  \fi
  \cpt@parseraction
  \cpt@parserparse@d
}
\edef\cptparserchars{%
  abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ%
  1234567890\cpt@otherchars\noexpand\cptblankspace
}

% Examples:
\newcommand*\sometext[1][1]{%
  \cptdotimes{#1}{Here is some sample text that should fit in the given space.}
}
\cptrobustify\sometext

% Default system definitions; initialization is by the user:
\ParserParseDef*[id=myparser,state=initial]\cptparserchars
{\textcolor{orange}{#1}}

% Peculiar user definitions:
\ParserParseSelectDef[id=myparser,state=initial]{%
  {Z}{\advance\mycount\@ne\textcolor{red}{\fbox{#1}}}
  {a}{\hl{a}} {b}{} {c}{c}
  {G}{\textcolor{blue}{George}}
  {!}{This is exclamation mark.}
  {\cptblankspace}{\textcolor{green}{\texttt{@}}}
  {\sometext}{\textcolor{purple}{#1}}
}

\makeatother

\begin{document}
\newcount\mycount
\noindent
\cptparserparse[id=myparser,state=initial]
Z ABC XYZ aabaaba Z ebab QOG 012345 booops babaab egg [foo]/[1=;].cccc.
\sometext!\cptendparse

\par\medskip\noindent
Number of character \textcolor{red}{\texttt{\fbox{Z}}} in token list: \number\mycount.

\end{document}

To do

What should happen to brace groups in the given token list? For example, what should happen to {ZZZaababa} in the following?

\cptparserparse{myparser}Z {ZZZaababa}.cccc!

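Should parsing restart locally inside {ZZZaababa}? In selective sanitization, each brace group is processed according to instructions specific to that brace group.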

Also, to which nesting level of brace groups should these instructions apply? For example, how deep should parsing go in

\cptparserparse{myparser}Z {{{{{x{ZZZaababa}}}}}}.cccc!
