动态计算并返回某一部分的字数

动态计算并返回某一部分的字数

我正在写一份提案,需要在提案本身中给出主要部分的字数。我找到了关于如何在 LaTeX 文档中获取字数的答案:

有没有什么方法可以正确统计 LaTeX 文档的字数?

但这仅适用于整个文档。我想知道是否有可能在 LaTeX 文档中获取文本部分/章节的字数,并在文本中实际返回该字数。

我愿意听取有关 Sweave 的建议。我能想到的一件事是将整个部分写在 Sweave 块中的 R 字符串中。然后只需从中获取字数并返回字符串即可。但是,这将包括所有使用的 LaTeX 代码,而不仅仅是预期的单词。我想到的另一种解决方案是扩展此 R 代码以将字符串包装在一个空文档中,编译它并使用可用的工具从中获取字数,然后返回它。这似乎可行,但我希望有一个更简单的解决方案。

答案1

您可以使用它texcount来统计单词数。它会自动为各个部分生成子计数。

这是一个新的宏,它调用texcount,提取当前部分的子计数,然后将字数插入文档。它需要write18启用,并且texcount必须位于您的路径中(或者您必须在宏中包含可执行文件的完整路径)。

\documentclass{article}
\newcommand\wordcount{
    \immediate\write18{texcount -sub=section \jobname.tex  | grep "Section" | sed -e 's/+.*//' | sed -n \thesection p > 'count.txt'}
(\input{count.txt}words)}

\begin{document}
\section{Introduction}
In publishing and graphic design, lorem ipsum is placeholder text (filler text) commonly used to demonstrate the graphics elements of a document or visual presentation, such as font, typography, and layout. The lorem ipsum text is typically a section of a Latin text by Cicero with words altered, added and removed that make it nonsensical in meaning and not proper Latin.

\wordcount
\section{Main Stuff}
Even though "lorem ipsum" may arouse curiosity because of its resemblance to classical Latin, it is not intended to have meaning. Where text is comprehensible in a document, people tend to focus on the textual content rather than upon overall presentation, so publishers use lorem ipsum when displaying a typeface or design elements and page layout in order to direct the focus to the publication style and not the meaning of the text. In spite of its basis in Latin, use of lorem ipsum is often referred to as greeking, from the phrase "it's all Greek to me," which indicates that this is not meant to be readable text.

 \wordcount
\section{Conclusion}
Today's popular version of lorem ipsum was first created for Aldus Corporation's first desktop publishing program Aldus PageMaker in the mid-1980s for the Apple Macintosh. Art director Laura Perry adapted older forms of the lorem text from typography samples — it was, for example, widely used in Letraset catalogs in the 1960s and 1970s (anecdotes suggest that the original use of the "Lorem ipsum" text was by Letraset, which was used for print layouts by advertising agencies as early as the 1970s.) The text was frequently used in PageMaker templates.

\wordcount
\end{document}

答案2

这是使用我的tokcycle软件包解决该问题的纯 LaTeX 解决方案(我甚至在最后提供了一个纯 TeX 解决方案!!)。您将需要从 2021-03-10 开始的新版本,因为我使用新\xtokcycleenvironment功能定义我的环境,它不仅允许定义 token-cycle 指令,还允许在 token-cycle 之前和之后执行额外的“设置”和“关闭”代码。

我注意到,即使在回答更广泛的问题中,有没有什么方法可以正确统计 LaTeX 文档的字数?,该问题没有其他纯乳胶解决方案。

这种方法引入了伪环境\countem...\endcountem,它将同时计算单词(定义为由 cat-11 标记以外的任何内容界定的 cat-11 标记字符串 - 因此,带连字符的复合词分别计算)和字母(只有 cat-11 标记才算作字母)。因为它是一个环境,所以您可以将其应用于文档的一部分,并在文档内多次应用它。

在最简单的用法中,环境将填充两个计数器,wordcountlettercount填充您可以测试、排版或执行其他操作的值。但它提供了几个高级功能。

  1. 如果将计数器设置为正数,则当运行字数超过时,wordlimit它将把计数文本的颜色更改为\limitcolor(默认)red\value{wordlimit}

  2. 如果您设置了\runningcounttrue,您将获得一个上标单词,并且每个单词后都会排版字母计数。

  3. 如果您设置了\summarycounttrue,您将获得在环境结束时排版的总单词数和字母数。

环境可以消化和处理输入流中的宏,本质上将它们回显到输出(而不将宏标记计为字母或单词)。但要小心!包含 cat-11 标记的宏参数将计为字母和单词。此外,如果\runningcountrue,添加的上标将污染宏参数。解决这些问题的方法是使用tokcycle转义功能,|...|它将...标记回显到输出,而无需通过标记循环指令处理它们。示例:counted input |\rule{2ex}{1ex}| more counted input

一个非常好的功能(由tokcycle逻辑启用)是组不会对功能造成任何问题。例如,如果您超出参数wordlimit内部\textit,则颜色会在那里更改,并且即使\textit退出参数也会继续更改。

如果想要对整个文档进行计数,只需\countem立即从之后开始\begin{document},然后\endcountem立即在之前开始\end{document}

\documentclass{article}
\usepackage{tokcycle}[2021-03-10]
\usepackage{xcolor}
\newcounter{wordcount}
\newcounter{lettercount}
\newcounter{wordlimit}
\newif\ifinword
% USER PARAMETERS
\newif\ifrunningcount
\newif\ifsummarycount
\def\limitcolor{red}
\setcounter{wordlimit}{0}
%%
\makeatletter
% \tc@defx is like \def, but expands the replacement text once prior to assignment
\newcommand\addtomacro[2]{\tc@defx#1{#1#2}}
\newcommand\changecolor[1]{\tctestifx{.#1}{}{\addcytoks{\color{#1}{}}%
  \tc@defx\currentcolor{#1}}}
\makeatother
\newcommand\dumpword{%
  \addcytoks[1]{\accumword}%
  \ifinword\stepcounter{wordcount}
    \ifrunningcount\addcytoks[x]{$^{\thewordcount,\thelettercount}$}\fi
    \ifnum\thewordcount=\value{wordlimit}\relax\changecolor{\limitcolor}\fi
  \fi%
  \inwordfalse
  \def\accumword{}}
\newcommand\addletter[1]{%
  \tctestifcatnx A#1{\stepcounter{lettercount}\inwordtrue}{\dumpword}%
  \addtomacro\accumword{#1}}
\xtokcycleenvironment\countem
  {\addletter{##1}}
  {\dumpword\groupedcytoks{\processtoks{##1}\dumpword\expandafter}\expandafter
    \changecolor\expandafter{\currentcolor}}
  {\dumpword\addcytoks{##1}}
  {\dumpword\addcytoks{##1}}
  {\stripgroupingtrue\def\accumword{}\def\currentcolor{.}
    \setcounter{wordcount}{0}\setcounter{lettercount}{0}}
  {\dumpword\ifsummarycount\tcafterenv{%
    \par(Wordcount=\thewordcount, Lettercount=\thelettercount)}\fi}
\begin{document}
\countem 
This is a test if word counting occurs.
\endcountem
I know there were \thewordcount{} words and \thelettercount{} letters in the
  prior sentence.

\bigskip\setcounter{wordlimit}{7}
\countem 
This is a test if color changes after seven words.
\endcountem

\bigskip\runningcounttrue
\countem
This is a running count.

But...punctuation does not count as characters.
\endcountem

\bigskip
\countem 
This \textbf{is a} running count. |\textsc{Skipping macros 
\rule{1ex}{1.5ex}/other text}|

But...\textit{punctuation does} not count as characters.
\endcountem

\bigskip\runningcountfalse\summarycounttrue\setcounter{wordlimit}{125}
\countem
As any dedicated reader can clearly see, the Ideal of 
practical reason is a representation of, as far as I know, the things
in themselves; as I have shown elsewhere, the phenomena should only be
used as a canon for our understanding. The paralogisms of practical
reason are what first give rise to the architectonic of practical
reason. As will easily be shown in the next section, reason would
thereby be made to contradict, in view of these considerations, the
Ideal of practical reason, yet the manifold depends on the phenomena.
Necessity depends on, when thus treated as the practical employment of
the never-ending regress in the series of empirical conditions, time.
Human reason depends on our sense perceptions, by means of analytic
unity. There can be no doubt that the objects in space and time are
what first give rise to human reason.

Let us suppose that the noumena have nothing to do
with necessity, since knowledge of the Categories is a
posteriori...
\endcountem
\end{document}

在 MWE 中,第一个例子是基本的例子。句子的单词和字母被计数并在wordcount和中提供lettercount

在第二个例子中,wordlimit设置后我们观察到文本的颜色在 7 个单词之后发生变化。

在第三个示例中,启用了计数。可以观察到空格和标点不算作字母,但可以用作分隔单词(在 的情况下...)。

在第四个示例中,演示了几个附加功能:包含宏序列的文本在|...|分隔符之间进行转义,以避免污染宏参数。此外,我们注意到斜体内部的颜色变化团体文本。尽管如此,团体结束行为符合预期,并且不会对单词/字母计数造成干扰,也不会对 造成干扰limitcolor

在第 5 个也是最后一个例子中,由于文本块较长,我们提高wordlimit并关闭运行计数,而是采用\countem文本块的摘要计数。

在此处输入图片描述

奖励:纯 TeX 解决方案!!

这是针对 Plain TeX 改编的上述答案。(感谢 David 提供的聊天帮助\input color

%%%
% MAKE COLOR AVAILABLE
\input color    
%%%
%%% COLOR INPUT COMPLETE, NOW LET"S CREATE A TOKEN-CYCLE
%
\input tokcycle
\newcount\wordcount
\newcount\lettercount
\newcount\wordlimit
\newif\ifinword
% USER PARAMETERS
\newif\ifrunningcount
\newif\ifsummarycount
\def\limitcolor{red}
\wordlimit=0\relax
%%
\catcode`@=11
% \tc@defx is like \def, but expands the replacement text once prior to assignment
\def\addtomacro#1#2{\tc@defx#1{#1#2}}
\def\changecolor#1{\tctestifx{.#1}{}{\addcytoks{\color{#1}{}}%
  \tc@defx\countemcurrentcolor{#1}}}
\catcode`@=12
\def\dumpword{%
  \addcytoks[1]{\accumword}%
  \ifinword\global\advance\wordcount 1\relax
    \ifrunningcount\addcytoks[x]{$^{\the\wordcount,\the\lettercount}$}\fi
    \ifnum\wordcount=\wordlimit\relax\changecolor{\limitcolor}\fi
  \fi%
  \inwordfalse
  \def\accumword{}}
\def\addletter#1{%
  \tctestifcatnx A#1{\global\advance\lettercount 1\relax\inwordtrue}{\dumpword}%
  \addtomacro\accumword{#1}}
\xtokcycleenvironment\countem
  {\addletter{##1}}
  {\dumpword\groupedcytoks{\processtoks{##1}\dumpword\expandafter}\expandafter
    \changecolor\expandafter{\countemcurrentcolor}}
  {\dumpword\addcytoks{##1}}
  {\dumpword\addcytoks{##1}}
  {\stripgroupingtrue\def\accumword{}\def\countemcurrentcolor{.}
    \global\wordcount=0\relax\global\lettercount 0\relax}
  {\dumpword\ifsummarycount\tcafterenv{%
    \par(Wordcount=\the\wordcount, Lettercount=\the\lettercount)}\fi}

% SETUP COMPLETE.  NOW FOR THE DOCUMENT

\countem 
This is a test if word counting occurs.
\endcountem
I know there were \the\wordcount{} words and \the\lettercount{} letters in the
  prior sentence.

\bigskip\wordlimit=7\relax
\countem 
This is a test if color changes after seven words.
\endcountem

\bigskip\runningcounttrue
\countem
This is a running count.

But...punctuation does not count as characters.
\endcountem

\bigskip
\countem 
This {\bf is a} running count. |{Skipping macros 
rule{1ex}{1.5ex}/other text}|

But...{\it punctuation does} not count as characters.
\endcountem

\bigskip\runningcountfalse\summarycounttrue\wordlimit=125\relax
\countem
As any dedicated reader can clearly see, the Ideal of 
practical reason is a representation of, as far as I know, the things
in themselves; as I have shown elsewhere, the phenomena should only be
used as a canon for our understanding. The paralogisms of practical
reason are what first give rise to the architectonic of practical
reason. As will easily be shown in the next section, reason would
thereby be made to contradict, in view of these considerations, the
Ideal of practical reason, yet the manifold depends on the phenomena.
Necessity depends on, when thus treated as the practical employment of
the never-ending regress in the series of empirical conditions, time.
Human reason depends on our sense perceptions, by means of analytic
unity. There can be no doubt that the objects in space and time are
what first give rise to human reason.

Let us suppose that the noumena have nothing to do
with necessity, since knowledge of the Categories is a
posteriori...
\endcountem
\bye

在此处输入图片描述

答案3

您要求提供 LaTeX 解决方案,但为了完善,我还将提供 ConTeXt 解决方案,这可能会对某些人有所帮助。它不依赖于外部程序。

\startluacode
    userdata = userdata or { }

    function userdata.wordcount(listname)
        filename = file.addsuffix(tex.jobname,"words")
        if lfs.isfile(filename) then
            local w = dofile(filename)
            if w then
                if type(w.categories[listname]) == "table" then
                    context(w.categories[listname].total)
                else
                    context(w.total)
                end
                context.par()
            end
        end
    end
\stopluacode

\def\wordcount{%
    \dosingleempty\dowordcount}

\def\dowordcount[#1]{%
    \ctxlua{userdata.wordcount("#1")}}

\starttext

% Set up the word count
\ctxlua{languages.words.threshold=2}
\setupspellchecking [state=start, method=2]

\setupspellchecking [list=foo]
\startsection [title=Foo]
    Foo Bar
\stopsection

\setupspellchecking [list=lorem]
\startsection [title=Lorem]
    Lorem ipsum dolor sit
\stopsection

\setupspellchecking [list=stop]

\startsubject [title=Statistics]
    Words in Foo:   \wordcount [foo]
    Words in Lorem: \wordcount [lorem]
\stopsubject

\stoptext

结果:

结果

\jobname.wordslua 函数读取由命令创建的外部文件\setupspellchecking。它返回一个包含一些统计信息的 lua 表并提取相关数据。\wordcount这只是一个不错的包装器,用于保持界面“上下文化”。

默认情况下,只计算至少四个字符的单词。在示例中,我将阈值设置为两个字符。

注意:示例中,章节标题中的文本也被计算在内。与通常的 ConTeXt 行为不同,此字数统计实现需要两次上下文运行。

该代码是 ConTeXt 分布 () 中代码的修改版本s-lan-03

答案4

如果您使用带有 Auctex 的 Emacs,则可以选择当前区域的文本并在其上运行 Latex(通常通过C-c C-r键绑定,映射到TeX-command-region)。这将创建一个_region_.tex可以编译或texcount编辑的文件,其主体仅包含突出显示的 Latex 文档部分;同样,可以使用 tex4ht 或 PDF 工具等可以对输出进行字数统计的工具来生成各种各样的字数估计值。

其他几个 Latex 编辑器支持类似的“区域编译”,例如 Kile。请参阅LaTeX 编辑器/IDE

相关内容