How to extract a series of text passages that are not in an environment but contain defined commands

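Under Windows 8, I use the R package knitr with .Rnw scripts in RStudio. My regular-expression skills are passable, but only in R.

My document has about 30 pages, and each page has a different section of text at the bottom. The text sections all begin the same way, with \subsubsection{\textcolor{blue}{\textsf{R} Note: and end with \clearpage.

How can I extract all of the text sections that follow "Note:" into a plain text file?

Here is my MWE: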

\documentclass[11pt]{article}     
\begin{document}

\section{How to Find and Extract Text within Fixed Markers}

\noindent\rule{150pt}{1.7pt}  
\vspace{2pt}
\subsubsection{\textcolor{blue}{\textsf{R} Note: History and Current Status of \textsf{R}}}:  \textsf{R} is an open source software platform used to analyze data.  Written originally in the early 1990's, it is an \textbf{object-oriented}, \textbf{interpreted} programming language blessed with a wealth of constantly-growing and improving "packages" and a vibrant user community.

\clearpage

\section{Each Note is at the bottom of a page.  All have the same start up to Note: They end with clearpage}

\noindent\rule{150pt}{1.7pt}  
\vspace{2pt}
\subsubsection{\textcolor{blue}{\textsf{R} Note: Different text}}:  Some other text.  **There are NO OTHER COMMANDS after the colon except \textsf{R} and all of it consists of ONE LINE OF TEXT.**

\clearpage

\end{document}

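Several questions here look like they could guide me, but although this is probably a very simple problem, I cannot work out how to do it.

Using the xstring package

Extracting from a text string

Relying on text within an environment

Here is the log file: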

This is XeTeX, Version 3.14159265-2.6-0.99991 (MiKTeX 2.9 64-bit) (preloaded format=xelatex 2015.12.3)  3 FEB 2016 20:03
entering extended mode
**soquestion.tex
(soquestion.tex
LaTeX2e <2014/05/01>
Babel <3.9l> and hyphenation patterns for 69 languages loaded.
("C:\Program Files\MiKTeX 2.9\tex\latex\base\article.cls"
Document Class: article 2014/09/29 v1.4h Standard LaTeX document class
("C:\Program Files\MiKTeX 2.9\tex\latex\base\size11.clo"
File: size11.clo 2014/09/29 v1.4h Standard LaTeX file (size option)
Requested font "cmr10" at 10.95pt
)
\c@part=\count80
\c@section=\count81
\c@subsection=\count82
\c@subsubsection=\count83
\c@paragraph=\count84
\c@subparagraph=\count85
\c@figure=\count86
\c@table=\count87
\abovecaptionskip=\skip41
\belowcaptionskip=\skip42
\bibindent=\dimen102
)
("C:\Program Files\MiKTeX 2.9\tex\latex\graphics\graphicx.sty"
Package: graphicx 2014/10/28 v1.0g Enhanced LaTeX Graphics (DPC,SPQR)

("C:\Program Files\MiKTeX 2.9\tex\latex\graphics\keyval.sty"
Package: keyval 2014/10/28 v1.15 key=value parser (DPC)
\KV@toks@=\toks14
)
("C:\Program Files\MiKTeX 2.9\tex\latex\graphics\graphics.sty"
Package: graphics 2014/10/28 v1.0p Standard LaTeX Graphics (DPC,SPQR)

("C:\Program Files\MiKTeX 2.9\tex\latex\graphics\trig.sty"
Package: trig 1999/03/16 v1.09 sin cos tan (DPC)
)
("C:\Program Files\MiKTeX 2.9\tex\latex\00miktex\graphics.cfg"
File: graphics.cfg 2007/01/18 v1.5 graphics configuration of teTeX/TeXLive
)
Package graphics Info: Driver file: xetex.def on input line 91.

("C:\Program Files\MiKTeX 2.9\tex\xelatex\xetex-def\xetex.def"
File: xetex.def 2015/03/25 v4.04 LaTeX color/graphics driver for XeTeX (TeX Liv
e/RRM/JK)
))
\Gin@req@height=\dimen103
\Gin@req@width=\dimen104
)
("C:\Program Files\MiKTeX 2.9\tex\latex\graphics\color.sty"
Package: color 2014/10/28 v1.1a Standard LaTeX Color (DPC)

("C:\Program Files\MiKTeX 2.9\tex\latex\00miktex\color.cfg"
File: color.cfg 2007/01/18 v1.5 color configuration of teTeX/TeXLive
)
Package color Info: Driver file: xetex.def on input line 137.
) (framed.sty
Package: framed 2011/10/22 v 0.96: framed or shaded text with page breaks
\OuterFrameSep=\skip43
\fb@frw=\dimen105
\fb@frh=\dimen106
\FrameRule=\dimen107
\FrameSep=\dimen108
)
("C:\Program Files\MiKTeX 2.9\tex\latex\base\alltt.sty"
Package: alltt 1997/06/16 v2.0g defines alltt environment
)
("C:\Program Files\MiKTeX 2.9\tex\latex\upquote\upquote.sty"
Package: upquote 2012/04/19 v1.3 upright-quote and grave-accent glyphs in verba
tim
)

LaTeX Warning: Unused global option(s):
    [table].

(soquestion.aux)
LaTeX Font Info:    Checking defaults for OML/cmm/m/it on input line 52.
LaTeX Font Info:    ... okay on input line 52.
LaTeX Font Info:    Checking defaults for T1/cmr/m/n on input line 52.
LaTeX Font Info:    ... okay on input line 52.
LaTeX Font Info:    Checking defaults for OT1/cmr/m/n on input line 52.
LaTeX Font Info:    ... okay on input line 52.
LaTeX Font Info:    Checking defaults for OMS/cmsy/m/n on input line 52.
LaTeX Font Info:    ... okay on input line 52.
LaTeX Font Info:    Checking defaults for OMX/cmex/m/n on input line 52.
LaTeX Font Info:    ... okay on input line 52.
LaTeX Font Info:    Checking defaults for U/cmr/m/n on input line 52.
LaTeX Font Info:    ... okay on input line 52.
Requested font "cmr12" at 14.4pt
Requested font "cmbx12" at 14.4pt
Requested font "cmbx10" at 10.95pt
Requested font "cmssbx10" at 10.95pt
Requested font "cmss10" at 10.95pt
 [1

]
Overfull \hbox (21.13795pt too wide) in paragraph at lines 62--62
\OT1/cmr/bx/n/14.4 the same start up to Note: They end with clearpage 
 []

[2

] (soquestion.aux) ) 
Here is how much of TeX's memory you used:
 715 strings out of 428783
 8144 string characters out of 3164549
 61307 words of memory out of 3000000
 4071 multiletter control sequences out of 15000+200000
 5452 words of font info for 20 fonts, out of 3000000 for 9000
 1096 hyphenation exceptions out of 8191
 25i,5n,21p,416b,117s stack positions out of 5000i,500n,10000p,200000b,50000s

Output written on soquestion.pdf (2 pages).

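Answer 1

Since the file is created automatically, we can rely on consistent usage and take advantage of it. To that end, I

  • assume you never use \subsubsection*{..} or \subsubsection[.]{..}

  • assume you always have \subsubsection{..} ... \clearpage for every \subsubsection. That is, the text following a \subsubsection heading is delimited by \subsubsection{..} at the start and by \clearpage at the end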

\documentclass{article}
\usepackage{xcolor}

\let\oldsubsubsection\subsubsection
\makeatletter
\long\def\subsubsection#1#2\clearpage{%
  \begingroup
  \let\textcolor\@secondoftwo% Extract <text> from \textcolor{<color>}{<text>}
  \let\textsf\@firstofone% Ignore \textsf
  \let\textbf\@firstofone% Ignore \textbf
  %\let\par\space% If \par is a problem in the output
  % ... Ignore other commands
  \xdef\@@x{#2}% Expand the text that follows \subsubsection{<title>}, up to \clearpage
  \immediate\write\subsubsectiontextfile{\@@x}% Write that text to file
  \endgroup
  \oldsubsubsection{#1}#2\clearpage% Regular \subsubsection
}
\makeatother
\AtBeginDocument{%
  \newwrite\subsubsectiontextfile
  \immediate\openout\subsubsectiontextfile=\jobname.sst}
\AtEndDocument{\immediate\closeout\subsubsectiontextfile}

\begin{document}

\section{How to Find and Extract Text within Fixed Markers}

\noindent\rule{150pt}{1.7pt}  
\vspace{2pt}
\subsubsection{\textcolor{blue}{\textsf{R} Note: History and Current Status of \textsf{R}}}:
\textsf{R} is an open source software platform used to analyze data.  
Written originally in the early 1990's, it is an \textbf{object-oriented}, \textbf{interpreted} 
programming language blessed with a wealth of constantly-growing and improving "packages" and 
a vibrant user community.

\clearpage

\section{Each Note is at the bottom of a page.  All have the same start up to Note: They end with clearpage}

\noindent\rule{150pt}{1.7pt}  
\vspace{2pt}
\subsubsection{\textcolor{blue}{\textsf{R} Note: Different text}}:  
Some other text. **There are NO OTHER COMMANDS after the colon except \textsf{R} and all of it consists 
of ONE LINE OF TEXT.**

\clearpage

\end{document}

The above produces <jobname>.sst:

: R is an open source software platform used to analyze data. Written originally in the early 1990's, it is an object-oriented, interpreted programming language blessed with a wealth of constantly-growing and improving "packages" and a vibrant user community. \par 
: Some other text. **There are NO OTHER COMMANDS after the colon except R and all of it consists of ONE LINE OF TEXT.** \par 

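The part of the redefined \subsubsection between \begingroup and \endgroup removes any formatting that the text may contain. In our case, it reduces \textcolor{<colour>}{<text>} to \@secondoftwo, and \textsf{<text>} and \textbf{<text>} to \@firstofone (in both cases they simply return <text>). You can add more such no-ops here.

Once extracted, the text is immediately written to the file <jobname>.sst, since you are not interested in the page on which it appears. The file is simply opened \AtBeginDocument and closed \AtEndDocument.

I am not sure whether the inclusion of \par is a problem. You can add \let\par\space to convert \par into a regular space.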

Answer 2

If we take the exact file given in the original question (named, say, myfile.tex) and feed it to sed like this:

sed -n '/\\subsubsection.*Note:/,/\\clearpage/p' myfile.tex

you get the following output:

\subsubsection{\textcolor{blue}{\textsf{R} Note: History and Current Status of \textsf{R}}}:  \textsf{R} is an open source software platform used to analyze data.  Written originally in the early 1990's, it is an \textbf{object-oriented}, \textbf{interpreted} programming language blessed with a wealth of constantly-growing and improving ``packages'' and a vibrant user community.

\clearpage
\subsubsection{\textcolor{blue}{\textsf{R} Note: Different text}}:  Some other text.  **There are NO OTHER COMMANDS after the colon except \textsf{R} and all of it consists of ONE LINE OF TEXT.**

\clearpage

Now we want to remove the part up to Note: and the \clearpage lines, which can be done by piping the output into:

sed 's/\\subsubsection.*Note: //g;s/\\clearpage//g'

which gives:

 History and Current Status of \textsf{R}}}:  \textsf{R} is an open source software platform used to analyze data.  Written originally in the early 1990's, it is an \textbf{object-oriented}, \textbf{interpreted} programming language blessed with a wealth of constantly-growing and improving ``packages'' and a vibrant user community.


Different text}}:  Some other text.  **There are NO OTHER COMMANDS after the colon except \textsf{R} and all of it consists of ONE LINE OF TEXT.**

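Finally, we want to get rid of the remaining TeX markup. Piping the result through the detex command (part of TeX Live) does that.

Putting it all together looks like this: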

sed -n '/\\subsubsection.*Note:/,/\\clearpage/p' myfile.tex | sed 's/\\subsubsection.*Note: //g;s/\\clearpage//g' | detex

The result is:

History and Current Status of R:  R is an open source software platform used to analyze data.  Written originally in the early 1990's, it is an object-oriented, interpreted programming language blessed with a wealth of constantly-growing and improving ``packages'' and a vibrant user community.


Different text:  Some other text.  **There are NO OTHER COMMANDS after the colon except R and all of it consists of ONE LINE OF TEXT.**
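For readers without sed or detex at hand (for example on a stock Windows installation), the same three steps can be sketched in Python. This is only an illustration under the assumptions stated in the question: every note starts with \subsubsection{...Note: and ends with \clearpage, and the only markup inside is simple commands such as \textsf{..} and \textbf{..}. The names extract_notes and strip_simple_commands are hypothetical, and the command stripping is a crude stand-in for detex, not a reimplementation of it.

```python
import re

# Step 1 and 2: match from \subsubsection{ ... Note: up to \clearpage and
# keep only what lies in between (the named group "body").
NOTE_RE = re.compile(
    r"\\subsubsection\{.*?Note:\s*(?P<body>.*?)\\clearpage",
    re.DOTALL,
)

# A command with one brace-free argument, e.g. \textsf{R} or \textbf{interpreted}.
SIMPLE_CMD_RE = re.compile(r"\\[A-Za-z]+\{([^{}]*)\}")


def strip_simple_commands(text):
    # Step 3 (assumption: a rough substitute for detex).
    # Replace \cmd{arg} by arg until nothing changes, then drop stray braces.
    previous = None
    while previous != text:
        previous = text
        text = SIMPLE_CMD_RE.sub(r"\1", text)
    return text.replace("{", "").replace("}", "")


def extract_notes(tex_source):
    # Return one plain-text string per note, with whitespace collapsed.
    notes = []
    for match in NOTE_RE.finditer(tex_source):
        plain = strip_simple_commands(match.group("body"))
        notes.append(" ".join(plain.split()))
    return notes


if __name__ == "__main__":
    sample = (
        "\\subsubsection{\\textcolor{blue}{\\textsf{R} Note: Different text}}:"
        "  Some other text with \\textsf{R}.\n\n\\clearpage\n"
    )
    for note in extract_notes(sample):
        print(note)
```

To process a real file, read it with open("myfile.tex", encoding="utf-8").read() and write the returned strings to a text file, one per line. Unlike the sed pipeline, this collapses runs of spaces and newlines into single spaces.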

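Answer 3

Since this is Windows 8, PowerShell should be included, although at the moment I only have easy access to Windows 10 for testing. Here is something you can run at a PowerShell prompt, assuming that the strings to be extracted are single lines.

Given a source .tex file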

\documentclass[11pt]{article}     
\begin{document}
\section{How to Find and Extract Text within Fixed Markers}
\noindent\rule{150pt}{1.7pt}  
\vspace{2pt}
\subsubsection{\textcolor{blue}{\textsf{R} Note: History and Current Status of \textsf{R}}}:  \textsf{R} is an open source software platform used to analyze data.  Written originally in the early 1990's, it is an \textbf{object-oriented}, \textbf{interpreted} programming language blessed with a wealth of constantly-growing and improving "packages" and a vibrant user community.
\clearpage
\section{Each Note is at the bottom of a page.  All have the same start up to Note: They end with clearpage}
\noindent\rule{150pt}{1.7pt}  
\vspace{2pt}
\subsubsection{\textcolor{blue}{\textsf{R} Note: Different text}}:  Some other text.
\subsubsection{\textcolor{blue}{\textsf{R} Note: History and Current Status of \textsf{R}}}:  \textsf{R} is an open source software platform used to analyze data.  Written originally in the early 1990's, it is an \textbf{object-oriented}, \textbf{interpreted} programming language blessed with a wealth of constantly-growing and improving "packages" and a vibrant user community.
\clearpage
\end{document}

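you can run the following at a PowerShell prompt:

# The separator passed to -split is a colon followed by two spaces
sls '^\\subsubsection' 290814.tex `
| select -ExpandProperty Line `
| foreach { Write-Host ($_ -split ':  ')[1] }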

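and get the result

\textsf{R} is an open source software platform used to analyze data.  Written originally in the early 1990's, it is an \textbf{object-oriented}, \textbf{interpreted} programming language blessed with a wealth of constantly-growing and improving "packages" and a vibrant user community.
Some other text.
\textsf{R} is an open source software platform used to analyze data.  Written originally in the early 1990's, it is an \textbf{object-oriented}, \textbf{interpreted} programming language blessed with a wealth of constantly-growing and improving "packages" and a vibrant user community.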

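I am by no means a PowerShell expert, but a brief explanation is:

  • sls is an alias for select-string, which is roughly equivalent to Unix grep
  • select -ExpandProperty Line strips the filename and line-number prefix from each matching line
  • foreach iterates over each line of output, splits the line on the separator string passed to -split, and returns the second piece (index 1).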
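The same select, split and pick-the-second-piece idea can be sketched in Python for comparison. This is a hypothetical helper (notes_by_line is not from the answer), and it assumes, as in the sample file above, that each note sits on one line and that the heading is followed by a colon and two spaces.

```python
def notes_by_line(tex_source, separator=":  "):
    # Keep only lines that start with the subsubsection command,
    # split each one once at the separator, and keep what follows it.
    results = []
    for line in tex_source.splitlines():
        if line.startswith("\\subsubsection"):
            parts = line.split(separator, 1)
            if len(parts) == 2:
                results.append(parts[1].strip())
    return results


if __name__ == "__main__":
    sample = (
        "\\section{A}\n"
        "\\subsubsection{\\textcolor{blue}{\\textsf{R} Note: Different text}}:"
        "  Some other text.\n"
        "\\clearpage\n"
    )
    for note in notes_by_line(sample):
        print(note)
```

One deliberate difference: the split is limited to the first occurrence of the separator, so a note body that itself contains a colon followed by two spaces is kept whole rather than cut at that point.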
