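On Windows 8, I use RStudio with the R package knitr to write .Rnw scripts. My regular-expression skills are passable, but only within R.
My document is about 30 pages long, and each page has a different text section at the bottom. Every such section starts the same way, with \subsubsection{\textcolor{blue}{\textsf{R} Note: and ends with \clearpage.
How can I extract all of the text sections that follow "Note:" into a plain text file?
Here is my MWE: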
\documentclass[11pt]{article}
\begin{document}
\section{How to Find and Extract Text within Fixed Markers}
\noindent\rule{150pt}{1.7pt}
\vspace{2pt}
\subsubsection{\textcolor{blue}{\textsf{R} Note: History and Current Status of \textsf{R}}}: \textsf{R} is an open source software platform used to analyze data. Written originally in the early 1990's, it is an \textbf{object-oriented}, \textbf{interpreted} programming language blessed with a wealth of constantly-growing and improving "packages" and a vibrant user community.
\clearpage
\section{Each Note is at the bottom of a page. All have the same start up to Note: They end with clearpage}
\noindent\rule{150pt}{1.7pt}
\vspace{2pt}
\subsubsection{\textcolor{blue}{\textsf{R} Note: Different text}}: Some other text. **There are NO OTHER COMMANDS after the colon except \textsf{R} and all of it consists of ONE LINE OF TEXT.**
\clearpage
\end{document}
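Several questions here look like they could guide me, and this is probably a very simple problem, but I don't know how to do it.
Here is the log file: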
This is XeTeX, Version 3.14159265-2.6-0.99991 (MiKTeX 2.9 64-bit) (preloaded format=xelatex 2015.12.3) 3 FEB 2016 20:03
entering extended mode
**soquestion.tex
(soquestion.tex
LaTeX2e <2014/05/01>
Babel <3.9l> and hyphenation patterns for 69 languages loaded.
("C:\Program Files\MiKTeX 2.9\tex\latex\base\article.cls"
Document Class: article 2014/09/29 v1.4h Standard LaTeX document class
("C:\Program Files\MiKTeX 2.9\tex\latex\base\size11.clo"
File: size11.clo 2014/09/29 v1.4h Standard LaTeX file (size option)
Requested font "cmr10" at 10.95pt
)
\c@part=\count80
\c@section=\count81
\c@subsection=\count82
\c@subsubsection=\count83
\c@paragraph=\count84
\c@subparagraph=\count85
\c@figure=\count86
\c@table=\count87
\abovecaptionskip=\skip41
\belowcaptionskip=\skip42
\bibindent=\dimen102
)
("C:\Program Files\MiKTeX 2.9\tex\latex\graphics\graphicx.sty"
Package: graphicx 2014/10/28 v1.0g Enhanced LaTeX Graphics (DPC,SPQR)
("C:\Program Files\MiKTeX 2.9\tex\latex\graphics\keyval.sty"
Package: keyval 2014/10/28 v1.15 key=value parser (DPC)
\KV@toks@=\toks14
)
("C:\Program Files\MiKTeX 2.9\tex\latex\graphics\graphics.sty"
Package: graphics 2014/10/28 v1.0p Standard LaTeX Graphics (DPC,SPQR)
("C:\Program Files\MiKTeX 2.9\tex\latex\graphics\trig.sty"
Package: trig 1999/03/16 v1.09 sin cos tan (DPC)
)
("C:\Program Files\MiKTeX 2.9\tex\latex\00miktex\graphics.cfg"
File: graphics.cfg 2007/01/18 v1.5 graphics configuration of teTeX/TeXLive
)
Package graphics Info: Driver file: xetex.def on input line 91.
("C:\Program Files\MiKTeX 2.9\tex\xelatex\xetex-def\xetex.def"
File: xetex.def 2015/03/25 v4.04 LaTeX color/graphics driver for XeTeX (TeX Liv
e/RRM/JK)
))
\Gin@req@height=\dimen103
\Gin@req@width=\dimen104
)
("C:\Program Files\MiKTeX 2.9\tex\latex\graphics\color.sty"
Package: color 2014/10/28 v1.1a Standard LaTeX Color (DPC)
("C:\Program Files\MiKTeX 2.9\tex\latex\00miktex\color.cfg"
File: color.cfg 2007/01/18 v1.5 color configuration of teTeX/TeXLive
)
Package color Info: Driver file: xetex.def on input line 137.
) (framed.sty
Package: framed 2011/10/22 v 0.96: framed or shaded text with page breaks
\OuterFrameSep=\skip43
\fb@frw=\dimen105
\fb@frh=\dimen106
\FrameRule=\dimen107
\FrameSep=\dimen108
)
("C:\Program Files\MiKTeX 2.9\tex\latex\base\alltt.sty"
Package: alltt 1997/06/16 v2.0g defines alltt environment
)
("C:\Program Files\MiKTeX 2.9\tex\latex\upquote\upquote.sty"
Package: upquote 2012/04/19 v1.3 upright-quote and grave-accent glyphs in verba
tim
)
LaTeX Warning: Unused global option(s):
[table].
(soquestion.aux)
LaTeX Font Info: Checking defaults for OML/cmm/m/it on input line 52.
LaTeX Font Info: ... okay on input line 52.
LaTeX Font Info: Checking defaults for T1/cmr/m/n on input line 52.
LaTeX Font Info: ... okay on input line 52.
LaTeX Font Info: Checking defaults for OT1/cmr/m/n on input line 52.
LaTeX Font Info: ... okay on input line 52.
LaTeX Font Info: Checking defaults for OMS/cmsy/m/n on input line 52.
LaTeX Font Info: ... okay on input line 52.
LaTeX Font Info: Checking defaults for OMX/cmex/m/n on input line 52.
LaTeX Font Info: ... okay on input line 52.
LaTeX Font Info: Checking defaults for U/cmr/m/n on input line 52.
LaTeX Font Info: ... okay on input line 52.
Requested font "cmr12" at 14.4pt
Requested font "cmbx12" at 14.4pt
Requested font "cmbx10" at 10.95pt
Requested font "cmssbx10" at 10.95pt
Requested font "cmss10" at 10.95pt
[1
]
Overfull \hbox (21.13795pt too wide) in paragraph at lines 62--62
\OT1/cmr/bx/n/14.4 the same start up to Note: They end with clearpage
[]
[2
] (soquestion.aux) )
Here is how much of TeX's memory you used:
715 strings out of 428783
8144 string characters out of 3164549
61307 words of memory out of 3000000
4071 multiletter control sequences out of 15000+200000
5452 words of font info for 20 fonts, out of 3000000 for 9000
1096 hyphenation exceptions out of 8191
25i,5n,21p,416b,117s stack positions out of 5000i,500n,10000p,200000b,50000s
Output written on soquestion.pdf (2 pages).
Answer 1
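Since the file creation is automated, we can rely on consistent usage and use that to our advantage. For this, I assume you never use \subsubsection*{..} or \subsubsection[.]{..}.
I also assume you always have
\subsubsection{..} ... \clearpage
for each and every \subsubsection. That is, the text following the \subsubsection heading is delimited at the start by \subsubsection{..} and at the end by \clearpage.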
\documentclass{article}
\usepackage{xcolor}
\let\oldsubsubsection\subsubsection
\makeatletter
\long\def\subsubsection#1#2\clearpage{%
\begingroup
\let\textcolor\@secondoftwo% Extract <text> from \textcolor{<color>}{<text>}
\let\textsf\@firstofone% Ignore \textsf
\let\textbf\@firstofone% Ignore \textbf
%\let\par\space% If \par is a problem in the output
% ... Ignore other commands
\xdef\@@x{#2}% Expand the text between \subsubsection{<title>} and \clearpage
\immediate\write\subsubsectiontextfile{\@@x}% Write that text to file
\endgroup
\oldsubsubsection{#1}#2\clearpage% Regular \subsubsection
}
\makeatother
\AtBeginDocument{%
\newwrite\subsubsectiontextfile
\immediate\openout\subsubsectiontextfile=\jobname.sst}
\AtEndDocument{\immediate\closeout\subsubsectiontextfile}
\begin{document}
\section{How to Find and Extract Text within Fixed Markers}
\noindent\rule{150pt}{1.7pt}
\vspace{2pt}
\subsubsection{\textcolor{blue}{\textsf{R} Note: History and Current Status of \textsf{R}}}:
\textsf{R} is an open source software platform used to analyze data.
Written originally in the early 1990's, it is an \textbf{object-oriented}, \textbf{interpreted}
programming language blessed with a wealth of constantly-growing and improving "packages" and
a vibrant user community.
\clearpage
\section{Each Note is at the bottom of a page. All have the same start up to Note: They end with clearpage}
\noindent\rule{150pt}{1.7pt}
\vspace{2pt}
\subsubsection{\textcolor{blue}{\textsf{R} Note: Different text}}:
Some other text. **There are NO OTHER COMMANDS after the colon except \textsf{R} and all of it consists
of ONE LINE OF TEXT.**
\clearpage
\end{document}
The above produces <jobname>.sst containing:
: R is an open source software platform used to analyze data. Written originally in the early 1990's, it is an object-oriented, interpreted programming language blessed with a wealth of constantly-growing and improving "packages" and a vibrant user community. \par
: Some other text. **There are NO OTHER COMMANDS after the colon except R and all of it consists of ONE LINE OF TEXT.** \par
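The part between \begingroup ... \endgroup is used to remove any formatting that the text might contain. In our case, it reduces \textcolor{<colour>}{<text>} to \@secondoftwo, and \textsf{<text>} and \textbf{<text>} to \@firstofone (in both cases they return only <text>). You can add more such no-ops here.
Once extracted, we immediately write the text to the file <jobname>.sst. It is enough to open this file \AtBeginDocument and close it \AtEndDocument, since you are not interested in which page the text lands on.
I am not sure whether the inclusion of \par is a problem. You can add \let\par\space to convert \par into a regular space.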
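To see what those no-op redefinitions do to the text, here is a minimal sketch in Python that mimics them outside of TeX. It is only an illustration under the assumption that the arguments contain no unbalanced braces; the function name strip_formatting is made up for this example and is not part of the answer's code.

```python
import re

# \textsf{<text>} and \textbf{<text>} behave like \@firstofone: keep <text>.
ONE_ARG = re.compile(r"\\(?:textsf|textbf)\{([^{}]*)\}")
# \textcolor{<colour>}{<text>} behaves like \@secondoftwo: keep <text>.
TEXTCOLOR = re.compile(r"\\textcolor\{[^{}]*\}\{([^{}]*)\}")


def strip_formatting(text):
    """Repeatedly unwrap the innermost commands until nothing changes."""
    while True:
        stripped = TEXTCOLOR.sub(r"\1", ONE_ARG.sub(r"\1", text))
        if stripped == text:
            return stripped
        text = stripped


if __name__ == "__main__":
    print(strip_formatting(r"\textsf{R} is an \textbf{object-oriented} language"))
```

Working from the innermost braces outward is what lets the nested \textcolor{blue}{\textsf{R} ...} case unwrap in two passes.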
Answer 2
If we take the exact file given in the original question (named, say, myfile.tex) and feed it to sed like this:
sed -n '/\\subsubsection.*Note:/,/\\clearpage/p' myfile.tex
you will get the following output:
\subsubsection{\textcolor{blue}{\textsf{R} Note: History and Current Status of \textsf{R}}}: \textsf{R} is an open source software platform used to analyze data. Written originally in the early 1990's, it is an \textbf{object-oriented}, \textbf{interpreted} programming language blessed with a wealth of constantly-growing and improving ``packages'' and a vibrant user community.
\clearpage
\subsubsection{\textcolor{blue}{\textsf{R} Note: Different text}}: Some other text. **There are NO OTHER COMMANDS after the colon except \textsf{R} and all of it consists of ONE LINE OF TEXT.**
\clearpage
Now we want to remove the part up to Note: and the \clearpage part, which can be done by piping the output to:
sed 's/\\subsubsection.*Note: //g;s/\\clearpage//g'
which gives:
History and Current Status of \textsf{R}}}: \textsf{R} is an open source software platform used to analyze data. Written originally in the early 1990's, it is an \textbf{object-oriented}, \textbf{interpreted} programming language blessed with a wealth of constantly-growing and improving ``packages'' and a vibrant user community.
Different text}}: Some other text. **There are NO OTHER COMMANDS after the colon except \textsf{R} and all of it consists of ONE LINE OF TEXT.**
Finally, we want to get rid of the remaining TeX markup. Piping the result to the command detex (part of TeX Live) will do that.
Putting it all together looks like this:
sed -n '/\\subsubsection.*Note:/,/\\clearpage/p' myfile.tex | sed 's/\\subsubsection.*Note: //g;s/\\clearpage//g' | detex
The result is:
History and Current Status of R: R is an open source software platform used to analyze data. Written originally in the early 1990's, it is an object-oriented, interpreted programming language blessed with a wealth of constantly-growing and improving ``packages'' and a vibrant user community.
Different text: Some other text. **There are NO OTHER COMMANDS after the colon except R and all of it consists of ONE LINE OF TEXT.**
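For anyone on Windows without sed at hand, the first two steps (select the range, then strip the markers) can be sketched in Python. This is a rough equivalent under the same assumptions as the sed commands, not a replacement for detex: TeX markup inside the notes is left in place, and the name extract_notes and the sample lines are made up for illustration.

```python
import re

START = re.compile(r"\\subsubsection.*Note:")          # first line of the range
START_PREFIX = re.compile(r"\\subsubsection.*Note: ")  # text removed from it
END = re.compile(r"\\clearpage")                       # last line of the range


def extract_notes(lines):
    """Return the non-empty lines between the markers, with the markers removed."""
    notes = []
    inside = False
    for line in lines:
        if not inside:
            if not START.search(line):
                continue
            inside = True   # like sed, look for the end marker from the next line on
        elif END.search(line):
            inside = False  # the end line itself is still processed below
        cleaned = END.sub("", START_PREFIX.sub("", line)).strip()
        if cleaned:
            notes.append(cleaned)
    return notes


if __name__ == "__main__":
    sample = [
        r"\section{Some section}",
        r"\subsubsection{\textcolor{blue}{\textsf{R} Note: Different text}}: Some other text.",
        r"\clearpage",
    ]
    for note in extract_notes(sample):
        print(note)
```

To get a text file, the lines of myfile.tex could be read with pathlib.Path("myfile.tex").read_text().splitlines() and the joined result written with Path.write_text; the output file name is up to you.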
Answer 3
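Since this is Windows 8, PowerShell should be included, although at the moment I only have easy access to Windows 10 for testing. Here is something you can run at a PowerShell prompt, assuming that each string to extract sits on a single line.
Given a source .tex file: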
\documentclass[11pt]{article}
\begin{document}
\section{How to Find and Extract Text within Fixed Markers}
\noindent\rule{150pt}{1.7pt}
\vspace{2pt}
\subsubsection{\textcolor{blue}{\textsf{R} Note: History and Current Status of \textsf{R}}}: \textsf{R} is an open source software platform used to analyze data. Written originally in the early 1990's, it is an \textbf{object-oriented}, \textbf{interpreted} programming language blessed with a wealth of constantly-growing and improving "packages" and a vibrant user community.
\clearpage
\section{Each Note is at the bottom of a page. All have the same start up to Note: They end with clearpage}
\noindent\rule{150pt}{1.7pt}
\vspace{2pt}
\subsubsection{\textcolor{blue}{\textsf{R} Note: Different text}}: Some other text.
\subsubsection{\textcolor{blue}{\textsf{R} Note: History and Current Status of \textsf{R}}}: \textsf{R} is an open source software platform used to analyze data. Written originally in the early 1990's, it is an \textbf{object-oriented}, \textbf{interpreted} programming language blessed with a wealth of constantly-growing and improving "packages" and a vibrant user community.
\clearpage
\end{document}
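You can run the following at a PowerShell prompt:
sls '^\\subsubsection' 290814.tex `
| select -ExpandProperty Line `
| foreach { Write-Host ($_ -split '\}: ')[1] }
and get the result
\textsf{R} is an open source software platform used to analyze data. Written originally in the early 1990's, it is an \textbf{object-oriented}, \textbf{interpreted} programming language blessed with a wealth of constantly-growing and improving "packages" and a vibrant user community.
Some other text.
\textsf{R} is an open source software platform used to analyze data. Written originally in the early 1990's, it is an \textbf{object-oriented}, \textbf{interpreted} programming language blessed with a wealth of constantly-growing and improving "packages" and a vibrant user community.
I am by no means a PowerShell expert, but the short explanation is:
- sls is an alias for Select-String, roughly equivalent to Unix's grep
- the select line strips the filename and line-number prefix from the result lines
- foreach iterates over each line of output, splits the line on the sequence }: that closes the heading (a bare ': ' would also match inside "Note: "), and returns the second piece.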
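The same select-then-split idea can be sketched in Python, in case PowerShell is not an option. It relies on the same assumption that each note sits on one line with its heading; the name note_bodies is hypothetical, and str.partition is used here in place of -split so that only the first "}: " counts.

```python
def note_bodies(lines):
    """Keep lines starting with \\subsubsection and return the text after the first '}: '."""
    bodies = []
    for line in lines:
        if not line.startswith("\\subsubsection"):
            continue  # plays the role of sls '^\\subsubsection'
        _heading, separator, body = line.partition("}: ")
        if separator:  # skip headings that have no note text after them
            bodies.append(body.strip())
    return bodies


if __name__ == "__main__":
    sample = [
        r"\section{Some section}",
        r"\subsubsection{\textcolor{blue}{\textsf{R} Note: Different text}}: Some other text.",
        r"\clearpage",
    ]
    print("\n".join(note_bodies(sample)))
```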