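I have a document to which I want to apply pdftotext -layout and then parse the output text. The problem is that the text is set in two columns, and the pdftotext output sometimes interleaves the two columns (because the baselines of the two columns are not on a grid).
Is there a way to output the same PDF but with the two columns treated as separate pages (so that pdftotext -layout does not read both columns on the same page)? A simple example is below. I considered simply changing the margins so that the page becomes a single column in \textwidth, but the actual document involves floats, a switch to one column plus a longtable mid-document, page references, ibid. and so on. The intended use of pdftotext -layout is to check whether words are repeated on consecutive lines, so the text and layout need to stay the same.
For example,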
\documentclass{article}
\usepackage[twocolumn]{geometry}
\usepackage{lipsum}
\usepackage{parskip}
\begin{document}
\lipsum
\end{document}
produces:
Lorem ipsum dolor sit amet, consectetuer adip- vulputate metus eu enim. Vestibulum pellen-
iscing elit. Ut purus elit, vestibulum ut, plac- tesque felis eu massa.
erat ac, adipiscing vitae, felis. Curabitur dic-
Quisque ullamcorper placerat ipsum. Cras nibh.
tum gravida mauris. Nam arcu libero, nonummy
Morbi vel justo vitae lacus tincidunt ultrices.
eget, consectetuer id, vulputate a, magna. Donec
Lorem ipsum dolor sit amet, consectetuer adip-
vehicula augue eu neque. Pellentesque habitant
iscing elit. In hac habitasse platea dictumst.
morbi tristique senectus et netus et malesuada
Integer tempus convallis augue. Etiam facili-
fames ac turpis egestas. Mauris ut leo. Cras
sis. Nunc elementum fermentum wisi. Aenean
viverra metus rhoncus sem. Nulla et lectus vestibu-
placerat. Ut imperdiet, enim sed gravida sollic-
lum urna fringilla ultrices. Phasellus eu tellus sit
itudin, felis odio placerat quam, ac pulvinar elit
amet tortor gravida placerat. Integer sapien est,
purus eget enim. Nunc vitae tortor. Proin tem-
iaculis in, pretium quis, viverra ac, nunc. Prae-
pus nibh sit amet nisl. Vivamus quis tortor vitae
sent eget sem vel leo ultrices bibendum. Aenean
risus porta vehicula.
faucibus. Morbi dolor nulla, malesuada eu, pul-
vinar at, mollis ac, nulla. Curabitur auctor sem- Fusce mauris. Vestibulum luctus nibh at lectus.
per nulla. Donec varius orci eget risus. Duis Sed bibendum, nulla a faucibus semper, leo velit
nibh mi, congue eu, accumsan eleifend, sagittis ultricies tellus, ac venenatis arcu wisi vel nisl.
Answer 1
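Since this is only for extracting text, it is not very pretty, but you can simply arrange the columns one after the other instead of side by side.
This copies a fairly large macro (\@outputdblcol) but only changes an hbox and a vrule to a vbox and an hrule. The paper height is also doubled so that the two stacked columns fit on one page.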
\documentclass{article}
\usepackage[twocolumn]{geometry}
\paperheight=2\paperheight
\makeatletter
\def\@outputdblcol{%
  \if@firstcolumn
    % First column is complete: save it and record its marks.
    \global\@firstcolumnfalse
    \global\setbox\@leftcolumn\copy\@outputbox
    \splitmaxdepth\maxdimen
    \vbadness\maxdimen
    \setbox\@outputbox\vbox{\unvbox\@outputbox\unskip}%
    \setbox\@outputbox\vsplit\@outputbox to\maxdimen
    \toks@\expandafter{\topmark}%
    \xdef\@firstcoltopmark{\the\toks@}%
    \toks@\expandafter{\splitfirstmark}%
    \xdef\@firstcolfirstmark{\the\toks@}%
    \ifx\@firstcolfirstmark\@empty
      \global\let\@setmarks\relax
    \else
      \gdef\@setmarks{%
        \let\firstmark\@firstcolfirstmark
        \let\topmark\@firstcoltopmark}%
    \fi
  \else
    % Second column is complete: assemble the page from both columns.
    \global\@firstcolumntrue
    \setbox\@outputbox\vbox{%
      \hb@xt@\textwidth{%
        % Changed part: the columns are stacked in a \vbox, separated by
        % an \hrule, instead of being set side by side with a \vrule.
        \vbox{%
          \hb@xt@\columnwidth{\box\@leftcolumn \hss}%
          % \hfil
          % {\normalcolor\vrule \@width\columnseprule}%
          % \hfil
          \hrule
          \hb@xt@\columnwidth{\box\@outputbox \hss}}%
      }%
    }%
    \@combinedblfloats
    \@setmarks
    \@outputpage
    \begingroup
      \@dblfloatplacement
      \@startdblcolumn
      \@whilesw\if@fcolmade \fi{\@outputpage
        \@startdblcolumn}%
    \endgroup
  \fi}%
\makeatother
\usepackage{lipsum}
\usepackage{parskip}
\begin{document}
\lipsum
\end{document}
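The hbox-to-vbox change can also be seen in isolation. The following toy document is only a sketch and is not part of the answer's code: the box names \leftcol and \rightcol and the 4cm width are made up for illustration. It typesets the same two boxes first side by side with a \vrule between them, then stacked with an \hrule between them, which is the arrangement the redefined \@outputdblcol uses for whole columns.

```latex
\documentclass{article}
\begin{document}
% Two hypothetical "columns", stored in save boxes (names are illustrative).
\newsavebox\leftcol
\newsavebox\rightcol
\savebox\leftcol{\parbox[t]{4cm}{Left column text that wraps over a few lines.}}
\savebox\rightcol{\parbox[t]{4cm}{Right column text that wraps over a few lines.}}

% Side by side: an hbox with a vertical rule between the columns.
\noindent\hbox to \textwidth{\usebox\leftcol\hfil\vrule\hfil\usebox\rightcol}

% Stacked: a vbox with a horizontal rule between the columns.
\noindent\vbox{\hbox{\usebox\leftcol}\hrule\hbox{\usebox\rightcol}}
\end{document}
```

In the stacked form no two columns share a baseline, so a layout-preserving text extractor has nothing to interleave.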