我想将一个带有一些彩色文本和图像的pdf转换为另一个只有黑白的pdf,以减少其尺寸。此外,我想将文本保留为文本,而不转换图片中的页面元素。我尝试了以下命令:
convert -density 150 -threshold 50% input.pdf output.pdf
在另一个问题中发现,一条链接,但它做了我不想要的事情:输出中的文本被转换成糟糕的图像,并且不再可选。我尝试使用 Ghostscript:
gs -sOutputFile=output.pdf \
-q -dNOPAUSE -dBATCH -dSAFER \
-sDEVICE=pdfwrite \
-dCompatibilityLevel=1.3 \
-dPDFSETTINGS=/screen \
-dEmbedAllFonts=true \
-dSubsetFonts=true \
-sColorConversionStrategy=/Mono \
-sColorConversionStrategyForImages=/Mono \
-sProcessColorModel=/DeviceGray \
$1
但它给了我以下错误消息:
./script.sh: 19: ./script.sh: output.pdf: not found
还有其他方法来创建该文件吗?
答案1
GS 示例
gs
您上面运行的命令有一个尾随,$1
通常用于将命令行参数传递到脚本中。所以我不确定您实际尝试过什么,但我猜测您尝试将该命令放入脚本中script.sh
:
#!/bin/bash
gs -sOutputFile=output.pdf \
-q -dNOPAUSE -dBATCH -dSAFER \
-sDEVICE=pdfwrite \
-dCompatibilityLevel=1.3 \
-dPDFSETTINGS=/screen \
-dEmbedAllFonts=true \
-dSubsetFonts=true \
-sColorConversionStrategy=/Mono \
-sColorConversionStrategyForImages=/Mono \
-sProcessColorModel=/DeviceGray \
$1
并像这样运行它:
$ ./script.sh: 19: ./script.sh: output.pdf: not found
不确定如何设置此脚本,但它需要可执行。
$ chmod +x script.sh
不过这个脚本肯定有什么不对劲的地方。当我尝试它时,我得到了这个错误:
不可恢复的错误:.putdeviceprops 中的 rangecheck
替代
我会使用 SU 问题中的这个脚本来代替该脚本。
#!/bin/bash
gs \
-sOutputFile=output.pdf \
-sDEVICE=pdfwrite \
-sColorConversionStrategy=Gray \
-dProcessColorModel=/DeviceGray \
-dCompatibilityLevel=1.4 \
-dNOPAUSE \
-dBATCH \
$1
然后像这样运行它:
$ ./script.bash LeaseContract.pdf
GPL Ghostscript 8.71 (2010-02-10)
Copyright (C) 2010 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 2.
Page 1
Page 2
答案2
我找到了一个脚本这里可以做到这一点。它要求gs
你似乎拥有但也pdftk
。您没有提到您的发行版,但在基于 Debian 的系统上,您应该能够使用以下命令安装它
sudo apt-get install pdftk
您可以找到它的 RPM这里。
安装后pdftk
,将脚本另存为graypdf.sh
并运行,如下所示:
./greypdf.sh input.pdf
它将创建一个名为input-gray.pdf
.我在此处包含整个脚本以避免链接失效:
# convert pdf to grayscale, preserving metadata
# "AFAIK graphicx has no feature for manipulating colorspaces. " http://groups.google.com/group/latexusersgroup/browse_thread/thread/5ebbc3ff9978af05
# "> Is there an easy (or just standard) way with pdflatex to do a > conversion from color to grayscale when a PDF file is generated? No." ... "If you want to convert a multipage document then you better have pdftops from the xpdf suite installed because Ghostscript's pdf to ps doesn't produce nice Postscript." http://osdir.com/ml/tex.pdftex/2008-05/msg00006.html
# "Converting a color EPS to grayscale" - http://en.wikibooks.org/wiki/LaTeX/Importing_Graphics
# "\usepackage[monochrome]{color} .. I don't know of a neat automatic conversion to monochrome (there might be such a thing) although there was something in Tugboat a while back about mapping colors on the fly. I would probably make monochrome versions of the pictures, and name them consistently. Then conditionally load each one" http://newsgroups.derkeiler.com/Archive/Comp/comp.text.tex/2005-08/msg01864.html
# "Here comes optional.sty. By adding \usepackage{optional} ... \opt{color}{\includegraphics[width=0.4\textwidth]{intro/benzoCompounds_color}} \opt{grayscale}{\includegraphics[width=0.4\textwidth]{intro/benzoCompounds}} " - http://chem-bla-ics.blogspot.com/2008/01/my-phd-thesis-in-color-and-grayscale.html
# with gs:
# http://handyfloss.net/2008.09/making-a-pdf-grayscale-with-ghostscript/
# note - this strips metadata! so:
# http://etutorials.org/Linux+systems/pdf+hacks/Chapter+5.+Manipulating+PDF+Files/Hack+64+Get+and+Set+PDF+Metadata/
COLORFILENAME=$1
OVERWRITE=$2
FNAME=${COLORFILENAME%.pdf}
# NOTE: pdftk does not work with logical page numbers / pagination;
# gs kills it as well;
# so check for existence of 'pdfmarks' file in calling dir;
# if there, use it to correct gs logical pagination
# for example, see
# http://askubuntu.com/questions/32048/renumber-pages-of-a-pdf/65894#65894
PDFMARKS=
if [ -e pdfmarks ] ; then
PDFMARKS="pdfmarks"
echo "$PDFMARKS exists, using..."
# convert to gray pdf - this strips metadata!
gs -sOutputFile=$FNAME-gs-gray.pdf -sDEVICE=pdfwrite \
-sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray \
-dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH "$COLORFILENAME" "$PDFMARKS"
else # not really needed ?!
gs -sOutputFile=$FNAME-gs-gray.pdf -sDEVICE=pdfwrite \
-sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray \
-dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH "$COLORFILENAME"
fi
# dump metadata from original color pdf
## pdftk $COLORFILENAME dump_data output $FNAME.data.txt
# also: pdfinfo -meta $COLORFILENAME
# grep to avoid BookmarkTitle/Level/PageNumber:
pdftk $COLORFILENAME dump_data output | grep 'Info\|Pdf' > $FNAME.data.txt
# "pdftk can take a plain-text file of these same key/value pairs and update a PDF's Info dictionary to match. Currently, it does not update the PDF's XMP stream."
pdftk $FNAME-gs-gray.pdf update_info $FNAME.data.txt output $FNAME-gray.pdf
# (http://wiki.creativecommons.org/XMP_Implementations : Exempi ... allows reading/writing XMP metadata for various file formats, including PDF ... )
# clean up
rm $FNAME-gs-gray.pdf
rm $FNAME.data.txt
if [ "$OVERWRITE" == "y" ] ; then
echo "Overwriting $COLORFILENAME..."
mv $FNAME-gray.pdf $COLORFILENAME
fi
# BUT NOTE:
# Mixing TEX & PostScript : The GEX Model - http://www.tug.org/TUGboat/Articles/tb21-3/tb68kost.pdf
# VTEX is a (commercial) extended version of TEX, sold by MicroPress, Inc. Free versions of VTEX have recently been made available, that work under OS/2 and Linux. This paper describes GEX, a fast fully-integrated PostScript interpreter which functions as part of the VTEX code-generator. Unless specified otherwise, this article describes the functionality in the free- ware version of the VTEX compiler, as available on CTAN sites in systems/vtex.
# GEX is a graphics counterpart to TEX. .. Since GEX may exercise subtle influence on TEX (load fonts, or change TEX registers), GEX is op- tional in VTEX implementations: the default oper- ation of the program is with GEX off; it is enabled by a command-line switch.
# \includegraphics[width=1.3in, colorspace=grayscale 256]{macaw.jpg}
# http://mail.tug.org/texlive/Contents/live/texmf-dist/doc/generic/FAQ-en/html/FAQ-TeXsystems.html
# A free version of the commercial VTeX extended TeX system is available for use under Linux, which among other things specialises in direct production of PDF from (La)TeX input. Sadly, it���s no longer supported, and the ready-built images are made for use with a rather ancient Linux kernel.
# NOTE: another way to capture metadata; if converting via ghostscript:
# http://compgroups.net/comp.text.pdf/How-to-specify-metadata-using-Ghostscript
# first:
# grep -a 'Keywo' orig.pdf
# /Author(xxx)/Title(ttt)/Subject()/Creator(LaTeX)/Producer(pdfTeX-1.40.12)/Keywords(kkkk)
# then - copy this data in a file prologue.ini:
#/pdfmark where {pop} {userdict /pdfmark /cleartomark load put} ifelse
#[/Author(xxx)
#/Title(ttt)
#/Subject()
#/Creator(LaTeX with hyperref package + gs w/ prologue)
#/Producer(pdfTeX-1.40.12)
#/Keywords(kkkk)
#/DOCINFO pdfmark
#
# finally, call gs on the orig file,
# asking to process pdfmarks in prologue.ini:
# gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
# -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -dDOPDFMARKS \
# -sOutputFile=out.pdf in.pdf prologue.ini
# then the metadata will be in output too (which is stripped otherwise;
# note bookmarks are preserved, however).
答案3
我还有一些扫描的彩色 pdf 和灰度 pdf,我想将其转换为黑白。我尝试gs
与此处列出的代码,图像质量良好,pdf 文本仍然存在。但是,该 gs 代码仅转换为灰度(如问题中所要求的)并且文件大小仍然很大。convert
直接使用时效果很差。
我想要具有良好图像质量和较小文件大小的黑白 pdf。我本想尝试 terdon 的解决方案,但我无法pdftk
使用 yum 进入 centOS 7(在撰写本文时)。
我的解决方案用于gs
从pdf中提取灰度bmp文件,convert
将这些bmp阈值设置为bw并将它们保存为tiff文件,然后图像2pdf压缩 tiff 图像并将它们全部合并到一个 pdf 中。
我尝试直接从 pdf 转到 tiff,但质量不一样,所以我将每个页面保存为 bmp。对于一页 pdf 文件,convert
从 bmp 到 pdf 的效果非常好。例子:
gs -sDEVICE=bmpgray -dNOPAUSE -dBATCH -r300x300 \
-sOutputFile=./pdf_image.bmp ./input.pdf
convert ./pdf_image.bmp -threshold 40% -compress zip ./bw_out.pdf
对于多页,gs
可以将多个 pdf 文件合并为一个,但img2pdf
生成的文件大小比 gs 小。 tiff 文件必须解压缩作为 img2pdf 的输入。请记住,对于大量页面,中间 bmp 和 tiff 文件的大小往往很大。pdftk
或者joinpdf
如果他们可以合并压缩的 pdf 文件,那就更好了convert
。
我想有一个更优雅的解决方案。然而,我的方法产生的结果具有非常好的图像质量和更小的文件大小。要将文本恢复到黑白 pdf 中,请再次运行 OCR。
我的 shell 脚本使用 gs、convert 和 img2pdf。根据需要更改开头列出的参数(页数、扫描 dpi、阈值 % 等),然后运行chmod +x ./pdf2bw.sh
。这是完整的脚本(pdf2bw.sh):
#!/bin/bash
num_pages=12
dpi_res=300
input_pdf_name=color_or_grayscale.pdf
bw_threshold=40%
output_pdf_name=out_bw.pdf
#-------------------------------------------------------------------------
gs -sDEVICE=bmpgray -dNOPAUSE -dBATCH -q -r$dpi_res \
-sOutputFile=./%d.bmp ./$input_pdf_name
#-------------------------------------------------------------------------
for file_num in `seq 1 $num_pages`
do
convert ./$file_num.bmp -threshold $bw_threshold \
./$file_num.tif
done
#-------------------------------------------------------------------------
input_files=""
for file_num in `seq 1 $num_pages`
do
input_files+="./$file_num.tif "
done
img2pdf -o ./$output_pdf_name --dpi $dpi_res $input_files
#-------------------------------------------------------------------------
# clean up bmp and tif files used in conversion
for file_num in `seq 1 $num_pages`
do
rm ./$file_num.bmp
rm ./$file_num.tif
done
答案4
RHEL6 和 RHEL5(两者都以 8.70 为 Ghostscript 的基线)无法使用上面给出的命令的形式。假设脚本或函数期望 PDF 文件作为第一个参数“$1”,则以下内容应该更可移植:
gs \
-sOutputFile="grey_$1" \
-sDEVICE=pdfwrite \
-sColorConversionStrategy=Mono \
-sColorConversionStrategyForImages=/Mono \
-dProcessColorModel=/DeviceGray \
-dCompatibilityLevel=1.3 \
-dNOPAUSE -dBATCH \
"$1"
输出文件将以“grey_”为前缀。
RHEL6和5都可以使用兼容性级别=1.4这要快得多,但我的目标是便携性。