Does pdfcrop generate larger files?

I am using pdfcrop to remove the margins from a 400-page PDF that is 10 MB in size. The margins are removed correctly, but the resulting PDF is 51 MB. Any suggestions?

Answer 1

Here is my improved version of pdfcrop.

The default operation removes the white margins from the input PDF, optionally leaving a user-defined extra margin (option -m ...).

The alternative operation trims the page edges by user-defined amounts (option -t ...).

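pdfcrop.sh uses gs (Ghostscript) to determine the tight bounding box of each page, pdftk to uncompress/compress the PDF file and to obtain the page order (which need not be linear), and perl to add to each PDF page an individual /CropBox entry that represents the newly found tight bounding box.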

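Unlike the original pdfcrop, the bash script below preserves the interactive parts of the PDF (links, annotations, etc.). The output file size is about the same as that of the input.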
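To illustrate the first step, here is a minimal Python sketch of how the per-page boxes can be read from the output of Ghostscript's bbox device and turned into /CropBox entries. The sample text and its numbers are made up, and the function names `parse_bboxes` and `crop_box_entry` are hypothetical. Unlike the script below, the sketch does not shift the box by the MediaBox origin.

```python
# Made-up sample of what `gs -sDEVICE=bbox` prints: one pair of lines per page.
SAMPLE_GS_OUTPUT = """\
%%BoundingBox: 71 100 540 720
%%HiResBoundingBox: 71.279998 100.439997 539.819984 719.999978
%%BoundingBox: 72 90 541 730
%%HiResBoundingBox: 72.000000 90.500000 540.500000 729.900000
"""


def parse_bboxes(gs_output, hires=False):
    """Return one [llx, lly, urx, ury] list per page, in page order."""
    tag = "%%HiResBoundingBox:" if hires else "%%BoundingBox:"
    boxes = []
    for line in gs_output.splitlines():
        if line.startswith(tag):
            boxes.append([float(v) for v in line[len(tag):].split()])
    return boxes


def crop_box_entry(box):
    """Format a box as a /CropBox entry for a page dictionary."""
    return "/CropBox [" + " ".join(f"{v:g}" for v in box) + "]"


for number, box in enumerate(parse_bboxes(SAMPLE_GS_OUTPUT), start=1):
    print(number, crop_box_entry(box))
```

Because only a short /CropBox line is added per page and the page content itself is left alone, this approach should not inflate the file the way re-rendering every page can.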

Update: added the -two option for two-sided page layouts.

Usage examples:

#getting help
pdfcrop.sh -help

#default operation
pdfcrop.sh orig.pdf cropped.pdf
pdfcrop.sh -m 10 orig.pdf cropped.pdf
pdfcrop.sh -hires orig.pdf cropped.pdf

#trimming pages
pdfcrop.sh -t "10 20 30 40" orig.pdf trimmed.pdf
#same for two-sided layout
pdfcrop.sh -t "10 20 30 40" -two orig.pdf trimmed.pdf

Contents of pdfcrop.sh:

#!/bin/bash

function usage () {
  echo "Usage: `basename $0` [Options] <input.pdf> [<output.pdf>]"
  echo
  echo " * Removes white margins from every page in the file. (Default operation)"
  echo " * Trims page edges by given amounts. (Alternative operation)"
  echo
  echo "If only <input.pdf> is given, it is overwritten with the cropped output."
  echo
  echo "Options:"
  echo
  echo " -m \"<left> [<bottom> [<right> <top>]]\""
  echo "    adds extra margins in default operation mode. Unit is bp. A single number"
  echo "    is used for all margins, two numbers \"<left> <bottom>\" are applied to the"
  echo "    right and top margins alike."
  echo
  echo " -t \"<left> [<bottom> [<right> <top>]]\""
  echo "    trims outer page edges by the given amounts. Unit is bp. A single number"
  echo "    is used for all trims, two numbers \"<left> <bottom>\" are applied to the"
  echo "    right and top trims alike."
  echo
  echo " -two"
  echo "    to be used for documents with two-sided page layout; the meaning of <left>"
  echo "    and <right> changes to <inner> and <outer> for options -m and -t"
  echo
  echo " -hires"
  echo "    %%HiResBoundingBox is used in default operation mode."
  echo
  echo " -help"
  echo "    prints this message."
}

c=0
mar=(0 0 0 0); tri=(0 0 0 0)
bbtype=BoundingBox
two=0

while getopts m:t:h: opt
do
  case $opt
  in
    m)
    eval mar=($OPTARG)
    [[ -z "${mar[1]}" ]] && mar[1]=${mar[0]}
    [[ -z "${mar[2]}" || -z "${mar[3]}" ]] && mar[2]=${mar[0]} && mar[3]=${mar[1]}
    c=0
    ;;
    t)
    if [[ "$OPTARG" == "wo" ]]
    then
      two=1
    else
      eval tri=($OPTARG)
      [[ -z "${tri[1]}" ]] && tri[1]=${tri[0]}
      [[ -z "${tri[2]}" || -z "${tri[3]}" ]] && tri[2]=${tri[0]} && tri[3]=${tri[1]}
      c=1
    fi
    ;;
    h)
    if [[ "$OPTARG" == "ires" ]]
    then
      bbtype=HiResBoundingBox
    else
      usage 1>&2; exit 0
    fi
    ;;
    \?)
    usage 1>&2; exit 1
    ;;
  esac
done
shift $((OPTIND-1))

[[ -z "$1" ]] && echo "`basename $0`: missing filename" 1>&2 && usage 1>&2 && exit 1
input=$1;output=$1;shift;
[[ -n "$1" ]] && output=$1 && shift;

(
    [[ "$c" -eq 0 ]] && gs -dNOPAUSE -q -dBATCH -sDEVICE=bbox "$input" 2>&1 | grep "%%$bbtype"
    pdftk "$input" output - uncompress
) | perl -w -n -s -e '
  BEGIN {@m=split /\s+/, $mar; @t=split /\s+/, $tri; @mb=(); $p=-1;}
  sub insCropBox {
    if($c){
      if($two && $p%2) {
        $mb[0]+=$t[2];$mb[1]+=$t[1];$mb[2]-=$t[0];$mb[3]-=$t[3];
      }
      else {
        $mb[0]+=$t[0];$mb[1]+=$t[1];$mb[2]-=$t[2];$mb[3]-=$t[3];
      }
      print "/CropBox [", join(" ", @mb), "]\n";
    } else {
      @bb=split /\s+/, $bbox[$p];
      if($two && $p%2) {
        $bb[0]+=$mb[0];$bb[1]+=$mb[1];$bb[2]+=$mb[0];$bb[3]+=$mb[1];
        $bb[0]-=$m[2];$bb[1]-=$m[1];$bb[2]+=$m[0];$bb[3]+=$m[3];
      }
      else {
        $bb[0]+=$mb[0];$bb[1]+=$mb[1];$bb[2]+=$mb[0];$bb[3]+=$mb[1];
        $bb[0]-=$m[0];$bb[1]-=$m[1];$bb[2]+=$m[2];$bb[3]+=$m[3];
      }
      print "/CropBox [", join(" ", @bb), "]\n";
    }
  }
  if (/BoundingBox:\s+([\d\.\s]+\d)/) { push @bbox, $1; next;}
  elsif (/\/CropBox\s+\[([\d\.\s]+\d)\]/) {next;}
  elsif (/\/MediaBox\s+\[([\d\.\s]+\d)\]/) {
    @mb=split /\s+/, $1; next if($p<0);
    insCropBox; @mb=(); $p=-1;
  }
  elsif (/pdftk_PageNum\s+(\d+)/) {
    $p=$1-1; next unless(@mb);
    insCropBox; @mb=(); $p=-1;
  }
  print;
' -- -mar="${mar[*]}" -tri="${tri[*]}" -c=$c -two=$two | pdftk - output "$output" compress
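The margin handling in the script is compact, so here is a rough Python sketch of the same arithmetic for the default (margin) mode. The function names `expand` and `crop_box` are hypothetical, and the page index is zero-based as in the Perl code, so odd indices are even-numbered pages.

```python
def expand(spec):
    """Expand "a", "a b" or "a b c d" into [left, bottom, right, top]."""
    values = [float(x) for x in spec.split()]
    if not values:
        raise ValueError("empty margin specification")
    if len(values) == 1:
        return values * 4
    if len(values) < 4:
        # Two (or three) numbers: reuse left/bottom for right/top.
        return [values[0], values[1], values[0], values[1]]
    return values[:4]


def crop_box(bbox, media_box, margins, page_index=0, two_sided=False):
    """Grow the tight bounding box by the margins, relative to the MediaBox origin."""
    left, bottom, right, top = margins
    if two_sided and page_index % 2:
        # With -two, <left>/<right> mean <inner>/<outer>, so swap on these pages.
        left, right = right, left
    ox, oy = media_box[0], media_box[1]
    return [
        bbox[0] + ox - left,
        bbox[1] + oy - bottom,
        bbox[2] + ox + right,
        bbox[3] + oy + top,
    ]


print(crop_box([71, 100, 540, 720], [0, 0, 612, 792], expand("10 20")))
```

The trim mode (-t) works the other way around: it starts from the MediaBox and moves each edge inward by the given amount.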

Answer 2

I use the Python script found here: http://www.mobileread.com/forums/showthread.php?t=25565 It has the following features:

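  • The output has a reasonable size, as you asked
  • It supports absolute cropping (for cases where the automatically computed bounding box is useless, such as pages with a horizontal rule in the header or footer)
  • It is very fast: it goes through 200 pages in less than a second!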

Of course, you need to install pyPdf beforehand. Since the link may go dead, I am pasting the source code here:

#! /usr/bin/python

import getopt, sys
from pyPdf import PdfFileWriter, PdfFileReader

def usage ():
    print """sjvr767\'s PDF Cropping Script.
Example:
my_pdf_crop.py -s -p 0.5 -i input.pdf -o output.pdf
my_pdf_crop.py --skip --percent 0.5 --input input.pdf --output output.pdf
\n
REQUIRED OPTIONS:
-p\t--percent
The factor by which to crop. Must be positive and less than or equal to 1.

-i\t--input
The path to the file to be cropped.
\n
OPTIONAL:
-s\t--skip
Skip the first page. Output file will not contain the first page of the input file.

-o\t--output
Specify the name and path of the output file. If none specified, the script appends \'cropped\' to the file name.

-m\t--margin
Specify additional absolute cropping, for fine tuning results.
\t-m "left top right bottom"
"""
    sys.exit(0)

def cut_length(dictionary, key, factor):
    cut_factor = 1-factor
    cut = float(dictionary[key])*cut_factor
    cut = cut / 4
    return cut

def new_coords(dictionary, key, cut, margin, code = "tl"):
    if code == "tl":
        if key == "x":
            return abs(float(dictionary[key])+(cut+margin["l"]))
        else:
            return abs(float(dictionary[key])-(cut+margin["t"]))
    elif code == "tr":
        if key == "x":
            return abs(float(dictionary[key])-(cut+margin["r"]))
        else:
            return abs(float(dictionary[key])-(cut+margin["t"]))
    elif code == "bl":
        if key == "x":
            return abs(float(dictionary[key])+(cut+margin["l"]))
        else:
            return abs(float(dictionary[key])+(cut+margin["b"]))
    else:
        if key == "x":
            return abs(float(dictionary[key])-(cut+margin["r"]))
        else:
            return abs(float(dictionary[key])+(cut+margin["b"]))

try:
    opts, args = getopt.getopt(sys.argv[1:], "sp:i:o:m:", ["skip", "percent=", "input=", "output=", "margin="])
except getopt.GetoptError, err:
        # print help information and exit:
        print str(err) # will print something like "option -a not recognized"
        usage()
        sys.exit(2)

skipone = 0

for a in opts[:]:
    if a[0] == '-s' or a[0]=='--skip':
        skipone = 1

factor = 0.8 #default scaling factor

for a in opts[:]:
    if a[0] == '-p' or a[0]=='--percent':
        if a[1] != None:
            try:
                factor = float(a[1])
            except ValueError:
                print "Factor must be a number."
                sys.exit(2) #exit if the factor is not a number

input_file = None #no default input file

for a in opts[:]:
    if a[0] == '-i' or a[0]=='--input':
        if a[1] != None:
            try:
                if a[1][-4:]=='.pdf':
                    input_file = a[1]
                else:
                    print "Input file must be a PDF."
                    sys.exit(2) #exit if no appropriate input file
            except TypeError:
                print "Input file must be a PDF."
                sys.exit(2) #exit if no appropriate input file
            except IndexError:
                print "Input file must be a PDF."
                sys.exit(2) #exit if no appropriate input file
        else:
            print "Please specify an input file."
            sys.exit(2) #exit if no appropriate input file

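if input_file is None:
    print "Please specify an input file."
    sys.exit(2) #exit if no input file was given at all

output_file = "%s_cropped.pdf" %input_file[:-4] #default output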

for a in opts[:]:
    if a[0] == '-o' or a[0]=='--output': 
        if a[1]!= None:
            try:
                if a[1][-4:]=='.pdf':
                    output_file = a[1]
                else:
                    print "Output file must be a PDF."
            except TypeError:
                print "Output file must be a PDF."
            except IndexError:
                print "Output file must be a PDF."

margin = {"l": 0, "t": 0, "r": 0, "b": 0}

for a in opts[:]:
    if a[0] == '-m' or a[0]=='--margin':
        if a[1]!= None:
            m_temp = a[1].strip("\"").split()
            margin["l"] = float(m_temp[0])
            margin["t"] = float(m_temp[1])
            margin["r"] = float(m_temp[2])
            margin["b"] = float(m_temp[3])
        else:
            print "Error"

input1 = PdfFileReader(file(input_file, "rb"))

output = PdfFileWriter()
outputstream = file(output_file, "wb")

pages = input1.getNumPages()

top_right = {'x': input1.getPage(1).mediaBox.getUpperRight_x(), 'y': input1.getPage(1).mediaBox.getUpperRight_y()}
top_left = {'x': input1.getPage(1).mediaBox.getUpperLeft_x(), 'y': input1.getPage(1).mediaBox.getUpperLeft_y()}
bottom_right = {'x': input1.getPage(1).mediaBox.getLowerRight_x(), 'y': input1.getPage(1).mediaBox.getLowerRight_y()}
bottom_left = {'x': input1.getPage(1).mediaBox.getLowerLeft_x(), 'y': input1.getPage(1).mediaBox.getLowerLeft_y()}

print('Page dim.\t%f by %f' %(top_right['x'], top_right['y']))

cut = cut_length(top_right, 'x', factor)

new_tr = (new_coords(top_right, 'x', cut, margin, code = "tr"), new_coords(top_right, 'y', cut, margin, code = "tr"))
new_br = (new_coords(bottom_right, 'x', cut, margin, code = "br"), new_coords(bottom_right, 'y', cut, margin, code = "br" ))
new_tl = (new_coords(top_left, 'x', cut, margin, code = "tl"), new_coords(top_left, 'y', cut, margin, code = "tl"))
new_bl = (new_coords(bottom_left, 'x', cut, margin, code = "bl"), new_coords(bottom_left, 'y', cut, margin, code = "bl"))

if skipone == 0:
    for i in range(0, pages):
        page = input1.getPage(i)
        page.mediaBox.upperLeft = new_tl
        page.mediaBox.upperRight = new_tr
        page.mediaBox.lowerLeft = new_bl
        page.mediaBox.lowerRight = new_br
        output.addPage(page)
else:
    for i in range(1, pages):
        page = input1.getPage(i)
        page.mediaBox.upperLeft = new_tl
        page.mediaBox.upperRight = new_tr
        page.mediaBox.lowerLeft = new_bl
        page.mediaBox.lowerRight = new_br
        output.addPage(page)

output.write(outputstream)
outputstream.close()
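The effect of the -p factor is easy to misread, so here is a small Python 3 sketch of the arithmetic only (the script above is Python 2 and needs pyPdf; this sketch needs neither). The name `cropped_box` is hypothetical, and the `abs()` calls of the original are left out. As in the script, the cut is derived from the upper-right x coordinate and applied to all four sides.

```python
def cut_length(upper_right_x, factor):
    """Amount removed from each side, as computed by the script's cut_length()."""
    return upper_right_x * (1 - factor) / 4


def cropped_box(media_box, factor, margin=(0, 0, 0, 0)):
    """media_box is (llx, lly, urx, ury); margin is (left, top, right, bottom)."""
    llx, lly, urx, ury = media_box
    left, top, right, bottom = margin
    cut = cut_length(urx, factor)
    return (
        llx + cut + left,
        lly + cut + bottom,
        urx - cut - right,
        ury - cut - top,
    )


# US Letter page (612 x 792 bp) with the script's default factor of 0.8.
box = cropped_box((0, 0, 612, 792), 0.8)
print([round(v, 2) for v in box])
```

With a factor of 0.8 and a page 612 bp wide, about 30.6 bp is cut from every side, so the width shrinks by roughly 10%, not 20%.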

Answer 3

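I really like Alexander Grahn's script, but I was missing a feature that allows a small margin. I made a small modification to the script so that it allows this margin, as the original pdfcrop does.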

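Since I am new to this part of Stack Exchange, I cannot comment, so I am posting the whole script here. Unfortunately, I am not very good with bash, so I wasted some time trying to make the margin optional and eventually gave up. I kept the margin declaration outside the Perl script, so it should be doable with a bit more bash-foo.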

#!/bin/bash

MARGIN=10

(
    gs -dNOPAUSE -q -dBATCH -sDEVICE=bbox "$1" 2>&1 | grep '%%BoundingBox'
    pdftk "$1" output - uncompress
) | perl -w -n -e '
    $margin = '$MARGIN';
    if (/BoundingBox:\s+(\d+\s+\d+\s+\d+\s+\d+)/) {
        push @bbox, $1; next;
    }
    elsif (/pdftk_PageNum\s+(\d+)/) {
        # Split the sizes
        @sizes = split(/ /, $bbox[$1-1]);

        # Add or subtract the margin size
        $j = 0;
        foreach(@sizes) {
            if($j < 2) {
                $_ = $_ - $margin; 
            } else {
                $_ = $_ + $margin;
            }
            $j++;
        }

        # Print the box
        print "/MediaBox [" .join(" ", @sizes) . "]\n";
    }
    elsif (/MediaBox/) {
        next;
    }
    print;
'  | pdftk - output "$2" compress

Answer 4

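I found this project to be a good alternative to pdfcrop: https://github.com/abarker/pdfCropMargins It is a Python package with a large number of command-line options. An optional GUI is also available.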

For example, the command:

$ pdf-crop-margins -u -s in.pdf

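crops in.pdf so that all pages are set to the same size and the amount cropped is the same on every page; by default, 10% of the existing margins is retained. The output file is about the same size as the input file, and links and annotations are preserved as well.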
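As a rough illustration of what "uniform" and "retain 10%" mean together, here is a Python sketch of the arithmetic. It is not pdfCropMargins code: the function name `uniform_crop` is hypothetical, and it assumes that the uniform amount for each side is the smallest margin found on any page, which the answer itself does not state.

```python
def uniform_crop(margins_per_page, keep=0.10):
    """Return one (left, bottom, right, top) crop amount used for every page.

    margins_per_page holds the white margin of each page as
    (left, bottom, right, top) in bp. Assumption: each side is cropped by
    the smallest margin seen on any page, minus the fraction to keep.
    """
    smallest = [min(side) for side in zip(*margins_per_page)]
    return [m * (1 - keep) for m in smallest]


# Two made-up pages with different margins.
pages = [(72, 72, 72, 72), (50, 80, 60, 90)]
print([round(v, 2) for v in uniform_crop(pages)])
```

Cropping by the smallest margin ensures no page loses content, at the cost of leaving extra white space on pages that had wider margins.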
