Removing blank pages from a PDF on the command line

Answer 1

Thanks to gmatht for the code. I modified it to use Ghostscript to check each page's ink coverage and to drop pages whose coverage is below a threshold (0.1%).

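#!/bin/sh
# Usage: <this script> input.pdf
# Writes <basename>.pdf to the current directory, so run it from a
# directory other than the one that holds the input file.
IN="$1"
filename=$(basename "${IN}")
filename="${filename%.*}"
PAGES=$(pdfinfo "$IN" | grep ^Pages: | tr -dc '0-9')

# Print the numbers of all pages whose summed ink coverage exceeds 0.001.
non_blank() {
    for i in $(seq 1 "$PAGES")
    do
        # inkcov prints one "C M Y K CMYK OK" line per page; sum the four channels.
        PERCENT=$(gs -o - -dFirstPage="${i}" -dLastPage="${i}" -sDEVICE=inkcov "$IN" | grep CMYK | awk 'BEGIN { sum=0; } {sum += $1 + $2 + $3 + $4;} END { printf "%.5f\n", sum } ')
        if [ "$(echo "$PERCENT > 0.001" | bc)" -eq 1 ]
        then
            echo "$i"
        fi
        # Progress indicator on stderr, one dot per page
        printf . 1>&2
    done | tee "$filename.tmp"
    echo 1>&2
}

# The page list is left unquoted on purpose so each number is its own argument.
pdftk "${IN}" cat $(non_blank) output "${filename}.pdf"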
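
To make the coverage test easier to follow, here is a minimal sketch of just the summing and threshold steps, run on canned text instead of real Ghostscript output. The two sample lines are illustrative (they only mimic the "C M Y K CMYK OK" layout that the script greps for), and the function names ink_sum and is_non_blank are hypothetical.

```shell
#!/bin/sh
# Sum the four coverage columns (C, M, Y, K) of every "CMYK" line on stdin.
ink_sum() {
    grep CMYK | awk '{ sum += $1 + $2 + $3 + $4 } END { printf "%.5f\n", sum }'
}

# Succeed when the coverage sum ($1) is above the threshold ($2, default 0.001).
is_non_blank() {
    awk -v s="$1" -v t="${2:-0.001}" 'BEGIN { exit !(s + 0 > t + 0) }'
}

# Illustrative sample lines, not real measurements.
blank=' 0.00000  0.00000  0.00000  0.00000 CMYK OK'
inked=' 0.00000  0.00000  0.00000  0.02230 CMYK OK'

for line in "$blank" "$inked"; do
    total=$(printf '%s\n' "$line" | ink_sum)
    if is_non_blank "$total"; then
        echo "$total keep"
    else
        echo "$total drop"
    fi
done
```

Doing the comparison in awk avoids the dependency on bc, but the threshold still has to be tuned for your documents: scanned pages with a little noise can easily exceed 0.001.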

Answer 2

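There doesn't seem to be a utility that removes blank pages from a PDF, but we can use the convert command from imagemagick to create a colour histogram. A blank slide has only one entry, which can be detected with wc. Once we have the list of non-blank pages, we can feed it to pdftk.

Note that imagemagick numbers pages from 0, so we need to adjust for that. We can use a low value for the -density flag to improve performance (although too low a value seems to make imagemagick segfault).

If we name the following script pdf_rm_blank.sh, running pdf_rm_blank.sh A creates A.rm.pdf from A.pdf.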

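#!/bin/sh
# Usage: pdf_rm_blank.sh A   (reads A.pdf, writes A.rm.pdf)
IN="$1"
PAGES=$(pdfinfo "$IN.pdf" | grep ^Pages: | tr -dc '0-9')

# Print the numbers of all pages that contain more than one colour.
non_blank() {
    for i in $(seq 1 "$PAGES")
    do
        # imagemagick page indices start at 0, hence i-1. A page with a
        # single colour yields a one-line histogram and counts as blank.
        if [ $(convert -density 35 "$IN.pdf[$((i-1))]" -define histogram:unique-colors=true -format %c histogram:info:- | wc -l) -ne 1 ]
        then
            echo "$i"
        fi
        # Progress indicator on stderr, one dot per page
        printf . 1>&2
    done | tee out.tmp
    echo 1>&2
}

# The page list is left unquoted on purpose so each number is its own argument.
pdftk "$IN.pdf" cat $(non_blank) output "$IN.rm.pdf"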
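
The two ideas the script relies on, the off-by-one page index and the "one histogram entry means blank" test, can be tried without imagemagick. The sketch below runs on made-up histogram text; the exact layout of real histogram output depends on the imagemagick version, and the function names im_index and histogram_is_blank are hypothetical.

```shell
#!/bin/sh
# Convert a 1-based page number (pdfinfo, pdftk) to imagemagick's 0-based index.
im_index() {
    echo $(( $1 - 1 ))
}

# Succeed when the histogram text on stdin has exactly one non-empty line,
# i.e. the rendered page contains a single colour.
histogram_is_blank() {
    [ "$(grep -c .)" -eq 1 ]
}

# Illustrative histogram text, not captured from a real run.
one_colour='     86700: (255,255,255) #FFFFFF white'
two_colours='     86000: (255,255,255) #FFFFFF white
       700: (0,0,0) #000000 black'

echo "page 1 is index $(im_index 1)"
printf '%s\n' "$one_colour"  | histogram_is_blank && echo "one colour: blank"
printf '%s\n' "$two_colours" | histogram_is_blank || echo "two colours: not blank"
```

Unlike the wc -l test in the script, this variant ignores empty lines, which is an assumption about what counts as a histogram entry rather than something the original answer specifies.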

Answer 3

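As most of the answers indicate, there doesn't seem to be a single tool that does this. I had to piece together a few existing tools, starting from the gs usage in @Antony's answer.

I found that I couldn't automate this from top to bottom; it needed some fine-tuning. The scripts below take a directory and operate in bulk on all of its PDF files. I ended up with 3 distinct steps:

Analysis using Ghostscript's ink_cov output device (I found the average percentage values from ink_cov more useful than the "percentage of pixels with a non-zero channel" values returned by inkcov):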

#!/usr/bin/env bash

device="ink_cov"
out="/tmp/pdf_trim/analysis.txt"

[ "$#" -eq 1 ] || { echo "Target directory required as argument"; exit 1; }
in="$(realpath "$1")"
# Start from an empty analysis file
mkdir -p "$(dirname "$out")"
rm -f "$out"
cd "$in" || exit 1
# Emit one line per page: <dir> <file> <page> <C> <M> <Y> <K> CMYK OK
# Columns are space-separated, so paths must not contain spaces.
find . -name '*.pdf' | while IFS= read -r p; do
  gs -o - -sDEVICE="$device" "$p" | grep CMYK | grep -n '' | \
    sed 's/:/ /; s|^|'"$in"' '"${p#./}"' |' | \
    tee -a "$out"
done

Usage example:

./script_1 ./target_folder
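
The later steps depend on the column layout of analysis.txt, so it may help to see how one record is built. This is a minimal sketch on canned input: the sample lines only imitate the "C M Y K CMYK OK" layout, and tag_pages is a hypothetical helper that uses awk in place of the grep -n and sed pipeline.

```shell
#!/bin/sh
# Prefix each CMYK line on stdin with "<dir> <file> <page>", where <page> is
# the line's position among the CMYK lines (Ghostscript prints one per page).
tag_pages() {
    grep CMYK | awk -v dir="$1" -v file="$2" '{ $1 = $1; print dir, file, NR, $0 }'
}

# Illustrative input: a banner line followed by two page lines.
printf '%s\n' \
    'Processing pages 1 through 2.' \
    ' 0.00000  0.00000  0.00000  0.00000 CMYK OK' \
    ' 0.10000  0.20000  0.30000  0.40000 CMYK OK' |
    tag_pages /docs report.pdf
```

Each record has nine space-separated columns (directory, file, page, four channel values, "CMYK", "OK"), which leaves column 10 free for a computed value. The layout breaks if a directory or file name contains a space.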

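Experiment with different "blank page" criteria built from the metrics Ghostscript reports until you find a good one for your PDFs. A good criterion sorts your pages from blank to non-blank.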
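
One way to compare candidate criteria quickly is to rank a handful of records by hand. The sketch below assumes records of the form "<dir> <file> <page> <C> <M> <Y> <K> CMYK OK", so the channel values sit in columns 4 to 7; the sample values are made up and rank_pages is a hypothetical helper.

```shell
#!/bin/sh
# Append the value of an awk expression ($1) as a 10th column, then sort
# the records numerically by that column, smallest (blankest) first.
# The expression is pasted into the awk program, so only pass trusted text.
rank_pages() {
    awk "{ print \$0, $1 }" | LC_ALL=C sort -n -k 10
}

# Illustrative records, not real measurements.
sample='/docs a.pdf 1 0.1 0.2 0.3 0.4 CMYK OK
/docs a.pdf 2 0 0 0 0 CMYK OK
/docs a.pdf 3 0 0 0 0.02 CMYK OK'

echo "sum of all channels:"
printf '%s\n' "$sample" | rank_pages '$4+$5+$6+$7'
echo "black channel only:"
printf '%s\n' "$sample" | rank_pages '$7'
```

With both expressions the empty page 2 sorts first, but the gap between pages 3 and 1 differs, which is the kind of difference that decides where a threshold can be placed.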

#!/usr/bin/env bash

criteria='$4+$5+$6+$7'

in="/tmp/pdf_trim/analysis.txt"
out="/tmp/pdf_trim/criteria"

tmp_over="/tmp/pdf_trim/over.pdf"
tmp_under="/tmp/pdf_trim/under.pdf"

# Apply the criteria to each line (appended as column 10), and then sort
with_criteria="$(cat "$in" | awk '{ print $0, '$criteria' }' | \
  sort -n -k 10 | tee "$out.txt")"

# Create an overlay pdf with the criteria values printed
echo "$with_criteria" | awk '{printf "%s: %s p%s\n", $10, $2, $3 }' | \
  enscript --no-header --font Courier-Bold18 --lines-per-page 1 -o - | \
  ps2pdf - "$tmp_over"

# Create an underlay pdf with the sorted pages by generating PDFtk handle lists.
# Each line number becomes a letters-only handle (tr maps 0-9 to A-J).
handles="$(paste -d ' ' \
  <(echo "$with_criteria" | grep -n '' | sed 's/:.*//' | tr '0-9' 'A-Z') \
  <(echo "$with_criteria"))"
# The directory column holds absolute paths, so no directory change is needed.
pdftk $(echo "$handles" | awk '{ printf "%s=%s/%s ", $1, $2, $3 }') \
  cat $(echo "$handles" | awk '{ printf "%s%s ", $1, $4}') \
  output "$tmp_under"

# Merge them into the final result & remove temporary files
pdftk "$tmp_over" multibackground "$tmp_under" output "$out.pdf"
rm "$tmp_over" "$tmp_under"

Usage example (this creates a criteria.pdf file whose pages are sorted by the criterion you chose):

./script_2
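
The least obvious part of this step is how each page gets its own pdftk handle. pdftk handles are written in upper-case letters, so the script turns each line number into letters with tr. A minimal sketch of that mapping follows; handle_for is a hypothetical name and the path is a placeholder.

```shell
#!/bin/sh
# Map a line number to a letters-only handle: 0->A, 1->B, ..., 9->J.
# tr pairs the ten digits with the first ten letters of the second set.
handle_for() {
    printf '%s\n' "$1" | tr '0-9' 'A-J'
}

for n in 1 9 12 105; do
    echo "$n -> $(handle_for "$n")"
done

# How one record turns into the two pdftk arguments (placeholder path):
h=$(handle_for 12)
echo "input argument: $h=/docs/report.pdf"
echo "cat argument:   ${h}3"
```

Because every digit maps to a different letter, distinct line numbers always give distinct handles, at the cost of one input argument per page.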

Regenerate each PDF in bulk in a new directory, minus the blank pages:

#!/usr/bin/env bash

threshold=1.59

input="/tmp/pdf_trim/criteria.txt"

[ "$#" -eq 1 ] || { echo "Output directory required as argument"; exit 1; }
out="$(realpath "$1")"

# in_list: every analysed page; out_list: pages above the threshold,
# ordered by file and then by page number
in_list="$(cat "$input")"
out_list="$(cat "$input" | awk '$10 >'$threshold' {print}' | \
  sort -k 2,2 -k 3,3n)"

in_files="$(echo "$in_list" | cut -d ' ' -f 1,2 | sort -u )"
out_files="$(echo "$out_list" | cut -d ' ' -f 1,2 | sort -u)"

# Each f is "<dir> <file>"; rebuild the file from its surviving pages
echo "$out_files" | while IFS= read -r f; do
  dest="$(echo "$f" | sed 's|[^ ]* |'"$out"'/|; s/\.pdf$/_trimmed.pdf/')"
  echo "$dest"
  mkdir -p "$(dirname "$dest")"
  pdftk "$(echo "$f" | sed 's| |/|')" \
    cat $(echo "$out_list" | grep -F -- "$f " | cut -d ' ' -f 3 | tr '\n' ' ' | \
      sed 's/ $//') \
    output "$dest"
done

printf "\nTrimmed %s pages with criteria value below %s\n" \
  "$(($(echo "$in_list" | wc -l) - $(echo "$out_list" | wc -l)))" "$threshold"
printf "All pages were skipped from the following files:\n%s\n" \
  "$(comm -23 <(echo "$in_files") <(echo "$out_files") | sed 's/^/\t/; s| |/|')"

Usage example:

./script_3 ./output_directory
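
The core of this step is picking, for each file, the pages whose criterion value is above the threshold. Here is a minimal sketch of that selection on made-up records, assuming the ten-column layout "<dir> <file> <page> <C> <M> <Y> <K> CMYK OK <value>"; pages_to_keep is a hypothetical helper.

```shell
#!/bin/sh
# Read criteria records on stdin and print, for the file named in $1, the
# page numbers whose 10th column exceeds the threshold in $2, in ascending
# order and separated by spaces (the form pdftk's cat operation takes).
pages_to_keep() {
    awk -v file="$1" -v t="$2" '$2 == file && $10 + 0 > t + 0 { print $3 }' |
        sort -n | tr '\n' ' ' | sed 's/ $//'
}

# Illustrative records, already sorted by the criterion as in criteria.txt.
sample='/docs a.pdf 2 0 0 0 0 CMYK OK 0
/docs a.pdf 3 0 0 0 0.85 CMYK OK 0.85
/docs b.pdf 1 0.5 0.5 0.5 0.5 CMYK OK 2
/docs a.pdf 4 1.3 1.3 1.3 1.3 CMYK OK 5.2
/docs a.pdf 1 3.1 3.1 3.1 3.1 CMYK OK 12.4'

echo "a.pdf: $(printf '%s\n' "$sample" | pages_to_keep a.pdf 1.59)"
echo "b.pdf: $(printf '%s\n' "$sample" | pages_to_keep b.pdf 1.59)"
```

Matching on the exact file column, rather than with grep, avoids picking up pages from a file whose name merely contains the target name.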

I wrote a detailed post about this, with more information and an explanation of each step.

Answer 4

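If you can assume that a page is blank when it contains no text, you can use the code below. I don't think this will work if your PDF has pages that contain only charts, images, and the like.

First extract the text of the PDF with xpdf's pdftotext. The character 0x0C (form feed) marks a page break, i.e. the start of a new page. To remove the blank pages, just use pdftk and cat all non-blank pages into a new PDF.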

import java.io.File;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class PdfBlankPageRemover {
  /** page break (form feed, 0x0C) written by pdftotext after each page */
  private static final String PAGEBREAK = "\f";
  /** dummy string for an empty page */
  private static final String EMPTY_PAGE = "EMPTY_PAGE";
  /** location of the pdftk executable; adjust for your installation */
  private static final String PDFTK = "..\\tools\\pdftk\\bin\\pdftk.exe";

  /**
   * @param pdfIn  input PDF
   * @param pdfOut output PDF without the empty pages
   * @param txt    output of xpdf's pdftotext for pdfIn
   * @return the bytes of the output PDF
   * @throws Exception if pdftk exits with a non-zero result code
   */
  private static byte[] removeEmptyPages(File pdfIn, File pdfOut, String txt) throws Exception {
    // StringTokenizer skips empty tokens, so mark the end of every page with
    // a dummy text; a page without text then consists of the dummy text only
    txt = txt.replace(PAGEBREAK, EMPTY_PAGE + PAGEBREAK);
    StringTokenizer tokenizer = new StringTokenizer(txt, PAGEBREAK);
    int pageCounter = 0;
    boolean foundEmptyPage = false;
    // pdftk command line: pdftk <in> cat <pages...> output <out>
    List<String> cmd = new ArrayList<>();
    cmd.add(PDFTK);
    cmd.add(pdfIn.toString());
    cmd.add("cat");
    while (tokenizer.hasMoreTokens()) {
      String currentPage = tokenizer.nextToken();
      pageCounter++;
      if (currentPage.equals(EMPTY_PAGE)) {
        foundEmptyPage = true;
      } else {
        cmd.add(String.valueOf(pageCounter));
      }
    }

    if (foundEmptyPage) {
      cmd.add("output");
      cmd.add(pdfOut.toString());
      int resultCode = new ProcessBuilder(cmd).inheritIO().start().waitFor();
      if (resultCode != 0) {
        throw new Exception("Result code for " + cmd + " was " + resultCode);
      }
    } else {
      // if no empty pages, the output is a copy of the input file
      Files.copy(pdfIn.toPath(), pdfOut.toPath(), StandardCopyOption.REPLACE_EXISTING);
    }
    return Files.readAllBytes(pdfOut.toPath());
  }
}
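
The same page-splitting idea can be tried from the shell, which fits the command-line framing of the question. This is a minimal sketch rather than a port of the Java code: it assumes the text on stdin ends every page with a form feed, as described above, and the function name pages_with_text is hypothetical.

```shell
#!/bin/sh
# Read extracted text on stdin, where each page ends with a form feed (0x0C),
# and print the 1-based numbers of the pages that contain any visible text.
pages_with_text() {
    awk 'BEGIN { RS = "\f" } /[^ \t\r\n]/ { print NR }'
}

# Illustrative input: three pages, the second one empty.
printf 'first page\f\fthird page\f' | pages_with_text
```

The printed numbers could then be passed to pdftk's cat operation. Unlike the Java version, this sketch also treats pages that contain only whitespace as blank.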
