Removing duplicate pages from a PDF

Answer 1

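comparepdf is a command-line tool for comparing PDFs. Its exit code is 0 if the files are the same and non-zero otherwise. You can compare either by text content or by visual appearance (useful for scans, for example):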

comparepdf 1.pdf 2.pdf
comparepdf -ca 1.pdf 2.pdf #compare appearance instead of text

So what you can do is burst the PDF into single pages, then compare them pairwise and delete accordingly:

#!/bin/bash
# explode the PDF into single-page files
pdftk original.pdf burst
# compare 900 pages pairwise
for (( i=1 ; i<=899 ; i++ )) ; do
  # pdftk names the pages pg_0001.pdf, pg_0002.pdf etc.
  pdf1=pg_$(printf %04d "$i").pdf
  pdf2=pg_$(printf %04d "$((i+1))").pdf
  # Remove the first file on a match. The loop does not skip ahead, so runs of
  # three or more identical consecutive pages are also reduced to one.
  if comparepdf "$pdf1" "$pdf2" ; then
     rm "$pdf1"
  fi
done
# reunite in sorted order:
pdftk $(find . -name 'pg_*.pdf' | sort ) cat output new.pdf
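
The pairwise loop above can also be written without a hard-coded page count by walking over whatever pg_*.pdf files the burst produced. The following is only a sketch: the function name dedupe_adjacent is made up for this example, and the comparison command is passed in as the first argument so that any tool that exits with 0 on "same" can be used (comparepdf is the one assumed here).

```shell
#!/bin/bash
# dedupe_adjacent: hypothetical helper, not part of pdftk or comparepdf.
# Usage: dedupe_adjacent COMPARE_CMD FILE...
# COMPARE_CMD must exit with 0 when its two file arguments are considered equal.
dedupe_adjacent() {
  local compare=$1
  shift
  local prev=""
  local f
  for f in "$@"; do
    # As in the loop above, the earlier file of a matching pair is removed,
    # so a run of identical pages collapses to its last member.
    if [ -n "$prev" ] && "$compare" "$prev" "$f" >/dev/null 2>&1; then
      rm -- "$prev"
    fi
    prev=$f
  done
}

# Intended use after `pdftk original.pdf burst` (assumes comparepdf is installed):
#   dedupe_adjacent comparepdf pg_*.pdf
```

Because the shell expands pg_*.pdf in sorted order and the names are zero-padded, the files are visited in page order.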

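Edit: following the comment by @notauto generated, you may prefer to select the pages from the original file instead of reuniting the single-page PDFs. Once the pairwise comparison is done, you can do the following: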

pdftk original.pdf cat $(find -name 'pg_*.pdf' |
                        awk -F '[._]' '{printf "%d\n",$3}' |
                        sort -n ) output new.pdf
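
One detail of the awk step is worth spelling out: the field number $3 only holds the page number because find prints names with a leading "./" (so "./pg_0007.pdf" splits into "", "/pg", "0007", "pdf"). With bare names such as "pg_0007.pdf", $3 would be "pdf". A minimal sketch that does not depend on the prefix is shown below; page_numbers is a hypothetical helper name, and it assumes the pg_NNNN.pdf naming described above.

```shell
#!/bin/bash
# page_numbers: hypothetical helper. Reads file names on stdin (with or
# without a directory prefix) and prints the page numbers, without leading
# zeros, in ascending order.
page_numbers() {
  sed -n 's/.*pg_0*\([0-9][0-9]*\)\.pdf$/\1/p' | sort -n
}

# Intended use (assumes pdftk is installed and the pages were already pruned):
#   pdftk original.pdf cat $(find . -name 'pg_*.pdf' | page_numbers) output new.pdf
```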

Answer 2

Below is a modified version of @FelixJN's code in which I fixed a typo in the printf format string (it has to be %04d). I have tested this code and it works.

#!/bin/bash
pdftk original.pdf burst  # explode the PDF
# the resulting files are named pg_0001.pdf, pg_0002.pdf etc.

for (( i=1 ; i<=1140 ; i++ )) ; do # loop over all the single-page PDF files
  pdf1=pg_$(printf %04d "$i").pdf
  pdf2=pg_$(printf %04d "$((i+1))").pdf
  echo "$pdf1" "$pdf2"
  if comparepdf "$pdf1" "$pdf2" ; then
     rm "$pdf1"  # remove the first file if two adjacent pages are duplicates
  fi
done
# merge the remaining files in sorted order:
pdftk $(find . -name 'pg_*.pdf' | sort ) cat output new.pdf
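
The loop bound in the script is hard-coded and has to be adapted to each document; it should stop at the page count minus one so that pdf2 always exists. As a sketch, the count could instead be read from the report that pdftk's dump_data operation prints, which contains a "NumberOfPages:" line. The helper name page_count_from_dump is invented for this example.

```shell
#!/bin/bash
# page_count_from_dump: hypothetical helper. Reads the output of
# `pdftk FILE dump_data` on stdin and prints the value of NumberOfPages.
page_count_from_dump() {
  awk '/^NumberOfPages:/ { print $2 }'
}

# Intended use (assumes pdftk is installed and original.pdf exists):
#   n=$(pdftk original.pdf dump_data | page_count_from_dump)
#   for (( i=1 ; i<n ; i++ )) ; do ... ; done
```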

Answer 3

If you cannot use the comparepdf tool, the following worked for me (it builds on FelixJN's answer):

# explode the PDF
pdftk original.pdf burst

# delete consecutive pages that have the same file size
# (only pg_*.pdf is matched, so original.pdf and new.pdf are never candidates)
last=-1; find . -type f -name 'pg_*.pdf' -printf '%f\0' | sort -z |
    while IFS= read -r -d '' i; do
        s=$(stat -c '%s' "$i")
        [[ $s = "$last" ]] && rm "$i"
        last=$s
    done

# rebuild the PDF from the pages that are left
pdftk original.pdf cat $(find -name 'pg_*.pdf' |
                        awk -F '[._]' '{printf "%d\n",$3}' |
                        sort -n ) output new.pdf

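This may delete a page that should not have been deleted (two different pages can happen to have the same file size), but I think the probability is low. Source for deleting files of the same size: "How to delete files of the same size in a directory?"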
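
Since the size check can produce false positives, it may be worth listing the candidates before deleting anything. The sketch below is a dry run of the same rule (a file is flagged when its size equals that of the file before it); list_same_size_neighbours is a made-up name, and wc -c is used in place of stat -c only to keep the example portable.

```shell
#!/bin/bash
# list_same_size_neighbours: hypothetical helper. Prints every file whose
# size in bytes equals the size of the file listed just before it.
# Nothing is deleted.
list_same_size_neighbours() {
  local last=-1
  local f size
  for f in "$@"; do
    size=$(wc -c < "$f" | tr -d ' ')
    if [ "$size" = "$last" ]; then
      printf '%s\n' "$f"
    fi
    last=$size
  done
}

# Intended use after `pdftk original.pdf burst`:
#   list_same_size_neighbours pg_*.pdf
```

The printed pages can then be checked by eye, and only the genuine duplicates removed before rebuilding the PDF.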
