How can I clean up old academic paper PDF files?

Answer 1

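The PDF looks like it was scanned and then run through character recognition (OCR). That means the characters you see are actually part of an image that fills the page, and what you select is invisible text layered on top of that image.

So there is no real way to preserve the appearance exactly while also making the text crisp and legible. Your best bet is to find a born-digital version of the paper.

Edit: Based on your comment, I wrote a small script that does what you want: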

require 'hexapdf'

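# This processor makes the hidden OCR text visible: before every
# text-showing operator it sets the fill color to black and the text
# rendering mode to "fill", and it drops all Do (XObject) operators.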
class ContentProcessor

  attr_reader :result

  def initialize
    @result = ''.b
    @serializer = HexaPDF::Serializer.new
  end

  TEXT_SHOW_OPERATORS = [:Tj, :"'", :'"', :TJ].each_with_object({}) {|op, h| h[op] = true }

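  def process(op, operands)
    if TEXT_SHOW_OPERATORS[op]
      # Set the fill color to black (0 g) ...
      @result << HexaPDF::Content::Operator::DEFAULT_OPERATORS[:g].
        serialize(@serializer, 0)
      # ... and the text rendering mode to "fill" (0 Tr) so the text is painted.
      @result << HexaPDF::Content::Operator::DEFAULT_OPERATORS[:Tr].
        serialize(@serializer, 0)
    end
    # Copy every operator through unchanged except Do, which paints XObjects
    # such as the scanned page image.
    if op != :Do
      @result << HexaPDF::Content::Operator::DEFAULT_OPERATORS[op].
        serialize(@serializer, *operands)
    end
  end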

end

HexaPDF::Document.open(ARGV[0]) do |doc|
  doc.pages.each do |page|
    processor = ContentProcessor.new
    HexaPDF::Content::Parser.parse(page.contents, processor)
    page.contents = processor.result
    page[:Contents].set_filter(:FlateDecode)
  end
  doc.write(ARGV[1], validate: false)
end

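This uses the HexaPDF library under the hood (n.b. I am the author of HexaPDF) and can be run like this: ruby script.rb INPUT.PDF OUTPUT.PDF.

I ran the script on your sample PDF and got this output. Most of it looks fine, but there are definitely errors.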
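
To make the rewrite concrete without needing a PDF at hand, here is a minimal sketch of the same transformation applied to a simplified, already-tokenized list of operators. It does not use HexaPDF; the rewrite_operators helper and the sample page array are hypothetical and exist only for illustration, and the operand notation is simplified rather than real PDF syntax.

```ruby
# Illustration only: works on [operator, operands] pairs, not on a real
# PDF content stream. The names below are hypothetical, not HexaPDF API.
TEXT_SHOW_OPERATORS = [:Tj, :"'", :'"', :TJ].freeze

def rewrite_operators(operations)
  operations.each_with_object([]) do |(op, operands), out|
    if TEXT_SHOW_OPERATORS.include?(op)
      out << [:g, [0]]   # fill color: black in DeviceGray
      out << [:Tr, [0]]  # text rendering mode 0: fill the glyphs
    end
    # Do paints XObjects, which includes the scanned page image; drop it.
    out << [op, operands] unless op == :Do
  end
end

# Assumed shape of an OCR'd page: a full-page image, then invisible
# text (rendering mode 3) positioned over it.
page = [
  [:q, []],
  [:Do, [:Im0]],
  [:Q, []],
  [:BT, []],
  [:Tf, [:F1, 10]],
  [:Tr, [3]],
  [:Tj, ["Hello"]],
  [:ET, []],
]

rewritten = rewrite_operators(page)
pp rewritten
```

Because the two extra operators are emitted directly before each text-showing operator, the earlier 3 Tr that hid the text no longer has any effect on it.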
