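I am scanning someone's personal archive and want to publish the scanned files with the email addresses hidden.
The PDF files have a text layer, so it can be used to find strings.
What I want is a tool or script that finds the email address strings in the text layer, deletes the text, and places a black rectangle over it.
I do not expect any tool or script to do this perfectly.
Does anyone know of a tool or script that can do this?
I am familiar with processing PDFs in Python, so if you think that route is feasible, I would be glad to get a link to an example that I can adapt to my specific use.
Thanks!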
Answer 1
You should be able to use PyMuPDF (imported as fitz).
See this example script (from https://www.geeksforgeeks.org/pdf-redaction-using-python/):
# imports
import re

import fitz  # PyMuPDF


class Redactor:
    # loose email pattern; expect some false positives and misses
    EMAIL_REG = r"([\w\.\d]+\@[\w\d]+\.[\w\d]+)"

    # constructor
    def __init__(self, path):
        self.path = path

    # static methods work independently of any class instance
    @staticmethod
    def get_sensitive_data(lines):
        """Yield every email-like string found in the given lines."""
        for line in lines:
            # finditer catches every match on a line, not only the first one
            for match in re.finditer(Redactor.EMAIL_REG, line, re.IGNORECASE):
                # yield turns this function into a generator
                yield match.group(1)

    def redaction(self):
        """Main redactor code."""
        # opening the pdf
        doc = fitz.open(self.path)
        # iterating through pages
        for page in doc:
            # wrap_contents() fixes alignment issues with the rect boxes
            # in some cases (snake_case names as in recent PyMuPDF releases)
            page.wrap_contents()
            # collect the strings on this page that match the email regex
            sensitive = self.get_sensitive_data(page.get_text("text").split("\n"))
            for data in sensitive:
                # search_for returns the rectangles that enclose the string
                for area in page.search_for(data):
                    # mark each area for redaction with a black fill
                    page.add_redact_annot(area, fill=(0, 0, 0))
            # applying the redactions removes the text under the marked areas
            page.apply_redactions()
        # saving it to a new pdf
        doc.save("redacted.pdf")
        print("Successfully redacted")


# driver code for testing
if __name__ == "__main__":
    # replace it with the name of the pdf file
    path = "testing.pdf"
    redactor = Redactor(path)
    redactor.redaction()
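
Since no pattern will catch every address, it may help to dry-run the matching step on plain text before redacting anything. The following is a minimal sketch using only the standard library. The name `find_emails`, the sample lines, and the pattern itself are illustrative assumptions, not part of the linked article. The pattern is a somewhat broader variant that also accepts plus signs, hyphens, and multi-label domains such as example.co.uk.

```python
import re

# Assumed pattern: allows dots, plus signs and hyphens in the local part and
# multi-label domains such as example.co.uk. Not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def find_emails(lines):
    """Return each distinct email-like string in lines, in first-seen order."""
    seen = []
    for line in lines:
        # finditer reports every match on the line, not only the first
        for match in EMAIL_RE.finditer(line):
            address = match.group(0)
            if address not in seen:
                seen.append(address)
    return seen


# Hypothetical sample text standing in for one page of extracted text
sample = [
    "Contact: jane.doe@example.co.uk or j.doe+archive@example.org",
    "No address on this line.",
    "Again jane.doe@example.co.uk",
]
print(find_emails(sample))
```

Each string returned this way could then be handed to the page search step of the script above. Reviewing the list first gives a chance to spot addresses the pattern missed or matched only in part.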