Looking for a tool or script that can hide email addresses in PDF files

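I am scanning someone's personal archive and want to publish the scanned files with the email addresses hidden.

The PDF files have a text layer, so it can be used to find the strings.

What I want is a tool or script that finds email address strings in the text layer, removes the text, and places a black rectangle over it.

I don't expect any tool or script to do this perfectly.

Does anyone know of a tool or script that can do this?

I am familiar with handling PDFs in Python, so if you think that approach is feasible, I would appreciate a link to an example that I can adapt to my specific use.

Thanks!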

Answer 1

You should be able to use PyMuPDF.

See the example script (from https://www.geeksforgeeks.org/pdf-redaction-using-python/):

# imports
import fitz
import re
 
class Redactor:
   
    # static methods work independently of any class instance
    @staticmethod
    def get_sensitive_data(lines):
        """Yield every email address found in the given lines."""

        # simple email regex; also allows '-', '+', '%' and dotted domains
        EMAIL_REG = r"[\w.+%-]+@[\w-]+(?:\.[\w-]+)+"
        for line in lines:

            # findall returns every match on the line, not only the first,
            # so a line holding several addresses is fully covered
            for match in re.findall(EMAIL_REG, line):

                # yield turns this function into a generator that hands
                # back one address at a time
                yield match
 
    # constructor
    def __init__(self, path):
        self.path = path
 
    def redaction(self):
       
        """ main redactor code """
         
        # opening the pdf
        doc = fitz.open(self.path)
         
        # iterating through pages
        for page in doc:
           
            # wrap_contents helps avoid misaligned redaction boxes
            # on pages whose graphics state is unbalanced
            # (older PyMuPDF releases called this _wrapContents)
            page.wrap_contents()

            # collect the email addresses found in the page's text layer
            # (older PyMuPDF releases used getText / searchFor / addRedactAnnot)
            sensitive = self.get_sensitive_data(page.get_text("text")
                                                .split('\n'))
            for data in sensitive:
                # rectangles covering each occurrence of the address
                areas = page.search_for(data)

                # mark each rectangle for redaction with a black fill
                for area in areas:
                    page.add_redact_annot(area, fill=(0, 0, 0))

                 
            # applying the redaction
            page.apply_redactions()
             
        # saving it to a new pdf
        doc.save('redacted.pdf')
        print("Successfully redacted")
 
# driver code for testing
if __name__ == "__main__":
   
    # replace it with name of the pdf file
    path = 'testing.pdf'
    redactor = Redactor(path)
    redactor.redaction()
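
Since no script will catch everything, it may help to re-open the output and check whether any email address can still be extracted from the text layer. The following is only a minimal sketch and not part of the linked article: `find_emails` and `leftover_emails` are hypothetical helper names, the regex is a deliberately simple pattern rather than a complete email matcher, and it assumes a recent PyMuPDF release where `Page.get_text` is available.

```python
import re

# Deliberately simple pattern (an assumption, not a full RFC 5322 matcher)
EMAIL_REG = re.compile(r"[\w.+%-]+@[\w-]+(?:\.[\w-]+)+")


def find_emails(text):
    """Return every email-like string found in text."""
    return EMAIL_REG.findall(text)


def leftover_emails(pdf_path):
    """Return (page_number, email) pairs still extractable from the PDF."""
    import fitz  # PyMuPDF; imported lazily so find_emails works without it

    leftovers = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # page.number is zero-based, so add 1 for a human-readable page
            for email in find_emails(page.get_text("text")):
                leftovers.append((page.number + 1, email))
    return leftovers

# Example call (assumes the script above has written 'redacted.pdf'):
#   leftover_emails("redacted.pdf")
```

An empty list means no address was found in the text layer. This only inspects the extractable text, so it is still worth looking over the pages visually before publishing.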
