Getting hocr extraction output from only the first page of a TIFF

[Sample image]

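I am looking for guidance on the code given below. I am running it to extract text from a multi-page TIFF into hocr format. I get output for the first page of the TIFF, but the remaining pages are omitted.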

# Python program to extract text from all the images in a folder
# storing the text in corresponding files in a different folder
# This produces hocr output, but only the first page of each TIFF is processed
    
from PIL import Image
import pytesseract as pt
import os
pt.pytesseract.tesseract_cmd = r'C:\Users\admin\AppData\Local\Programs\Tesseract-OCR\tesseract.exe'
     
def main():
    # path for the folder for getting the raw images
    path ="D:\\input"
    # path for the folder for getting the output
    tempPath ="D:\\output"
 
    # iterating the images inside the folder
    for imageName in os.listdir(path):
             
        inputPath = os.path.join(path, imageName)
        img = Image.open(inputPath)
 
        # applying ocr using pytesseract for python
           
        text = pt.image_to_pdf_or_hocr(img, extension = 'hocr', config = (r'--oem 3 --psm 6'), lang ="eng")
         
        fullTempPath = os.path.join(tempPath, 'time_'+imageName+".hocr")
        print(text)
  
        # saving the text for every image in a separate .hocr file
        file1 = open(fullTempPath, "wb")
        file1.write(text)
        file1.close()
  
 
if __name__ == '__main__':
    main()

Answer 1

EDIT:

I checked, and image_to_pdf_or_hocr can take a filename instead of a PIL Image:

text = pt.image_to_pdf_or_hocr('D:\\input\\Best time to visit.tiff', extension='hocr', config=(r'--oem 3 --psm 6'), lang="eng")

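So tesseract can run on the original tiff directly, and it converts all pages into one hocr output.


Original answer:

I took the tiff from the link in your comment and wrote code that saves every page in a separate file. It uses img.seek(page) to select the page. It works for your file.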

from PIL import Image
import os

folder = '/home/furas/Desktop'
filename = 'Best time to visit.tiff'

img = Image.open(os.path.join(folder, filename))

page = 0

while True:
    try:
        img.seek(page)

        filename = f'page-{page+1}.png'
        print('saving...', filename)

        img.save(os.path.join(folder, filename))

        page += 1
    except EOFError:
        # Not enough frames in img
        break
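
The same page-splitting loop can also be written with Pillow's ImageSequence.Iterator, which stops after the last frame so no EOFError handling is needed. This is only a sketch of an alternative, not what the code above does; the function name split_pages and the page-N.png naming are assumptions made for illustration.

```python
import os
from PIL import Image, ImageSequence


def split_pages(tiff_path, out_dir):
    """Save every frame of a multi-page image as page-N.png and return the paths.

    split_pages is a hypothetical helper name, not part of Pillow.
    """
    saved = []
    with Image.open(tiff_path) as img:
        # ImageSequence.Iterator seeks through the frames one by one
        for number, frame in enumerate(ImageSequence.Iterator(img), start=1):
            out_path = os.path.join(out_dir, f"page-{number}.png")
            frame.save(out_path)
            saved.append(out_path)
    return saved
```

Calling split_pages(os.path.join(folder, filename), folder) should write the same files as the loop above.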

A similar loop in your code works for me:

from PIL import Image
import pytesseract as pt
import os

pt.pytesseract.tesseract_cmd = r'C:\Users\admin\AppData\Local\Programs\Tesseract-OCR\tesseract.exe'
     
# path for the folder for getting the raw images
path = "D:\\input"

# path for the folder for getting the output
tempPath = "D:\\output"

# iterating the images inside the folder
for imageName in os.listdir(path):
 
    # only images   
    if imageName.lower().endswith(('.tiff', '.jpg', '.png')):
        print(imageName)
        
        inputPath = os.path.join(path, imageName)
        img = Image.open(inputPath)
    
        page = 0
        while True:
            try:
        
                img.seek(page)
                text = pt.image_to_pdf_or_hocr(img, extension='hocr', config=(r'--oem 3 --psm 6'), lang="eng")
        
                print('page...', page)
                page += 1
         
                fullTempPath = os.path.join(tempPath, f"time_{imageName}_{page}.hocr")
                #print(text)
        
                # saving the hocr output for every page in a separate .hocr file
                with open(fullTempPath, "wb") as file1:
                    file1.write(text)
            except EOFError:
                # Not enough frames in img
                break            

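It has to write every page to a separate .hocr file, because if you append the output of several pages to a single .hocr file, the result is a broken .hocr.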
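
A small illustration of why appending breaks the file, assuming (as is normally the case for Tesseract's hOCR output) that each result is a complete XML document with its own declaration and root element. The bytes below are a shortened stand-in, not real Tesseract output, and is_well_formed is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Stand-in for the hocr bytes of one page (assumed shape: one complete XML document)
page = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<html><body><div class="ocr_page"/></body></html>\n'
)


def is_well_formed(data):
    """Return True if data parses as a single XML document."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(page))         # one page on its own
print(is_well_formed(page + page))  # two pages appended to the same file
```

The second check should fail, because the combined bytes contain a second XML declaration and a second root element.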

To write all pages into one file, you would need to use plain text.
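
A minimal sketch of that plain-text variant, assuming pytesseract.image_to_string is used for each page. The helper name tiff_to_text, its ocr parameter and the newline separator are assumptions for illustration; the ocr parameter exists only so that the function can be tried without Tesseract installed.

```python
from PIL import Image, ImageSequence


def tiff_to_text(tiff_path, ocr=None):
    """Return the plain text of every page, joined with newlines.

    tiff_to_text and the ocr parameter are hypothetical, not part of pytesseract.
    """
    if ocr is None:
        # imported lazily; by default each page goes through pytesseract
        import pytesseract
        ocr = pytesseract.image_to_string
    with Image.open(tiff_path) as img:
        # copy each frame so the OCR call receives an independent image
        pages = [ocr(frame.copy()) for frame in ImageSequence.Iterator(img)]
    return "\n".join(pages)
```

The returned string can then be written to a single .txt file opened in text mode, since plain text, unlike hocr, has no per-page document wrapper.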