Using Tesseract hOCR and txt output together, or converting Tesseract's hOCR to txt

Question

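Answer 1

<?php
/**
 * CLI script that takes the output of tesseract ... hocr as its 1st argument
 * and dumps the text of its paragraph nodes to a plain-text file.
 * Usage: script.php in.tif.html out.txt
 */
if ($argc < 3) {
    fwrite(STDERR, "Usage: script.php in.tif.html out.txt\n");
    exit(1);
}
$inFile = $argv[1];
$outFile = $argv[2];
$stream = file_get_contents($inFile);
if ($stream === false) {
    fwrite(STDERR, "Cannot read $inFile\n");
    exit(1);
}
// loadHTML() is an instance method; calling it statically fails on PHP 8.
$dom = new DOMDocument();
libxml_use_internal_errors(true); // hOCR is XHTML, so suppress HTML parser warnings
$dom->loadHTML($stream);
$out = array();
// Collect the text content of every <p> element in the hOCR document.
foreach ($dom->getElementsByTagName('p') as $tag) {
    $out[] = $tag->nodeValue;
}

// Write one paragraph per line.
file_put_contents($outFile, implode("\n", $out));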
