I want to run this command on every file in a directory:
tesseract /home/kong/Documents/input/248.jpg stdout --psm 1 --oem 1 --dpi 300 tsv >/home/kong/Documents/input/ocr_output/input/248.tsv
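The input and output files should share the same number, for example 248.jpg and 248.tsv.
I tried writing a Python script, but it ran into delimiter problems.
Can anyone help? I'm new to bash.
Here is the Python script I wrote: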
comm = shlex.split(command)
out_dir = '/home/kong/Documents/input/ocr_output/input'
for file in tqdm(files):
    base_name = os.path.basename(file)
    number = base_name.split('.')[0]
    out_path = '>' + out_dir + '/' + number + '.tsv'
    comm[1] = file
    comm[-1] = out_path
    # tsv = number + '.tsv'
    with open(out_path, 'w') as f:
        subprocess.run(comm, shell=True, stdout=f)
Answer 1
Try this:
source_dir=/your/source/dir
output_dir=/your/output/dir
cd "$source_dir" || exit
for file in *.jpg; do
    tesseract "$file" stdout --psm 1 --oem 1 --dpi 300 tsv > "$output_dir/${file%.jpg}.tsv"
done
Answer 2
Alternatively, you can use this script with Python 3.5 or later.
import os
import subprocess as sp
# input directory
in_dir = '/home/kong/Documents/input/'
# output directory
out_dir = '/home/kong/Documents/input/ocr_output/input/'
# list of files in input directory
files = [f for f in os.listdir(in_dir)
         if os.path.isfile(os.path.join(in_dir, f))]
for file in files:
    # input file
    in_file = os.path.join(in_dir, file)
    basename = os.path.splitext(file)[0]
    # output file
    out_file = os.path.join(out_dir, basename + '.tsv')
    # run command and save its output to out with utf-8 encoding
    out = sp.run(['tesseract', in_file, 'stdout', '--psm', '1',
                  '--oem', '1', '--dpi', '300', 'tsv'],
                 stdout=sp.PIPE).stdout.decode('utf-8')
    # save command output to file
    with open(out_file, 'w') as f:
        f.write(out)
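A possible variation on the script above, offered as a minimal sketch rather than a definitive version: it only picks up files matching *.jpg and hands an open file object to subprocess.run, so no '>' redirection and no shell=True are needed. The helper names build_jobs and run_jobs, and the pattern parameter, are hypothetical and chosen here for illustration; the tesseract arguments are copied unchanged from the question.

```python
import subprocess
from pathlib import Path


def build_jobs(in_dir, out_dir, pattern="*.jpg"):
    """Return (command, output_path) pairs, one per matching input file."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    jobs = []
    for image in sorted(in_dir.glob(pattern)):
        # 248.jpg -> 248.tsv; Path.stem drops only the final extension
        out_path = out_dir / (image.stem + ".tsv")
        cmd = ["tesseract", str(image), "stdout",
               "--psm", "1", "--oem", "1", "--dpi", "300", "tsv"]
        jobs.append((cmd, out_path))
    return jobs


def run_jobs(jobs):
    """Run each command, sending its stdout straight into the output file."""
    for cmd, out_path in jobs:
        # Create the output directory if it does not exist yet
        out_path.parent.mkdir(parents=True, exist_ok=True)
        with open(out_path, "wb") as f:
            # The open file handle takes the place of the shell's '>'
            # redirection; check=True raises if tesseract exits with an error.
            subprocess.run(cmd, stdout=f, check=True)
```

With the directories from the question, this would be called as run_jobs(build_jobs('/home/kong/Documents/input', '/home/kong/Documents/input/ocr_output/input')). Keeping the command as a list and leaving out shell=True means the file names are never re-parsed by a shell, which is likely where the delimiter problems in the original script came from.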