- Extract individual pages from the PDF. None of the open source OCR software I read about or tried could read PDF directly. The easiest way to get the pages out is to run Ghostscript and render them to TIFF or PNM, for example:
gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE inputfile.pdf
- Run OCR on the individual pages. I tried Ocrad and Tesseract (versions 1.02 and 2.03). Ocrad supported the Swedish characters I had in my documents, but its overall accuracy was rather poor. Tesseract did not support Swedish characters, but both versions were more accurate than Ocrad, and version 2 was the best overall (and supports many other languages if you bother to train it). Training it on Swedish was more work than manually fixing the results, so I did not take that step, but I was certainly tempted.
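The two steps above can be sketched in Ruby as plain argument lists, which sidesteps shell quoting when a file name contains spaces. This is only an illustration: the helper names `gs_command` and `tesseract_command` are made up here, and the sketch assumes `gs` and `tesseract` are on the PATH and that tesseract writes its text to the output base name plus `.txt`.

```ruby
# Hypothetical helpers (not part of any library): build the command lines
# for the two steps as argument arrays, so no shell quoting is needed.

# Step 1: render every page of a PDF to a grayscale TIFF at the given dpi.
def gs_command(pdf, dpi = 300, pattern = 'ocr_%02d.tif')
  ['gs', "-r#{dpi}x#{dpi}", '-sDEVICE=tiffgray', "-sOutputFile=#{pattern}",
   '-dBATCH', '-dNOPAUSE', pdf]
end

# Step 2: run tesseract on one page image; the text ends up in "<base>.txt".
def tesseract_command(page)
  base = page.sub(/\.tiff?\z/i, '')
  ['tesseract', page, base]
end

# Usage (assumes gs and tesseract are installed and on the PATH):
#   system(*gs_command('inputfile.pdf'))
#   system(*tesseract_command('ocr_01.tif'))
```

Passing an array to `system` runs the program directly without a shell, so the PDF name needs no escaping.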
The following Ruby script automates both steps:

#!/usr/bin/env ruby
# Usage: ./ocr.rb file1.pdf <file2.pdf> ...
# Needs gs (ghostscript), tesseract and zip to be installed.
if ARGV.empty?
  puts "usage: ./ocr.rb file1.pdf <file2.pdf> ..."
  exit(0)
end
$basedir = Dir.getwd
ARGV.grep(/\.pdf$/i).each do |pdf|
  # Working directory next to the PDF: MyDoc.pdf -> MyDoc_OCR
  dir = pdf.sub(/\.pdf$/i, '') + '_OCR'
  Dir.mkdir(dir) unless File.exist?(dir)
  Dir.chdir(dir)
  puts "Extracting pages from PDF: #{pdf}"
  # Render each page to a 300 dpi grayscale TIFF: ocr_01.tif, ocr_02.tif, ...
  pdf_path = File.expand_path(pdf, $basedir)
  system "gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE \"#{pdf_path}\""
  tiff_pages = Dir.new('.').grep(/^ocr.*\.tif$/).sort
  puts "Running tesseract OCR on pages: #{tiff_pages.join(', ')}"
  tiff_pages.each do |page|
    # tesseract writes its result to <page_base>.txt
    page_base = page.sub(/\.tif$/, '')
    print "#{page_base} "
    system "/usr/local/bin/tesseract #{page} #{page_base}"
  end
  Dir.chdir($basedir)
  # Collect the per-page text files and zip them up
  ocr_pages = Dir.new(dir).grep(/^ocr.*\.txt$/).sort
  if ocr_pages.empty?
    puts "No OCR result pages found"
  else
    puts "Created OCR result pages: #{ocr_pages.join(', ')}"
    archive = "#{dir}.zip"
    puts "Creating archive of result pages: #{archive}"
    system "zip -r \"#{archive}\" #{ocr_pages.map { |p| "\"#{dir}/#{p}\"" }.join(' ')}"
  end
  puts ""
end
This script extracts the pages to TIFF, runs Tesseract OCR on each page, and finally builds a ZIP file of the results, named after the original PDF: MyDoc.pdf becomes MyDoc_OCR.zip. The intermediate TIFF and TXT files are kept in a subdirectory (MyDoc_OCR/*).
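The naming scheme can be checked in isolation. Here is a small sketch, where `ocr_names` is a hypothetical helper; it assumes the extension should be matched regardless of case and that any directory part of the path is dropped, neither of which is required by the script itself.

```ruby
# Hypothetical helper: derive the working directory and archive name
# that correspond to a given PDF file name.
def ocr_names(pdf)
  # Assumption: drop the directory part and strip ".pdf" in any case
  base = File.basename(pdf).sub(/\.pdf\z/i, '')
  dir = "#{base}_OCR"
  { dir: dir, archive: "#{dir}.zip" }
end

names = ocr_names('MyDoc.PDF')
puts names[:dir]      # MyDoc_OCR
puts names[:archive]  # MyDoc_OCR.zip
```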
If, on the other hand, I simply did not look far enough and there are better utilities and GUI applications for this on Linux, feel free to comment.