- Extract individual pages from the PDF. None of the open source OCR software I read about or tried could run directly on a PDF. The easiest way to extract the pages is to run Ghostscript and print to TIFF or PNM, for example:
gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE inputfile.pdf
- Run OCR on the individual pages. I tried Ocrad and Tesseract (versions 1.02 and 2.03). Ocrad supported the Swedish characters in my documents but otherwise had rather poor OCR performance. Tesseract did not support Swedish characters, but both versions were better than Ocrad, and version 2 was the best overall (it also supports many other languages if you bother to train it). Training it on Swedish would have been more work than manually fixing the results, so I did not take that step, though I was certainly tempted.
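The two steps above can be sketched as small Ruby helpers that build each command as an argument array, so that file names containing spaces need no shell quoting. This is only a sketch: the helper names (`gs_args`, `tesseract_args`, `render_and_ocr`) and their keyword defaults are assumptions made for illustration, while the Ghostscript flags and the `tesseract <image> <output base>` call are the ones used in this post.

```ruby
# Build the Ghostscript argument list that renders a PDF to one grayscale
# TIFF per page. The flags are the ones from the command above; the helper
# name and keyword defaults are illustrative assumptions.
def gs_args(pdf_path, resolution: 300, pattern: 'ocr_%02d.tif')
  [
    'gs',
    "-r#{resolution}x#{resolution}",
    '-sDEVICE=tiffgray',
    "-sOutputFile=#{pattern}",
    '-dBATCH',
    '-dNOPAUSE',
    pdf_path
  ]
end

# Build the Tesseract argument list for one page image.
# Tesseract writes its result to "<output base>.txt".
def tesseract_args(page, tesseract: 'tesseract')
  base = page.sub(/\.tiff?\z/i, '')
  [tesseract, page, base]
end

# Render the PDF into the current directory, then OCR every page.
# Assumes `gs` and `tesseract` are on the PATH.
def render_and_ocr(pdf_path)
  system(*gs_args(pdf_path)) || abort('ghostscript failed')
  Dir.glob('ocr_*.tif').sort.each do |page|
    system(*tesseract_args(page)) || warn("tesseract failed on #{page}")
  end
end
```

Calling `render_and_ocr('inputfile.pdf')` from an empty working directory would leave the `ocr_NN.tif` and `ocr_NN.txt` files side by side. Passing an array to `system` bypasses the shell, which is why the `%02d` page-number template and any spaces in the path go through unchanged.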
#!/usr/bin/env ruby

(ARGV.length > 0) || puts("usage: ./ocr.rb file1.pdf <file2.pdf> ...") || exit(0)

$basedir = Dir.getwd

ARGV.grep(/\.pdf/i).each do |pdf|
  # Working directory named after the PDF: MyDoc.pdf -> MyDoc_OCR
  # (case-insensitive, to match the grep above)
  dir = pdf.gsub(/\.pdf/i, '')
  dir += '_OCR'
  dir += '.dir' if dir == pdf
  Dir.mkdir(dir) unless File.exist?(dir)
  Dir.chdir(dir)

  # Render each PDF page to a 300 dpi grayscale TIFF
  puts "Extracting pages from PDF: #{pdf}"
  system "gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE \"#{$basedir}/#{pdf}\""

  tiff_pages = Dir.new('.').grep(/^ocr.*\.tif$/).sort
  puts "Running tesseract OCR on pages: #{tiff_pages.join(', ')}"
  tiff_pages.each do |page|
    page_base = page.gsub(/\.tif.*/, '')
    print "#{page_base} "
    # Tesseract writes its output to <page_base>.txt
    system "/usr/local/bin/tesseract #{page} #{page_base}"
  end

  Dir.chdir($basedir)

  # Collect the OCR text files and zip them
  ocr_pages = Dir.new(dir).grep(/^ocr.*\.txt$/).sort
  if ocr_pages && ocr_pages.length > 0
    puts "Created OCR result pages: #{ocr_pages.join(', ')}"
    archive = "#{dir}.zip"
    puts "Creating archive of result pages: #{archive}"
    system "zip -r \"#{archive}\" #{ocr_pages.map { |p| "\"#{dir}/#{p}\"" }.join(' ')}"
  else
    puts "No OCR result pages found"
  end
  puts ""
end
This script extracts the pages to TIFF, runs Tesseract OCR on each page and finally builds a ZIP file of the results, named after the original PDF. So MyDoc.pdf is converted to MyDoc_OCR.zip. Intermediate TIFF and TXT files are kept in a subdirectory (MyDoc_OCR/*).
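The naming convention can be sketched as a pair of small helpers. This is a sketch under assumptions: `ocr_names` and `sort_pages` are hypothetical names, not part of the script. The first strips the extension case-insensitively, so MyDoc.PDF and MyDoc.pdf map to the same names. The second orders pages by number, since a plain string sort would put ocr_100.tif before ocr_99.tif once a document passes 99 pages.

```ruby
# Derive the working directory and archive name for a PDF path.
# Hypothetical helper; mirrors the "<name>_OCR" convention described above.
def ocr_names(pdf)
  stem = pdf.sub(/\.pdf\z/i, '')
  dir = "#{stem}_OCR"
  { dir: dir, archive: "#{dir}.zip" }
end

# Keep only page images named ocr_<number>.tif and order them numerically.
def sort_pages(names)
  names.grep(/\Aocr_\d+\.tif\z/).sort_by { |name| name[/\d+/].to_i }
end

names = ocr_names('MyDoc.PDF')
puts names[:dir]
puts names[:archive]
puts sort_pages(['ocr_100.tif', 'ocr_99.tif', 'ocr_01.tif', 'notes.txt']).join(', ')
```

The page order only affects the order in which pages are processed and listed. Each page still ends up in its own text file inside the archive.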
If, on the other hand, I simply did not look far enough and there are better utilities and GUI applications for this on Linux, feel free to comment.