Jul 13, 2008

Linux / open source OCR batch processing from PDF

I recently needed to run OCR on a PDF of scanned pages and found no direct way to do it on Linux, but I did find a combination of tools that, scripted together, did the job quite nicely. The job breaks down into two steps:
  1. Extract individual pages from PDF. None of the open source OCR software I read about or tried could run directly on PDF. The easiest way to extract from PDF is to run ghostscript and print to TIFF or PNM, for example:
    gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE inputfile.pdf
  2. Run OCR on individual pages. I tried ocrad and tesseract (versions 1.02 and 2.03). Ocrad supported the Swedish characters I had in my documents but otherwise had rather poor overall OCR performance. Tesseract did not support Swedish characters, but both versions were better than Ocrad, and version 2 was overall the best (and supports many other languages if you bother to train it). Training it on Swedish was more work than manually fixing the results, so I did not take that step, but was certainly tempted.
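As a rough sketch of the two steps for a single PDF, the commands can be built as argument arrays in Ruby. The helper names `gs_command` and `tesseract_command` are hypothetical, and the default `tesseract` binary name assumes it is on the PATH; the Ghostscript flags are the ones from the command above.

```ruby
# Hypothetical helpers (illustration only): build each command as an
# argument array so that system(*cmd) bypasses the shell and file names
# containing spaces need no quoting.

# Ghostscript command that renders every page of +pdf+ to a 300 dpi
# greyscale TIFF; %02d in +pattern+ is replaced by the page number.
def gs_command(pdf, pattern = 'ocr_%02d.tif')
  ['gs', '-r300x300', '-sDEVICE=tiffgray', "-sOutputFile=#{pattern}",
   '-dBATCH', '-dNOPAUSE', pdf]
end

# Tesseract command for one page image. The second argument is the output
# base name; tesseract appends ".txt" to it itself.
def tesseract_command(page, tesseract = 'tesseract')
  [tesseract, page, page.sub(/\.tif\z/, '')]
end
```

With these, `system(*gs_command('inputfile.pdf'))` would extract the pages, and `Dir.glob('ocr_*.tif').sort.each { |page| system(*tesseract_command(page)) }` would run OCR on each one.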
Since I had several PDF documents to process, each with many pages, the process above was still too manual, so I wrote a small Ruby script to do the work for me:
#!/usr/bin/env ruby

# Print usage and exit when no files are given.
(ARGV.length>0) || puts("usage: ./ocr.rb file1.pdf <file2.pdf> ...") || exit(0)
$basedir = Dir.pwd      # directory the script was started from
ARGV.grep(/\.pdf/i).each do |pdf|
      # Working directory name: MyDoc.pdf -> MyDoc_OCR
      dir = pdf.gsub(/\.pdf/i,'')
      dir += '_OCR'
      dir += '.dir' if(dir == pdf)
      Dir.mkdir(dir) unless(File.exist?(dir))
      Dir.chdir(dir)    # intermediate TIFF and TXT files are written here
      puts "Extracting pages from PDF: #{pdf}"
      system "gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE \"#{$basedir}/#{pdf}\""
      tiff_pages = Dir.new('.').grep(/^ocr.*\.tif$/).sort
      puts "Running tesseract OCR on pages: #{tiff_pages.join(', ')}"
      tiff_pages.each do |page|
              page_base = page.gsub(/\.tif.*/,'')
              print "#{page_base} "
              # tesseract writes its result to <page_base>.txt
              system "/usr/local/bin/tesseract #{page} #{page_base}"
      end
      puts ""
      Dir.chdir($basedir)   # back to the start directory before zipping
      ocr_pages = Dir.new(dir).grep(/^ocr.*\.txt$/).sort
      if ocr_pages && ocr_pages.length>0
              puts "Created OCR result pages: #{ocr_pages.join(', ')}"
              archive = "#{dir}.zip"
              puts "Creating archive of result pages: #{archive}"
              system "zip -r \"#{archive}\" #{ocr_pages.map{|p| "\"#{dir}/#{p}\""}.join(' ')}"
      else
              puts "No OCR result pages found"
      end
      puts ""
end

This script extracts the pages to TIFF images, runs Tesseract OCR on each page and finally builds a ZIP file of the results, named after the original PDF. So MyDoc.pdf is converted to MyDoc_OCR.zip. Intermediate TIFF and TXT files are kept in a subdirectory (MyDoc_OCR/*).
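The naming rule can be sketched on its own. The helper `ocr_names` below is hypothetical, and it assumes the `.pdf` extension is matched case-insensitively at the end of the file name:

```ruby
# Hypothetical helper (illustration only): derive the working directory
# and the archive name from a PDF file name.
def ocr_names(pdf)
  # Strip a trailing ".pdf" in any letter case, then append "_OCR".
  dir = "#{pdf.sub(/\.pdf\z/i, '')}_OCR"
  { dir: dir, archive: "#{dir}.zip" }
end
```

Under that assumption, `ocr_names('MyDoc.PDF')` gives the directory `MyDoc_OCR` and the archive `MyDoc_OCR.zip`.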

If, on the other hand, I simply did not look far enough and there are better utilities and GUI applications for this on Linux, feel free to comment.


Craig Taverner said...

One point about the script: it runs gs from the path, but runs tesseract from /usr/local/bin. This is because my Ubuntu 7.10 had tesseract version 1 on the path, and I downloaded and compiled tesseract version 2 separately and was testing both. Just edit the script manually to use a different tesseract if necessary (or just remove the /usr/local/bin/).

eduardo said...

Thanks for the script.
I needed to change the location of tesseract:

system "/usr/bin/tesseract #{page} #{page_base}"

(Ubuntu 8.10)

tsurenjena said...

i am so impressed. thanks for the code!

Washington said...

Thank you very much for your useful article. Tesseract worked very well for me to extract data from a table in an old paper.

Tony said...

Still trying this out but it looks good so far and saved me the effort of duplicating what you have done

Konrad Völkel said...

I wrote a shell script that takes a PDF and transforms it into a searchable PDF (so not only extracting the text but putting the text back into the PDF for full-text search).

I hope this helps :-)

lsiden said...

Fantastic script! Thanks!

I had to delete the first line since my ruby image is under ~/.rvm/...
and I had to replace /usr/local/bin/tesseract with just "tesseract", but it worked like a champ! Nice work.

Leonardo Cassarani said...

Is it just me or are your code samples completely unreadable? (Dark blue text over black background).

Craig Taverner said...

Indeed. I had changed the blog theme, and did not check these older blogs for any color clashes. I've brightened the color, a bit harsh now, but certainly readable.

Grant Izmirlian said...

Your script worked fine in the past, but since I upgraded to the latest version of my distro it's throwing errors

something like

ARGV() ... etc

syntax error .dir '.' unexpected

I made sure the path to ruby was right. My distro has symlinks to /etc/alternatives/ruby which ultimately point to /usr/bin/ruby2.0 but I even tried hard wiring #!/usr/bin/env ruby2.0

no luck -- I really got used to your script and am hoping you can suggest a fix

Grant Izmirlian said...

your script is throwing errors in Suse 13.1 -- I have everything installed correctly but the script says the first occurrence of '.'
in '.dir' is unexpected

Craig Taverner said...

Grant, I just tried this script for the first time on a new Ubuntu 14.04 and it worked OK. Perhaps you could post the exact errors you are getting, word for word; then it might be possible to figure out what is going wrong.