Amanzi: Linux / open source OCR batch processing from PDF

Jul 13, 2008

I recently needed to run OCR on a PDF of scanned pages and found no direct way to do it in Linux, but I did find a combination of tools that, when scripted together, did the job quite nicely. First, the job needs to be broken down into two steps:

1. Extract individual pages from PDF. None of the open source OCR software I read about or tried could run directly on PDF. The easiest way to extract from PDF is to run ghostscript and print to TIFF or PNM, for example:
```
gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE inputfile.pdf
```
2. Run OCR on individual pages. I tried ocrad and tesseract (versions 1.02 and 2.03). Ocrad supported the Swedish characters I had in my documents but otherwise had rather poor overall OCR performance. Tesseract did not support Swedish characters, but both versions were better than Ocrad, and version 2 was overall the best (and supports many other languages if you bother to train it). Training it on Swedish was more work than manually fixing the results, so I did not take that step, but was certainly tempted.
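
The two steps above can also be sketched in Ruby, assuming gs and tesseract are both on the PATH. The helper names `gs_command` and `tesseract_command` are made up for this illustration. Each returns the command as an argument array, so that file names containing spaces need no shell quoting.

```ruby
# Sketch only: gs_command and tesseract_command are hypothetical helpers.

# Step 1: render every page of the PDF to a grayscale TIFF (ocr_01.tif, ...).
def gs_command(pdf, dpi = 300)
  ['gs', "-r#{dpi}x#{dpi}", '-sDEVICE=tiffgray',
   '-sOutputFile=ocr_%02d.tif', '-dBATCH', '-dNOPAUSE', pdf]
end

# Step 2: tesseract takes an image and an output base name,
# and writes the recognised text to "<base>.txt".
def tesseract_command(page, tesseract = 'tesseract')
  [tesseract, page, page.sub(/\.tiff?\z/i, '')]
end

# Print the commands rather than running them.
puts gs_command('inputfile.pdf').join(' ')
puts tesseract_command('ocr_01.tif').join(' ')
```

Passing such an array to `system(*gs_command('inputfile.pdf'))` runs the command directly, without going through a shell.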

Since I had several PDF documents to process, and each had many pages, the above process was still too manual, so I wrote a small Ruby utility script to do the work for me:

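```
#!/usr/bin/env ruby
# Batch OCR: split each PDF into TIFF pages with ghostscript, run tesseract
# on every page, then zip the resulting text files.

(ARGV.length>0) || puts("usage: ./ocr.rb file1.pdf <file2.pdf> ...") || exit(0)
$basedir=Dir.getwd
ARGV.grep(/\.pdf/i).each do |pdf|
  # Work in a directory named after the PDF: MyDoc.pdf -> MyDoc_OCR
  # (the extension is matched case-insensitively, like the grep above)
  dir = pdf.gsub(/\.pdf/i,'')
  dir += '_OCR'
  dir += '.dir' if(dir == pdf)
  Dir.mkdir(dir) unless(File.exist?(dir))
  Dir.chdir(dir)
  puts "Extracting pages from PDF: #{pdf}"
  # One 300 dpi grayscale TIFF per page: ocr_01.tif, ocr_02.tif, ...
  # expand_path keeps absolute PDF paths intact and resolves relative ones
  system "gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE \"#{File.expand_path(pdf, $basedir)}\""
  tiff_pages = Dir.new('.').grep(/^ocr.*\.tif$/).sort
  puts "Running tesseract OCR on pages: #{tiff_pages.join(', ')}"
  tiff_pages.each do |page|
    page_base = page.gsub(/\.tif.*/,'')
    print "#{page_base} "
    # tesseract writes its output to <page_base>.txt
    system "/usr/local/bin/tesseract #{page} #{page_base}"
  end
  Dir.chdir($basedir)
  # Collect the text files tesseract produced and archive them
  ocr_pages = Dir.new(dir).grep(/^ocr.*\.txt$/).sort
  if ocr_pages && ocr_pages.length>0
    puts "Created OCR result pages: #{ocr_pages.join(', ')}"
    archive = "#{dir}.zip"
    puts "Creating archive of result pages: #{archive}"
    system "zip -r \"#{archive}\" #{ocr_pages.map{|p| "\"#{dir}/#{p}\""}.join(' ')}"
  else
    puts "No OCR result pages found"
  end
  puts ""
end
```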

This script extracts the pages to TIFF images, runs Tesseract OCR on each page and finally builds a ZIP file of the resulting text files, named after the original PDF. So MyDoc.pdf is converted to MyDoc_OCR.zip. Intermediate TIFF and TXT files are kept in a subdirectory with the same base name (MyDoc_OCR/*).
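
Two details of the script can be sketched in isolation: how the output names follow from the PDF name, and how the tesseract binary could be looked up on the PATH rather than hard-coded as /usr/local/bin/tesseract. This is only an illustration; `ocr_names` and `find_tool` are hypothetical helpers and not part of the script.

```ruby
# Sketch only: ocr_names and find_tool are hypothetical helpers.

# Derive the working directory and archive name for a PDF,
# ignoring the case of the extension.
def ocr_names(pdf)
  dir = pdf.sub(/\.pdf\z/i, '') + '_OCR'
  { dir: dir, archive: "#{dir}.zip" }
end

# Return the full path of the first executable called `name`
# found in the given PATH-style string, or nil if there is none.
def find_tool(name, path = ENV['PATH'].to_s)
  path.split(File::PATH_SEPARATOR).each do |d|
    next if d.empty?
    candidate = File.join(d, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

puts ocr_names('MyDoc.PDF')[:archive]   # prints MyDoc_OCR.zip
puts find_tool('tesseract') || 'tesseract not found on PATH'
```

With a lookup like this, the script would use whichever tesseract comes first on the PATH, which may or may not be the version you want when several are installed.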

If, on the other hand, I simply did not look far enough and there are better utilities and GUI applications for this on Linux, feel free to comment.

12 comments:

Craig Taverner said...: One point about the script: it runs the gs found on the path, but runs tesseract from /usr/local/bin. This is because my Ubuntu 7.10 had tesseract version 1 on the path, and I downloaded and compiled tesseract version 2 separately, and was testing both. Just edit the script manually to use a different tesseract if necessary (or just remove the /usr/local/bin/).; 14 July, 2008 11:07
Unknown said...: Thanks for the script.
I needed to change the location of tesseract:

system "/usr/bin/tesseract #{page} #{page_base}"

(Ubuntu 8.10); 17 April, 2009 02:49
tsuren said...: I am so impressed. Thanks for the code!; 29 May, 2009 22:45
Javier said...: Thank you very much for your useful article. Tesseract worked very well for me to extract data from a table in an old paper.; 16 July, 2009 22:15
Unknown said...: Thanks!
Still trying this out but it looks good so far and saved me the effort of duplicating what you have done; 26 November, 2009 21:05
Konrad Voelkel said...: I wrote a shell script that takes a PDF and transforms it into a searchable PDF (so not only extracting the text but putting the text back into the PDF for full-text search).

I hope this helps :-); 25 January, 2010 03:17
Lawrence Siden said...: Fantastic script! Thanks!

I had to delete the first line since my ruby image is under ~/.rvm/...
and I had to replace /usr/local/bin/tesseract with just "tesseract", but it worked like a champ! Nice work.; 15 November, 2011 16:27
Leonardo Cassarani said...: Is it just me or are your code samples completely unreadable? (Dark blue text over black background).; 04 January, 2012 02:34
Craig Taverner said...: Indeed. I had changed the blog theme, and did not check these older blogs for any color clashes. I've brightened the color, a bit harsh now, but certainly readable.; 04 January, 2012 09:22
Unknown said...: Your script had worked fine in the past, but since I upgraded to the latest version of my distro it's throwing errors

something like

ARGV() ... etc

syntax error .dir '.' unexpected

I made sure the path to ruby was right. My distro has symlinks to /etc/alternatives/ruby which ultimately point to /usr/bin/ruby2.0 but I even tried hard wiring #!/usr/bin/env ruby2.0

no luck -- I really got used to your script and am hoping you can suggest a fix; 06 March, 2014 00:11
Unknown said...: your script is throwing errors in Suse 13.1 -- I have everything installed correctly but the script says the first occurrence of '.'
in '.dir' is unexpected; 06 March, 2014 00:12
Craig Taverner said...: Grant, I just tried this script for the first time on a new Ubuntu 14.04 and it worked OK. Perhaps you could post the exact errors you are getting, word for word; then it might be possible to figure it out.; 21 April, 2014 15:08
