Amanzi: Linux / open source OCR batch processing from PDF

Jul 13, 2008

I recently needed to run OCR on a PDF of scanned pages and found no direct way to do it in Linux, but I did find a combination of tools that, when scripted together, did the job quite nicely. First, the job needs to be broken down into two steps:

1. Extract individual pages from the PDF. None of the open source OCR software I read about or tried could run directly on a PDF. The easiest way to extract the pages is to run Ghostscript and print to TIFF or PNM, for example:
```
gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE inputfile.pdf
```
2. Run OCR on the individual pages. I tried Ocrad and Tesseract (versions 1.02 and 2.03). Ocrad supported the Swedish characters in my documents but otherwise had rather poor OCR accuracy. Tesseract did not support Swedish characters, but both versions were more accurate than Ocrad, and version 2 was the best overall (it also supports many other languages if you take the trouble to train it). Training it on Swedish would have been more work than fixing the results by hand, so I did not take that step, though I was certainly tempted.
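The two steps can be sketched in Ruby as a pair of helpers that build each command as an argument array. This is only an illustration: the helper names `gs_command` and `tesseract_command` are made up for this sketch, and it assumes that `gs` and `tesseract` are on the path. Building arrays rather than one command string is a choice of the sketch, since `system` then runs the program without a shell and filenames containing spaces need no quoting.

```ruby
# Hypothetical helpers (the names are not from any library): build the two
# commands as argument arrays so that system() can run them without a shell.

# Step 1: render every page of a PDF to a 300 dpi grayscale TIFF.
# Ghostscript replaces %02d in the output pattern with the page number.
def gs_command(pdf_path, pattern = 'ocr_%02d.tif')
  ['gs', '-r300x300', '-sDEVICE=tiffgray', "-sOutputFile=#{pattern}",
   '-dBATCH', '-dNOPAUSE', pdf_path]
end

# Step 2: run tesseract on one page image. The second argument to tesseract
# is the output base name; tesseract writes its text to <output_base>.txt.
def tesseract_command(tiff_path, tesseract = 'tesseract')
  output_base = tiff_path.sub(/\.tiff?\z/i, '')
  [tesseract, tiff_path, output_base]
end
```

To run a command, pass the array to `system` with a splat, for example `system(*gs_command('inputfile.pdf'))`, then call `system(*tesseract_command(page))` for each TIFF that Ghostscript produced.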

Since I had several PDF documents to process, and each had many pages, the above process was still too manual, so I wrote a small Ruby script to do the work for me:

```ruby
#!/usr/bin/env ruby

# Print usage and exit when no arguments are given (puts returns nil, so exit runs).
(ARGV.length>0) || puts("usage: ./ocr.rb file1.pdf <file2.pdf> ...") || exit(0)
$basedir=Dir.getwd
# Only process arguments that end in .pdf (any letter case).
ARGV.grep(/\.pdf$/i).each do |pdf|
  # Strip the extension case-insensitively, so MyDoc.PDF also becomes MyDoc_OCR.
  dir = pdf.sub(/\.pdf$/i,'')
  dir += '_OCR'
  dir += '.dir' if(dir == pdf)
  Dir.mkdir(dir) unless(File.exist?(dir))
  Dir.chdir(dir)
  puts "Extracting pages from PDF: #{pdf}"
  # Render each page to a 300 dpi grayscale TIFF; expand_path also handles absolute PDF paths.
  system "gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE \"#{File.expand_path(pdf, $basedir)}\""
  tiff_pages = Dir.new('.').grep(/^ocr.*\.tif$/).sort
  puts "Running tesseract OCR on pages: #{tiff_pages.join(', ')}"
  tiff_pages.each do |page|
    # tesseract writes its result to <page_base>.txt
    page_base = page.gsub(/\.tif.*/,'')
    print "#{page_base} "
    system "/usr/local/bin/tesseract #{page} #{page_base}"
  end
  Dir.chdir($basedir)
  # Collect the text files and zip them up next to the original PDF.
  ocr_pages = Dir.new(dir).grep(/^ocr.*\.txt$/).sort
  if ocr_pages && ocr_pages.length>0
    puts "Created OCR result pages: #{ocr_pages.join(', ')}"
    archive = "#{dir}.zip"
    puts "Creating archive of result pages: #{archive}"
    system "zip -r \"#{archive}\" #{ocr_pages.map{|p| "\"#{dir}/#{p}\""}.join(' ')}"
  else
    puts "No OCR result pages found"
  end
  puts ""
end
```

This script extracts the pages to TIFF, runs Tesseract OCR on each page, and finally builds a ZIP file of the results with a filename similar to that of the original PDF, so MyDoc.pdf is converted to MyDoc_OCR.zip. Intermediate TIFF and TXT files are kept in a subdirectory (MyDoc_OCR/).
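The naming rule can be sketched as a small function. The name `ocr_names` is hypothetical, and stripping the extension regardless of letter case is an assumption of this sketch rather than a description of what every version of the script does.

```ruby
# Hypothetical helper: derive the working directory and the archive name
# for a given PDF, following the MyDoc.pdf -> MyDoc_OCR.zip convention.
def ocr_names(pdf)
  base = pdf.sub(/\.pdf\z/i, '')   # drop the extension, any letter case
  dir = "#{base}_OCR"              # intermediate TIFF and TXT files go here
  { dir: dir, archive: "#{dir}.zip" }
end

names = ocr_names('MyDoc.pdf')
puts names[:dir]      # MyDoc_OCR
puts names[:archive]  # MyDoc_OCR.zip
```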

If, on the other hand, I simply did not look far enough and there are better utilities and GUI applications for this on Linux, feel free to comment.

12 comments:

Craig Taverner, 14 July, 2008 11:07
One point about the script: it runs gs from the path but runs tesseract from /usr/local/bin. This is because my Ubuntu 7.10 had tesseract version 1 on the path, and I downloaded and compiled tesseract version 2 separately and was testing both. Just edit the script to use a different tesseract if necessary (or simply remove the /usr/local/bin/).
Unknown, 17 April, 2009 02:49
Thanks for the script.
I needed to change the location of tesseract:

system "/usr/bin/tesseract #{page} #{page_base}"

(Ubuntu 8.10)
tsuren, 29 May, 2009 22:45
I am so impressed. Thanks for the code!
Javier, 16 July, 2009 22:15
Thank you very much for your useful article. Tesseract worked very well for me to extract data from a table in an old paper.
Unknown, 26 November, 2009 21:05
Thanks!
Still trying this out, but it looks good so far and saved me the effort of duplicating what you have done.
Konrad Voelkel, 25 January, 2010 03:17
I wrote a shell script that takes a PDF and transforms it into a searchable PDF (so not only extracting the text but putting the text back into the PDF for full-text search).

I hope this helps :-)
Lawrence Siden, 15 November, 2011 16:27
Fantastic script! Thanks!

I had to delete the first line since my ruby image is under ~/.rvm/...
and I had to replace /usr/local/bin/tesseract with just "tesseract", but it worked like a champ! Nice work.
Leonardo Cassarani, 04 January, 2012 02:34
Is it just me or are your code samples completely unreadable? (Dark blue text over black background).
Craig Taverner, 04 January, 2012 09:22
Indeed. I had changed the blog theme and did not check these older posts for any color clashes. I've brightened the color; it is a bit harsh now, but certainly readable.
Unknown, 06 March, 2014 00:11
Your script had worked fine in the past, but since I upgraded to the latest version of my distro it's throwing errors

something like

ARGV() ... etc

syntax error .dir '.' unexpected

I made sure the path to ruby was right. My distro has symlinks to /etc/alternatives/ruby which ultimately point to /usr/bin/ruby2.0 but I even tried hard wiring #!/usr/bin/env ruby2.0

no luck -- I really got used to your script and am hoping you can suggest a fix
Unknown, 06 March, 2014 00:12
Your script is throwing errors in Suse 13.1 -- I have everything installed correctly, but the script says the first occurrence of '.' in '.dir' is unexpected
Craig Taverner, 21 April, 2014 15:08
Grant, I just tried this script for the first time on a new Ubuntu 14.04 and it worked OK. Perhaps you could post the exact errors you are getting, word for word; then it might be possible to figure out what is wrong.