- Extract individual pages from PDF. None of the open source OCR software I read about or tried could run directly on PDF. The easiest way to extract from PDF is to run ghostscript and print to TIFF or PNM, for example:
gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE inputfile.pdf
- Run OCR on individual pages. I tried ocrad and tesseract (versions 1.02 and 2.03). Ocrad supported the Swedish characters I had in my documents but otherwise had rather poor overall OCR performance. Tesseract did not support Swedish characters, but both versions were better than Ocrad, and version 2 was overall the best (and supports many other languages if you bother to train it). Training it on Swedish was more work than manually fixing the results, so I did not take that step, but was certainly tempted.
#!/usr/bin/env ruby
# Usage: ./ocr.rb file1.pdf <file2.pdf> ...
(ARGV.length>0) || puts("usage: ./ocr.rb file1.pdf <file2.pdf> ...") || exit(0)
$basedir=Dir.getwd
ARGV.grep(/\.pdf/i).each do |pdf|
  # Work in a per-document directory named after the PDF
  dir = pdf.gsub(/\.pdf/,'')
  dir += '_OCR'
  dir += '.dir' if(dir == pdf)
  Dir.mkdir(dir) unless(File.exist?(dir))
  Dir.chdir(dir)
  # Render each page of the PDF to a 300 dpi grayscale TIFF
  puts "Extracting pages from PDF: #{pdf}"
  system "gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE \"#{$basedir}/#{pdf}\""
  tiff_pages = Dir.new('.').grep(/^ocr.*\.tif$/).sort
  # OCR every page; tesseract writes <page_base>.txt
  puts "Running tesseract OCR on pages: #{tiff_pages.join(', ')}"
  tiff_pages.each do |page|
    page_base = page.gsub(/\.tif.*/,'')
    print "#{page_base} "
    system "/usr/local/bin/tesseract #{page} #{page_base}"
  end
  Dir.chdir($basedir)
  # Collect the text results into a ZIP archive next to the PDF
  ocr_pages = Dir.new(dir).grep(/^ocr.*\.txt$/).sort
  if ocr_pages && ocr_pages.length>0
    puts "Created OCR result pages: #{ocr_pages.join(', ')}"
    archive = "#{dir}.zip"
    puts "Creating archive of result pages: #{archive}"
    system "zip -r \"#{archive}\" #{ocr_pages.map{|p| "\"#{dir}/#{p}\""}.join(' ')}"
  else
    puts "No OCR result pages found"
  end
  puts ""
end
This script extracts the pages to TIFF, runs Tesseract OCR on each page, and finally builds a ZIP file of the results with a filename derived from the original PDF, so MyDoc.pdf is converted to MyDoc_OCR.zip. Intermediate TIFF and TXT files are kept in a subdirectory (MyDoc_OCR/*).
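One detail of the naming: the script selects input files case-insensitively but strips only a lowercase ".pdf", so a file named MyDoc.PDF would keep its extension in the directory and archive names. Below is a minimal sketch of a case-insensitive variant. The helper name ocr_names is hypothetical and not part of the script above.

```ruby
# Hypothetical helper (not in the original script): derive the working
# directory and archive names, stripping a trailing ".pdf" in any case.
def ocr_names(pdf)
  dir = pdf.sub(/\.pdf\z/i, '') + '_OCR'
  { dir: dir, archive: "#{dir}.zip" }
end
```

With this, both "MyDoc.pdf" and "MyDoc.PDF" would map to the directory MyDoc_OCR and the archive MyDoc_OCR.zip.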
If, on the other hand, I simply did not look far enough and there are better utilities and GUI applications for this on Linux, feel free to comment.
One point about the script: it runs gs from the path, but runs tesseract from /usr/local/bin. This is because my Ubuntu 7.10 had tesseract version 1 on the path, and I downloaded and compiled tesseract version 2 separately and was testing both. Just edit the script manually to use a different tesseract if necessary (or simply remove the /usr/local/bin/).
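Instead of editing the path by hand, the script could look for tesseract itself. This is only a sketch under my own assumptions: the helper names find_executable, tesseract_path and run_tesseract are hypothetical, and preferring /usr/local/bin over the path simply mirrors the setup described above.

```ruby
# Hypothetical helper: return the first executable called `name` found in
# one of `dirs` (defaults to the directories on PATH), or nil if none.
def find_executable(name, dirs = ENV.fetch('PATH', '').split(File::PATH_SEPARATOR))
  dirs.each do |d|
    candidate = File.join(d, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Prefer a locally compiled copy, then fall back to whatever is on the PATH.
def tesseract_path
  find_executable('tesseract', ['/usr/local/bin']) || find_executable('tesseract')
end

# Run OCR on one page. Passing the arguments separately to system skips
# the shell, so file names need no quoting.
def run_tesseract(page, page_base)
  exe = tesseract_path || abort('tesseract not found')
  system(exe, page, page_base)
end
```

In the page loop, the hard-coded system call would then become run_tesseract(page, page_base).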
Thanks for the script.
I needed to change the location of tesseract:
system "/usr/bin/tesseract #{page} #{page_base}"
(Ubuntu 8.10)
i am so impressed. thanks for the code!
Thank you very much for your useful article. Tesseract worked very well for me to extract data from a table in an old paper.
Thanks!
Still trying this out, but it looks good so far and saved me the effort of duplicating what you have done.
I wrote a shell script that takes a PDF and transforms it into a searchable PDF (so not only extracting the text but putting the text back into the PDF for full-text search).
I hope this helps :-)
Fantastic script! Thanks!
I had to delete the first line since my ruby image is under ~/.rvm/...
and I had to replace /usr/local/bin/tesseract with just "tesseract", but it worked like a champ! Nice work.
Is it just me or are your code samples completely unreadable? (Dark blue text over black background).
Indeed. I had changed the blog theme, and did not check these older posts for any color clashes. I've brightened the color; it is a bit harsh now, but certainly readable.
your script had worked fine in the past, but since I upgraded to the latest version of my distro it's throwing errors
something like
ARGV() ... etc
syntax error .dir '.' unexpected
I made sure the path to ruby was right. My distro has symlinks to /etc/alternatives/ruby which ultimately point to /usr/bin/ruby2.0 but I even tried hard wiring #!/usr/bin/env ruby2.0
no luck -- I really got used to your script and am hoping you can suggest a fix
your script is throwing errors in Suse 13.1 -- I have everything installed correctly, but the script says the first occurrence of '.' in '.dir' is unexpected
Grant, I just tried this script for the first time on a new Ubuntu 14.04 and it worked OK. Perhaps you could post the exact errors you are getting, word for word; then it might be possible to figure out what is wrong.