- Extract individual pages from the PDF. None of the open source OCR software I read about or tried could run directly on a PDF. The easiest way to extract the pages is to run Ghostscript and render them to TIFF or PNM, for example:
gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE inputfile.pdf
- Run OCR on the individual pages. I tried Ocrad and Tesseract (versions 1.02 and 2.03). Ocrad supported the Swedish characters in my documents but otherwise had rather poor OCR performance. Tesseract did not support Swedish characters, but both versions were better than Ocrad, and version 2 was the best overall (and supports many other languages if you bother to train it). Training it on Swedish would have been more work than manually fixing the results, so I did not take that step, though I was certainly tempted.
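The two steps above can also be written as argument lists instead of shell strings, which avoids quoting problems with file names that contain spaces. This is only a sketch: the helper names `gs_command` and `tesseract_command` are made up for illustration, and the flags are the same ones used in the Ghostscript command shown above.

```ruby
# Build the Ghostscript argument list that renders each PDF page to a
# grayscale TIFF (same flags as the command shown above).
# Hypothetical helper, not part of Ghostscript or of the original script.
def gs_command(pdf_path, resolution: 300, pattern: 'ocr_%02d.tif')
  ['gs',
   "-r#{resolution}x#{resolution}",
   '-sDEVICE=tiffgray',
   "-sOutputFile=#{pattern}",
   '-dBATCH', '-dNOPAUSE',
   pdf_path]
end

# Build the Tesseract argument list for one page image.
# Tesseract appends ".txt" to the output base name itself.
def tesseract_command(page, tesseract: 'tesseract')
  [tesseract, page, page.sub(/\.tif\z/, '')]
end
```

Passing such a list as `system(*gs_command('inputfile.pdf'))` hands the arguments to the program without going through a shell, so no escaping is needed.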
#!/usr/bin/env ruby
(ARGV.length>0) || puts("usage: ./ocr.rb file1.pdf <file2.pdf> ...") || exit(0)
$basedir=Dir.getwd
ARGV.grep(/\.pdf/i).each do |pdf|
  dir = pdf.gsub(/\.pdf/,'')
  dir += '_OCR'
  dir += '.dir' if(dir == pdf)
  Dir.mkdir(dir) unless(File.exist?(dir))
  Dir.chdir(dir)
  puts "Extracting pages from PDF: #{pdf}"
  system "gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE \"#{$basedir}/#{pdf}\""
  tiff_pages = Dir.new('.').grep(/^ocr.*\.tif$/).sort
  puts "Running tesseract OCR on pages: #{tiff_pages.join(', ')}"
  tiff_pages.each do |page|
    page_base = page.gsub(/\.tif.*/,'')
    print "#{page_base} "
    system "/usr/local/bin/tesseract #{page} #{page_base}"
  end
  Dir.chdir($basedir)
  ocr_pages = Dir.new(dir).grep(/^ocr.*\.txt$/).sort
  if ocr_pages && ocr_pages.length>0
    puts "Created OCR result pages: #{ocr_pages.join(', ')}"
    archive = "#{dir}.zip"
    puts "Creating archive of result pages: #{archive}"
    system "zip -r \"#{archive}\" #{ocr_pages.map{|p| "\"#{dir}/#{p}\""}.join(' ')}"
  else
    puts "No OCR result pages found"
  end
  puts ""
end
This script extracts the pages to TIFF images, runs Tesseract OCR on each page, and finally builds a ZIP file of the results with a filename similar to the original PDF. So MyDoc.pdf is converted to MyDoc_OCR.zip. The intermediate TIFF and TXT files are kept in a subdirectory (MyDoc_OCR/*).
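The naming rule can be sketched as a small helper. `output_names` is a hypothetical name, not something in the script, and it differs from the script in one respect: the script's own substitution is case-sensitive, while this sketch assumes you want `.PDF` and `.pdf` treated alike.

```ruby
# Derive the working directory and the archive name for a given PDF.
# Hypothetical helper for illustration. The /i flag strips ".pdf" in any
# letter case, which the original script's gsub does not do.
def output_names(pdf)
  dir = "#{pdf.sub(/\.pdf\z/i, '')}_OCR"
  { dir: dir, archive: "#{dir}.zip" }
end
```

For example, `output_names('MyDoc.pdf')` gives the directory `MyDoc_OCR` and the archive `MyDoc_OCR.zip`.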
If, on the other hand, I simply did not look far enough and there are better utilities and GUI applications for this on Linux, feel free to comment.
12 comments:
One point about the script: it runs whichever gs is on the path, but runs tesseract from /usr/local/bin. This is because my Ubuntu 7.10 had tesseract version 1 on the path, and I downloaded and compiled tesseract version 2 separately and was testing both. Just edit the script to use a different tesseract if necessary (or simply remove the /usr/local/bin/).
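Rather than editing the path by hand, the script could look the binary up itself. The following is a minimal sketch, assuming you want a locally compiled build in /usr/local/bin to win when it exists; `find_executable` and `tesseract_binary` are names invented here, not part of the script.

```ruby
# Return the full path of the first executable called `name` found in a
# PATH-style string, or nil if there is none.
# Hypothetical helper, written for illustration only.
def find_executable(name, path = ENV.fetch('PATH', ''))
  path.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Prefer the separately compiled build if present, otherwise fall back
# to whatever tesseract is on the PATH (nil if none is installed).
def tesseract_binary(preferred = '/usr/local/bin/tesseract')
  return preferred if File.file?(preferred) && File.executable?(preferred)
  find_executable('tesseract')
end
```

The result of `tesseract_binary` could then replace the hard-coded /usr/local/bin/tesseract in the `system` call.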
Thanks for the script.
I needed to change the location of tesseract:
system "/usr/bin/tesseract #{page} #{page_base}"
(Ubuntu 8.10)
i am so impressed. thanks for the code!
Thank you very much for your useful article. Tesseract worked very well for me to extract data from a table in an old paper.
Thanks!
Still trying this out but it looks good so far and saved me the effort of duplicating what you have done
I wrote a shell script that takes a PDF and transforms it into a searchable PDF (so not only extracting the text but putting the text back into the PDF for full-text search).
I hope this helps :-)
Fantastic script! Thanks!
I had to delete the first line since my ruby image is under ~/.rvm/...
and I had to replace /usr/local/bin/tesseract with just "tesseract", but it worked like a champ! Nice work.
Is it just me or are your code samples completely unreadable? (Dark blue text over black background).
Indeed. I had changed the blog theme, and did not check these older blogs for any color clashes. I've brightened the color, a bit harsh now, but certainly readable.
your script had worked fine in the past but since I upgraded to the latest version of my distro it's throwing errors
something like
ARGV() ... etc
syntax error .dir '.' unexpected
I made sure the path to ruby was right. My distro has symlinks to /etc/alternatives/ruby which ultimately point to /usr/bin/ruby2.0 but I even tried hard wiring #!/usr/bin/env ruby2.0
no luck -- I really got used to your script and am hoping you can suggest a fix
your script is throwing errors in Suse 13.1 -- I have everything installed correctly but the script says the first occurrence of '.'
in '.dir' is unexpected
Grant, I just tried this script for the first time on a new Ubuntu 14.04 and it worked OK. Perhaps you could post the exact errors you are getting, word for word; then it might be possible to figure out what is wrong.