Sep 3, 2008

Install or upgrade to ubuntu 8.04 on linux with no media

I had a problem: my ubuntu server was too old for automatic updates, the CD-ROM drive was broken, and I'm allergic to floppies. A quick internet search led to three options:
  • instlux, a nice graphical installer to run under windows
  • UNetbootin, a really nice graphical installer for linux and windows
  • A grub trick for booting the installer from grub under windows described in detail for any linux, and in less detail but for ubuntu.
The first option was no good, because it only ran on windows. The second looked really neat and easy, and is probably the best, but being the geek that I am, I wanted to try the ideas in the third option, but 'translated' to work on my old linux (in my case ubuntu 5.04). It turned out to be pretty easy. Here are the steps I used:
  • I opened my downloaded ubuntu 8.04.1 server ISO image in archive manager and extracted the 'install' directory to /boot/install on my old computer. I did this from another 8.04 desktop, but could just as easily have done it on the old computer itself.
  • I edited /boot/grub/menu.lst, adding the following lines at the bottom:
    title           Ubuntu 8.04.1 Installer (hd0,0)
    kernel (hd0,0)/boot/install/netboot/ubuntu-installer/i386/linux vga=normal ramdisk_size=14972 root=/dev/rd/0 rw --
    initrd (hd0,0)/boot/install/netboot/ubuntu-installer/i386/initrd.gz
    (I actually first tried the vmlinuz and initrd.gz I found in the installer directory, but that insisted on a CD, and I did not want to try faking that with a raw partition, so I changed to the netboot option in the text above.)
  • I also commented out 'hiddenmenu' and added 'timeout 10' to menu.lst so that I would actually get to see the menu choice when I rebooted.
  • Finally I rebooted and chose the new installer, and, after answering a bunch of questions, voilà, I had a new Ubuntu 8.04.1 server!
These instructions assume decently fast internet access, since everything installed is downloaded. If you have the CD already (as I had), and no internet, or slow internet, you can also copy the CD to a local hard-drive partition and install from there. That was too much trouble for me, so I did not try it :-)
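
For anyone repeating this on a different disk layout, here is a minimal Ruby sketch that prints the menu.lst stanza from the steps above. The helper name installer_stanza and its parameters are my own invention for illustration; the defaults simply reproduce the entry shown earlier, and the GRUB root and install directory are assumptions you would need to adjust.

```ruby
# Build a GRUB legacy (menu.lst) stanza for booting the netboot installer.
# The defaults mirror the entry used in the post; they are assumptions
# about the disk layout, not values that suit every machine.
def installer_stanza(root: '(hd0,0)',
                     dir: '/boot/install/netboot/ubuntu-installer/i386',
                     title: 'Ubuntu 8.04.1 Installer')
  kernel_args = 'vga=normal ramdisk_size=14972 root=/dev/rd/0 rw --'
  [
    "title           #{title} #{root}",
    "kernel #{root}#{dir}/linux #{kernel_args}",
    "initrd #{root}#{dir}/initrd.gz"
  ].join("\n")
end

# Print the stanza so it can be pasted at the bottom of /boot/grub/menu.lst.
puts installer_stanza
```

Calling installer_stanza(root: '(hd0,1)') would produce the same stanza for a /boot that lives on the second partition of the first disk.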

Jul 13, 2008

Linux / open source OCR batch processing from PDF

I recently needed to run OCR on a PDF of scanned pages, and found no direct way to do it in Linux, but did find a suitable combination of tools that when scripted together did the job quite nicely. Firstly the job needs to be broken down into two steps:
  1. Extract individual pages from PDF. None of the open source OCR software I read about or tried could run directly on PDF. The easiest way to extract from PDF is to run ghostscript and print to TIFF or PNM, for example:
    gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE inputfile.pdf
  2. Run OCR on individual pages. I tried ocrad and tesseract (versions 1.02 and 2.03). Ocrad supported the Swedish characters I had in my documents but otherwise had rather poor overall OCR performance. Tesseract did not support Swedish characters, but both versions were better than Ocrad, and version 2 was overall the best (and supports many other languages if you bother to train it). Training it on Swedish was more work than manually fixing the results, so I did not take that step, but was certainly tempted.
Since I had several PDF documents to process, and each had many pages, the above process was still too manual, so I wrote a utility Ruby script to do the work for me:
#!/usr/bin/env ruby
# Batch OCR: extract each PDF page to TIFF with ghostscript, run tesseract
# on every page, then zip the resulting text files.

(ARGV.length>0) || puts("usage: ./ocr.rb file1.pdf <file2.pdf> ...") || exit(0)
$basedir = Dir.pwd
ARGV.grep(/\.pdf/i).each do |pdf|
  dir = pdf.gsub(/\.pdf/i,'')
  dir += '_OCR'
  dir += '.dir' if(dir == pdf)
  Dir.mkdir(dir) unless(File.exist?(dir))
  Dir.chdir(dir)        # work inside the per-document directory
  puts "Extracting pages from PDF: #{pdf}"
  system "gs -r300x300 -sDEVICE=tiffgray -sOutputFile=ocr_%02d.tif -dBATCH -dNOPAUSE \"#{$basedir}/#{pdf}\""
  tiff_pages = Dir.entries('.').grep(/^ocr.*\.tif$/).sort
  puts "Running tesseract OCR on pages: #{tiff_pages.join(', ')}"
  tiff_pages.each do |page|
    page_base = page.gsub(/\.tif.*/,'')
    print "#{page_base} "
    system "/usr/local/bin/tesseract #{page} #{page_base}"
  end
  ocr_pages = Dir.entries('.').grep(/^ocr.*\.txt$/).sort
  Dir.chdir($basedir)   # back to the start directory so the paths below resolve
  if ocr_pages && ocr_pages.length>0
    puts "Created OCR result pages: #{ocr_pages.join(', ')}"
    archive = "#{dir}.zip"
    puts "Creating archive of result pages: #{archive}"
    system "zip -r \"#{archive}\" #{ocr_pages.map{|p| "\"#{dir}/#{p}\""}.join(' ')}"
  else
    puts "No OCR result pages found"
  end
  puts ""
end

This script will extract the images to TIFF, run the Tesseract OCR on each page and finally build a ZIP file of the result with a filename similar to the original PDF. So MyDoc.pdf is converted to MyDoc_OCR.zip. Intermediate TIFF and TXT files are kept in a subdirectory (MyDoc_OCR/*).
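
One weakness of the script is the hand-written shell quoting around file names. A possible alternative, sketched below, is to build each command as an argument array and hand it to Kernel#system, which then bypasses the shell entirely. The helper names gs_command and tesseract_command are hypothetical, and the sketch assumes gs and tesseract are on the PATH rather than at the /usr/local/bin location used in the script.

```ruby
# Build argument lists for the two external tools used by the script.
# Passing an array to Kernel#system skips the shell, so file names with
# spaces or quotes need no manual escaping.
def gs_command(pdf_path, dpi = 300)
  ['gs', "-r#{dpi}x#{dpi}", '-sDEVICE=tiffgray', '-sOutputFile=ocr_%02d.tif',
   '-dBATCH', '-dNOPAUSE', pdf_path]
end

# The second tesseract argument is the output base name (page name without
# its extension), matching what the script passes.
def tesseract_command(tiff_page)
  base = File.basename(tiff_page, File.extname(tiff_page))
  ['tesseract', tiff_page, base]
end

# Show the commands without running them (running needs both tools installed):
p gs_command('My Doc.pdf')
p tesseract_command('ocr_01.tif')
```

In the script itself this would be used as system(*gs_command(path)) and system(*tesseract_command(page)).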

If, on the other hand, I simply did not look far enough and there are better utilities and GUI applications for this on Linux, feel free to comment.

Apr 11, 2008

Bongi's voice - and the size of the Kruger Park

My brother's blog 'other-things-amanzi', which is hugely more popular than mine (for obvious reasons if you read it), just got a boost when he was invited to be interviewed. Sid Schwab of surgeonsblog fame was a guest host, which worked well as he and 'bongi' got chatting about everything from the pleasures of living near the Kruger National Park, to regional issues facing surgeons, like the bad treatment sometimes delivered by the local witchdoctors or 'sangomas'.

Sid even mentioned my blog, but I'm guessing the ultra-geek content stopped him in his tracks :-). Of my blog, Bongi said: 'I don't understand a single word of it!' Perhaps this post will score better?

One thing Bongi said, which I fear is misinformation I might be responsible for, was that the Kruger Park is the size of England. I used to claim that myself, but recently decided to double check my facts and found I was WRONG! A quick google search reveals several sites claiming it is the size of Wales, and one claiming it is bigger than Ireland:
(Size: The Kruger Park is huge. It stretches for 350km (217 miles) from north to south and averages 60 kilometres in width which makes it bigger than Ireland. Most of the park is fenced so it is a self contained ecosystem.)

So, I stand corrected, and decided to investigate and figure out what is really going on. How does the park compare to England, Ireland and Wales? Right now these are the facts I could find:
Kruger Park:
Area: 18 989 km2
Length: 350 km
England:
Area: 130 395 km2
Length: 580 km
Wales:
Area: 20 779 km2
Length: 215 km
Ireland:
Area: 84 412 km2
Length: 360 km
Skåne:
Area: 10 939 km2
Length: 115 km
Great Limpopo Transfrontier Park:
Area: 35 000 km2 - 99 800 km2 (planned expansion)
(England, Ireland, Wales lengths were roughly north-south measured by me on google earth 'ruler'. Kruger length is from wikipedia.)

So England is about 7 times the area, and about 65% longer. So the Kruger is comparable in terms of length (60% the length of England), but not by area, since the Park is so narrow. The website that claimed the park was bigger than Ireland is wrong. It's a bit shorter, and less than a quarter the area. It is, however 60% longer than Wales and nearly the same area, so that is the best match. If the full-size transfrontier park materializes, it will close in on the size of England, which is really impressive.

It is especially interesting to me that the Kruger Park is about three times the length and nearly twice the area of Skåne, the province in Sweden in which I live.
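
The comparisons above are easy to re-check with a few lines of Ruby. This is only a sketch of the arithmetic: it assumes the figures in the list belong to England, Wales, Ireland and Skåne in that order, and reads the Wales area as 20 779 km2. The constant REGIONS and the helper relative_to_kruger are names made up for this example.

```ruby
# Areas (km2) and rough north-south lengths (km) as listed in the post.
REGIONS = {
  'Kruger Park' => { area: 18_989,  length: 350 },
  'England'     => { area: 130_395, length: 580 },
  'Wales'       => { area: 20_779,  length: 215 },
  'Ireland'     => { area: 84_412,  length: 360 },
  'Skåne'       => { area: 10_939,  length: 115 }
}

# Ratio of a region's area or length to the Kruger Park's, to two decimals.
def relative_to_kruger(name, key)
  (REGIONS[name][key].to_f / REGIONS['Kruger Park'][key]).round(2)
end

REGIONS.each_key do |name|
  next if name == 'Kruger Park'
  puts "#{name}: area x#{relative_to_kruger(name, :area)}, " \
       "length x#{relative_to_kruger(name, :length)} relative to Kruger"
end
```

A value above 1 means the region is larger or longer than the park, below 1 means smaller or shorter.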

Apr 4, 2008

Hardy Heron Beta (and release)

I've had a generally good time with the Ubuntu 8.04 Hardy Heron Beta since it was released in late March, and have installed it on two different machines. I especially enjoyed getting compiz to work for the first time (possibly due to a new machine with better hardware, not something related to Hardy in particular). However, I have had a few issues I thought worth mentioning here:
  1. admin-users does not work
    I reported this as a bug to ubuntu. Basically the problem is that no groups or users added with GNOME's user administration tool actually get added, and in one case the tool crashed. I've had to add users and groups on the command line with tools like 'addgroup' and 'adduser'.
  2. eclipse crashes silently
    This happened several times before I thought to run it in the console and catch the error, which turned out to be 'java.lang.OutOfMemoryError: PermGen space'. Normally eclipse reports this to the user in a dialog, and I do not know why that was not done, but the solution is the same, add '-vmargs -Xmx1280m -XX:MaxPermSize=1024m' or similar to the eclipse launcher. I noticed the error first happen after switching from the Ruby to the Java perspective, and the virtual memory requirements of eclipse jumped from 0.5GB to 1.2GB. Amazing. (update: the symptoms returned on a new java6 update, and the new fix was to add -XX:CompileCommand=exclude,org/eclipse/core/internal/dtree/DataTreeNode,forwardDeltaWith to vmargs - see comments below for more details)
  3. unable to set HumanList theme for login window
    Once a change to the login window settings was made, logging out waited indefinitely (or in one case just about 30 minutes), before showing the login screen. No errors in the X log or any other log. Nasty. I had to kill gdm and hand edit gdm.conf-custom to remove the theme line.
  4. sudo fails with: unable to resolve host
    I found a lot of discussion about this on the internet, but in most cases it was due to people foolishly changing their hosts files. While sudo should not be sensitive to something like that, it was not my situation. I simply ran the usual daily upgrade with the list of updates for Hardy, and after the reboot this issue happened. With a lot of investigation I found how to 'fix' it, by getting /etc/hosts and /etc/hostname to have the same entry. Interestingly enough they do not have the same entry if you enter the 'obvious' values in the network manager applet for host and domain. For example, putting 'foo' and '' as host and domain will put 'foo' into /etc/hostname and '' on the line in /etc/hosts. My sudo continued to work for a week before my reboot because my DNS settings, and my 'search' list in particular allowed 'foo' to be resolved even without the '', but after the reboot it failed due to the additional issue below:
  5. static eth0 failed on reboot
    Strangely enough eth0 appeared on the ifconfig output, but with no IP address, while my network configuration in the network applet, as well as in /etc/network/interfaces, looked just fine. I needed to add 'auto eth0' to the config file to get it to work correctly. I vaguely remember seeing this before on much older ubuntu versions, but have not seen it for a while, so it was quite a surprise. This issue caused the sudo issue above to appear suddenly after a reboot.
The sudo and eth0 issues were tricky to deal with because a non-working sudo means you cannot access or edit the files you need in order to get things working. I found reports of people rebooting to single user or recovery mode, and others booting the live CD to access the hard drive, both of which seem like an over-sized hammer for this small nail. One mentioned using 'aptitude' and its menu to switch to root, but I could not get a shell from there. One mentioned using gksudo to run xterm as root, and that should work. I tried to use the network admin tool to fiddle with the settings until I got the hosts and hostname files to match. This was not easy because that tool did not allow simple hostnames (no domains) in the hosts file.
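
The hosts/hostname mismatch behind the sudo failure can be checked before a reboot makes it bite. The following is a minimal sketch, not a full resolver: it only tests whether the name from /etc/hostname appears as a name or alias on some line of /etc/hosts. The helper name hostname_in_hosts? and the domain example.org are hypothetical, used purely for illustration.

```ruby
# Return true when the hostname appears as a name or alias on some line of
# the hosts file content. Comments and blank lines are ignored.
def hostname_in_hosts?(hostname, hosts_text)
  hosts_text.each_line.any? do |line|
    fields = line.sub(/#.*/, '').split
    fields.drop(1).include?(hostname)   # fields[0] is the IP address
  end
end

# Sample content: the hosts file only knows the fully qualified name.
hosts = "127.0.0.1 localhost\n127.0.1.1 foo.example.org\n"
puts hostname_in_hosts?('foo', hosts)              # the mismatch case
puts hostname_in_hosts?('foo.example.org', hosts)
```

On a real machine the call would be hostname_in_hosts?(File.read('/etc/hostname').strip, File.read('/etc/hosts')), and both files are world-readable, so no sudo is needed to run the check.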

Mar 10, 2008

Rails authentication: restful_authentication and acts_as_state_machine

I have written a couple of rails apps with user authentication, but finally decided to start using some of the excellent plugins available for this. After a quick search I got the impression that restful_authentication was the current standard for rails (I'm using rails 2.0.2 at the moment), and especially with the link to the state machine plugin. However, my initial quick search did not yield a decent quick 'howto' for fast-tracking getting this all working. So I started writing notes on my findings here in my blog:
  1. Create your rails application:
    rails -d mysql myapp
    cd myapp
    rake db:create # you might need to edit config/database.yml first to match your db installation
  2. Install the two plugins required:
    script/plugin install
    # I needed to use trunk, as other versions have a missing const RailsStudio error
    script/plugin source
    script/plugin install restful_authentication
    # obviously these last two lines can be combined

  3. Create the users model and controllers:
    script/generate authenticated user sessions --include-activation --stateful
    # This will create the users model and the users and sessions controllers.

    It also adds map.resource entries for these in the routes.rb file. 'include-activation' is for email activation and 'stateful' is the tie to the state machine (for easily managing user activation and login status)

  4. Edit the routes.rb file to specify the user states and add some useful routes:
    map.resources :users, :member => {
      :suspend => :put,
      :unsuspend => :put,
      :purge => :delete
    }
    map.resource :session
    map.activate '/activate/:activation_code', :controller => 'users', :action => 'activate'
    map.signup '/signup', :controller => 'users', :action => 'new'
    map.login '/login', :controller => 'sessions', :action => 'new'
    map.logout '/logout', :controller => 'sessions', :action => 'destroy'
    map.forgot_password '/forgot_password', :controller => 'users', :action => 'forgot_password'
    map.reset_password '/reset_password/:code', :controller => 'users', :action => 'reset_password'
    map.account '/account', :controller => 'users', :action => 'account'
  5. Edit the environment.rb file to include the line:
    config.active_record.observers = :user_observer
    This allows for activation emails to be sent.
  6. Edit the migration, in this case db/migrate/001_create_users.rb, and add lines for the 'forgot password' feature and the option to have administrator users:
    t.column :password_reset_code,       :string, :limit => 40
    t.column :is_admin, :boolean, :default => false
  7. Update the database:
    rake db:migrate
  8. Remove the following line from sessions_controller and users_controller and add it to application_controller to enable authentication application wide:
    include AuthenticatedSystem
  9. Remove or comment out these two lines from the UsersController.create method:
    self.current_user = @user
    This allows us to add further processing of the user registration request, by adding a create.html.erb view and email activation.
  10. Add the view/users/create.html.erb file with content similar to:
    <fieldset>
    <legend>New account</legend>
    <p>Instructions for activating your account
    have been sent to <%=h @user.email %>.
    If this address is incorrect, please
    <%= link_to 'signup', signup_path %>
    again. If you do not receive the email
    soon, please check your spam filter.</p>
    </fieldset>
  11. Add the following helper methods to application_helper.rb:
    def user_logged_in?
      session[:user_id]
    end

    def user_is_admin?
      session[:user_id] && (user = User.find(session[:user_id])) && user.is_admin
    end
  12. Add a 'forgot password?' link to the views/sessions/new.html.erb form (usually near the 'submit tag'):
    <%= link_to 'Forgot password?', forgot_password_url %>
  13. Add links to login/out and signup to your main page or layout. For example, I used a fixed position div like this:
    <div style="position: absolute; right: 0px; top: 0px; height: 20px;">
    <% if user_logged_in? %>
    <%= link_to 'Logout', logout_url %>
    <% else %>
    <%= link_to 'Signup', signup_url %>
    | <%= link_to 'Login', login_url %>
    <% end %>
    </div>
  14. Support admin restrictions with the following method in the application controller:
    # Protect controllers with code like:
    # before_filter :admin_required, :only => [:suspend, :unsuspend, :destroy, :purge]
    def admin_required
      current_user.respond_to?('is_admin') && current_user.send('is_admin')
    end
  15. If you want admin control, add a before filter to the users_controller to restrict key actions to admin users only:
    before_filter :admin_required, :only => [:suspend, :unsuspend, :destroy, :purge]
  16. Add actions in users_controller.rb for account, change_password, forgot_password and reset_password:

    def account
      if logged_in?
        @user = current_user
      else
        flash[:alert] = 'You are not logged in - please login first'
        render :controller => 'session', :action => 'new'
      end
    end

    # action to perform when the user wants to change their password
    def change_password
      return unless request.post?
      if User.authenticate(current_user.login, params[:old_password])
        # if (params[:password] == params[:password_confirmation])
        current_user.password_confirmation = params[:password_confirmation]
        current_user.password = params[:password]
        if current_user.save
          flash[:notice] = "Password updated successfully"
          redirect_to account_url
        else
          flash[:alert] = "Password not changed"
        end
        # else
        # flash[:alert] = "New password mismatch"
        # @old_password = params[:old_password]
        # end
      else
        flash[:alert] = "Old password incorrect"
      end
    end

    # action to perform when the user clicks forgot_password
    def forgot_password
      return unless request.post?
      if @user = User.find_by_email(params[:user][:email])
        # set the flag and reset code, then save so the observer sends the mail
        @user.forgot_password
        @user.save
        redirect_back_or_default('/')
        flash[:notice] = "A password reset link has been sent to your email address: #{params[:user][:email]}"
      else
        flash[:alert] = "Could not find a user with that email address: #{params[:user][:email]}"
      end
    end

    # action to perform when the user resets the password
    def reset_password
      @user = User.find_by_password_reset_code(params[:code])
      return if @user unless params[:user]

      if ((params[:user][:password] && params[:user][:password_confirmation]))
        self.current_user = @user # for the next two lines to work
        current_user.password_confirmation = params[:user][:password_confirmation]
        current_user.password = params[:user][:password]
        @user.reset_password
        flash[:notice] = current_user.save ? "Password reset successfully" : "Unable to reset password"
        redirect_back_or_default('/')
      else
        flash[:alert] = "Password mismatch"
      end
    end

  17. Create html.erb forms in views/users for the change_password, forgot_password and reset_password actions.
  18. Edit models/user_mailer.rb and replace YOURSITE and ADMINEMAIL with values appropriate for the new website. A good way of doing this is to define the variable SITE in the config/environments/*.rb files and then use that in the strings in the UserMailer with the "#{SITE}" format. Also add methods for forgot_password and reset_password (ie. send mails when those actions are invoked):
    class UserMailer < ActionMailer::Base
      def signup_notification(user)
        setup_email(user,'Please activate your new account')
        @body[:url] = "#{SITE}/activate/#{user.activation_code}"
      end

      def activation(user)
        setup_email(user,'Your account has been activated!')
        @body[:url] = "#{SITE}/"
      end

      def forgot_password(user)
        setup_email(user,'You have requested to change your password')
        @body[:url] = "#{SITE}/reset_password/#{user.password_reset_code}"
      end

      def reset_password(user)
        setup_email(user,'Your password has been reset.')
      end

      # shared header setup for all of the mails above
      def setup_email(user,subj=nil)
        recipients "#{user.email}"
        from %{"Your Admin" <>}
        subject "[#{SITE}] #{subj}"
        body :user => user
      end
    end
  19. Add forgot_password.html.erb and reset_password.html.erb to the user_mailer view.
  20. Add methods to models/user.rb: forgot_password, reset_password, recently_forgot_password, recently_reset_password and recently_activated. Also add protected method make_password_reset_code.
    def forgot_password
      @forgotten_password = true
      self.make_password_reset_code
    end

    def reset_password
      # First update the password_reset_code before setting the
      # reset_password flag to avoid duplicate mail notifications.
      update_attributes(:password_reset_code => nil)
      @reset_password = true
    end

    # Used in user_observer
    def recently_forgot_password?
      @forgotten_password
    end

    # Used in user_observer
    def recently_reset_password?
      @reset_password
    end

    # Used in user_observer
    # (assumes the @activated flag set by the stateful generator's activation code)
    def recently_activated?
      @activated
    end

    def make_password_reset_code
      self.password_reset_code = Digest::SHA1.hexdigest( Time.now.to_s.split(//).sort_by {rand}.join )
    end

  21. Modify UserObserver.after_save(user) to send activation only based on user.recently_activated? Also add to this method mail sending for forgot_password and reset_password events:
    def after_save(user)
      UserMailer.deliver_activation(user) if user.recently_activated?
      UserMailer.deliver_forgot_password(user) if user.recently_forgot_password?
      UserMailer.deliver_reset_password(user) if user.recently_reset_password?
    end
  22. Make sure your mail subsystem is properly prepared to send mail. I installed postfix on my development and deployment machines (both ubuntu, so I used 'apt-get install postfix'). It is a good idea to test this with a command-line mail like:
    sendmail -f
    Subject: test

    Hello, world!
  23. Add appropriate administrative links. In my case I created an 'account' route to a new account action and view in the users controller, and in this displayed current user settings and provided a link to the 'change_password' action. Since this is very similar to many of the actions above, it is left as an exercise to the reader :-)
  24. Test everything, sign up a user, login, logout, click 'forgot password', respond to all emails sent, change the password, etc.
  25. What's next? Well, in my case I continued by adding a boolean 'is_admin' flag to the users table and then adding extra capability to my site for admin users. I also created a cool layout and used it for all controllers in my site. This is rails after all, the sky is the limit :-)
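
The least obvious part of the steps above is how a mail gets sent at all: the model only sets an in-memory flag, and the observer checks that flag on the next save. Here is a plain-Ruby sketch of that pattern, with no Rails and no database. SketchUser and SketchObserver are made-up stand-ins, SecureRandom is swapped in for the SHA1 idiom used in the model, and clearing the flags after save is my own simplification (in the Rails code the flags simply vanish with the object at the end of the request).

```ruby
require 'securerandom'

# Plain-Ruby stand-in for the User model.
class SketchUser
  attr_reader :email, :password_reset_code

  def initialize(email)
    @email = email
  end

  def forgot_password
    @forgotten_password = true
    # 20 random bytes as hex gives 40 characters, fitting the :limit => 40 column.
    @password_reset_code = SecureRandom.hex(20)
  end

  def reset_password
    @password_reset_code = nil
    @reset_password = true
  end

  def recently_forgot_password?
    !!@forgotten_password
  end

  def recently_reset_password?
    !!@reset_password
  end

  # Stand-in for ActiveRecord's save: notify the observer, then clear the flags.
  def save(observer)
    observer.after_save(self)
    @forgotten_password = @reset_password = nil
    true
  end
end

# Stand-in for UserObserver: records which mails would have been delivered.
class SketchObserver
  attr_reader :outbox

  def initialize
    @outbox = []
  end

  def after_save(user)
    @outbox << [:forgot_password, user.email] if user.recently_forgot_password?
    @outbox << [:reset_password, user.email] if user.recently_reset_password?
  end
end

observer = SketchObserver.new
user = SketchUser.new('someone@example.org')
user.forgot_password
user.save(observer)
user.save(observer)   # a plain save with no flag set queues nothing
user.reset_password
user.save(observer)
p observer.outbox
```

The point of the design is that the controller never calls the mailer directly; it only changes the user's state and saves, and the observer decides which mail, if any, goes out.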

Mar 7, 2008

Even Microsoft is getting cool

When you think of 'cool' modern companies, names like 'yahoo' and 'google' spring to mind. For many 'apple' is also synonymous with cool. But Microsoft rarely gets that classification; it is more often called 'serious', 'business focused', even 'ruthless'. But take a look at the photo gallery of research projects from Microsoft's seventh annual techFest. Now that is cool!

Feb 19, 2008

IT back to business

A recent ComputerWorld article describes a 'new trend' in IT towards having business savvy 'IT' people working within business departments instead of centralized generic IT personnel. I think this is a trend that started a while ago with pragmatic companies focusing on operational efficiency. Unfortunately not all companies are pragmatic, but hopefully more start to follow this trend.

I had a gripe with the way IT was moving in two previous companies I worked for. In the first, the IT department was standardized across the very large international organization, and focused on the common low-tech user, which was completely unsuitable for our high-tech development site, dramatically restricting our efficiency. We had to do our own internal 'skunk-works' IT, and hide the costs, in order to operate efficiently. My worst case horror story was the time it took 6 months and 5 engineers in 4 countries to install a local printer! (previously we had one local IT tech who would do it in a couple of hours).

The second company was a start-up which meant it had good pragmatic IT for a while, but as it grew, the new management tried to increase operational efficiency by 'centralizing' and 'standardizing' IT. Sounds good on paper, but simply does not work in reality. Costs increased and performance decreased due to the separation of IT from the people actually doing the business. People working towards true operational efficiency were marginalized and often left the company. By the time I left it was well on the way to the level of operational inefficiency of the large multi-national I worked with before.

Over the years I've developed a very strong feeling that IT must be integrated into the business. I'm thrilled to see a prominent article claiming this as an industry trend! So I end this blog with a nice quote:

"I want them to think of themselves as people who work for this company, not people who work for this company's IT department," he says. "We have an energy supply business to manage. That's our business, and we want to do it as efficiently as possible. It doesn't really matter what the IT job is."

Jan 23, 2008

Jazz, software development and the Sun/MySQL deal

Jazz - I kept hearing that word over and over the last couple of weeks, culminating in me finally joining YouTube and putting together a play list of jazz music videos. However, it all started with software development:
  • First IBM announced the partial open sourcing of their service. This was interestingly relevant to my current search for on-line software project hosting services. The site says: 'Developing software in a team is much like playing an instrument in a band. Both require a balance of collaboration and virtuosity. Jazz defines a vision for the way products can integrate to support this kind of collaborative work, and a technology platform to deliver on this vision.' Sounds great, but I'm not sure it is mature enough to replace my current top contender. However, considering the great job IBM did with eclipse, I'm certainly going to keep an eye on this new offering.
  • Then I read an article about the new trend in developer recruitment, calling top candidates 'rock star coders'. This trend was countered by some well written blogs asserting that it is better to be a 'jazz musician programmer':
    • 'I would rather be a jazz programmer' does a lovely comparison of rock stars and jazz musicians w.r.t. programming, emphasising creativity. I tried to paraphrase the key points here, but I think it just has to be read, so go take a look.
    • 'I'd rather be a DJ than a rock star developer' likes the jazz programmer idea, but prefers to be a DJ, emphasising code reuse and web-mashups.
    • 'Monks versus music' says that agile development teams are a lot like 'jazz bands'. This blog got me onto the YouTube jazz search, starting with Count Basie and culminating with Norah Jones! In many videos artists joined forces to create a blend of music. I really like Norah Jones singing with Ray Charles, what a combo!
  • Sun's acquisition of MySQL - OK, no jazz mentioned there, but with jazz on my mind I saw a fit. It is yet another interesting merger of talents, the mature Sun trying their hand at the new world of open source, partnering with the younger, hipper MySQL with a very solid open source presence. It reminded me so much of the YouTube video of Ray Charles being introduced by Johnny Cash and then singing Johnny's song 'Ring of Fire' with a strong jazz slant. Let's see if MySQL will jazz up Sun's product offerings, or will Sun dull down MySQL? (considering Sun's recent moves towards open source, with Java and Solaris, I'm betting on the former, which is great).