Aug 12, 2014

Neo4j read-only transactions still need success()

Since Neo4j 2.0 it has been necessary to wrap read-only database access in transactions, just as was needed for write access in 1.x. This has necessitated refactoring of much legacy code, adding transactions around code that previously did not require them. It turns out there is a subtlety to this: a case where it is easy to make a mistake that does not necessarily cause any immediate problems.

Let's first explain a common pattern that leads to this issue. Consider a method with multiple return statements:
public String getName(Node node) {
    if (node.hasProperty("name")) {
        return (String) node.getProperty("name");
    } else {
        return null;
    }
}
This code is a bit contrived, but it does represent a common pattern: multiple return statements in code that performs read-only access to the database. I've seen much more complex cases, but this one suffices to demonstrate the problem. So, what happens if we add a transaction? The easiest approach is to wrap all the code in the method:
public String getName(Node node) {
    try (Transaction tx = node.getGraphDatabase().beginTx()) {
        if (node.hasProperty("name")) {
            return (String) node.getProperty("name");
        } else {
            return null;
        }
    }
}
You might immediately notice I left out the tx.success() call. My reasons for doing this were three-fold:
  • I would need to add multiple calls, one before each return.
  • If I follow the pattern of calling tx.success() after all database access, I should create a new local variable for the result of the getProperty() call, making the code more complex.
  • It turns out that this code runs fine, so why bother with the tx.success() call?
This last point is the subtle catch. While this code seems to run fine, there are cases when it really does not run fine. Before explaining those cases, let's just look at how the code would look if we were strict with the tx.success() call.
public String getName(Node node) {
    try (Transaction tx = node.getGraphDatabase().beginTx()) {
        if (node.hasProperty("name")) {
            String name = (String) node.getProperty("name");
            tx.success();
            return name;
        } else {
            tx.success();
            return null;
        }
    }
}
Clearly this is noticeably more verbose than the previous code. And since the previous code worked, it is extremely tempting to write code like that. And I did. And all hell broke loose, and it took a while to figure out why.

It turns out that while a single transaction like this does run fine without the tx.success() call, it does not work with nested transactions. If other code calls your method, that code also has a transaction, and that transaction does call tx.success(), then an exception will be raised. The inner transaction, even though it becomes a dummy (nested) transaction, will mark the whole transaction as not having completed successfully, and when the outer transaction does mark success, the database will reject this case and throw a:
 - Unable to commit transaction
with cause:
 - Failed to commit, transaction rolled back
and no further information.

When I first saw this, I could not understand where it was coming from. I was in the process of updating legacy code by adding more read-only exceptions, and successfully fixing several unit test cases. However, I found that after fixing some unit tests, other apparently unrelated unit tests started failing with this exception. It occurred because the first unit tests added transactions deeper in the call stack, the same methods used by other tests that had already added transactions higher in the stack. Normally it does not matter if you nest transactions. This is a common pattern. So it was not obvious, at first, that the nesting would cause the problem, but it sure did.

To clarify the cases when this problem occurs I wrote a little test code:
private void readNamesWithNestedTransaction(boolean outer, boolean inner) {
    try (Transaction tx_outer = graphDb.beginTx()) {
        try (Transaction tx_inner = graphDb.beginTx()) {
            // ... read-only access to the database here ...
            if (inner) {
                tx_inner.success();
            }
        }
        if (outer) {
            tx_outer.success();
        }
    }
}
This method can be called with two booleans, which control whether tx.success() is called in the inner transaction, the outer transaction, both or neither. It turns out only one of the four possible combinations causes an exception (tested against Neo4j 2.1.2).

In the cases where the inner and outer are the same, no exception is thrown because there is no inconsistency. If both are true (success called for both), the transaction succeeds cleanly. If both are missing the success call, the entire transaction rolls back, but since it is a read-only transaction, nothing is actually undone. If the inner is true and the outer false, the outer ignores the success of the inner transaction and decides on a full rollback, which again does nothing. However, in the specific case where the inner does not call success() but the outer does, the database marks this as an invalid case and throws the exception, despite the transaction being read-only.
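Those four combinations can be captured in a tiny model of the observed behaviour. This is just a truth table mirroring the description above, with names of my own choosing, not Neo4j internals:

```ruby
# Model of the observed outcome for a nested read-only transaction pair
# (as seen on Neo4j 2.1.2). Purely illustrative.
def nested_outcome(outer_success, inner_success)
  if outer_success && inner_success
    :commit      # both marked success: commits cleanly
  elsif outer_success
    :exception   # outer success but inner failure: rejected at commit
  else
    :rollback    # outer never marked success: harmless read-only rollback
  end
end
```

Only the outer-success/inner-failure cell misbehaves; the other three are consistent.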

My conclusion? While it is not really wrong to skip the success() on read-only transactions, it is certainly a very bad idea. Perhaps you can get away with it on the outer-most main loops of an app, when you are really certain you will not be called by something else, but be very, very wary of forgetting the success call in any methods you expect to be called from elsewhere. Either have no transaction at all (rely on the calling code to add transactions), or accept the higher verbosity that comes with writing the code right.

May 10, 2014

Using GoPro Time-Lapse Photos for Mapillary

Over the last half year mapillary has grown in popularity, with over half a million photos uploaded, more than 100k in the last ten days alone! And this is despite the fact that at first glance it feels like you have to use a normal hand-held smartphone to do the job. What if you could use a mounted action camera, like a GoPro, to take photos for mapillary? Mounted on a bicycle, helmet, even a car? It turns out you can, and last week I did a little drive and collected about 1700 photos in half an hour using TimeLapse, consuming only 3.4GB of my 32GB SD card. I could have driven for over four hours and taken around 15k photos before filling my card!

However, it is not entirely trivial to do this, so I thought I'd explain a bit about how it is done. The main problem is that the GoPro has no GPS. How then can the map know where to place the photos? It cannot. We have to tell it, and we have to use a separate GPS to do so. In my case I've been using my HTC One smartphone, which I had also previously used for normal mapillary photos. In brief, the procedure is:
  • Mount the GoPro somewhere, on a bicycle or car using the various mounting accessories available.
  • Start a GPS tracking app. I used geopaparazzi, but there are many others that will work just as well.
  • Configure the GoPro to take time-lapse photos. I chose 5MP, medium, narrow, 1s.
  • Start the GoPro and go for a drive or cycle tour.
  • Upload the photos and GPX track to a computer.
  • Geolocate the photos using time-correlation to the GPX track using gpx2exif.
  • Upload the photos to Mapillary.
OK. That doesn't sound too difficult, right? So let's describe these steps in more detail, using the drive I made last week.

I mounted the GoPro onto the front left fender of my Mazda RX8 to get a nice view of the center of the road.

Then I mounted my HTC One into a dedicated HTC mount inside the car. Using the 'GoPro App', I could connect to the camera and get a preview of what the GoPro can see. This is convenient, but not really necessary, as the preview disappears as soon as the recording starts. It is sufficient to just have the phone lying in the car somewhere, as long as it gets a good GPS signal. If you do use the GoPro app, take the opportunity to go to the settings and set the camera time to minimize the time difference between the two devices. As you'll see later, you will need to correct for differences in order to perform a correlation, and the smaller the correction the easier the job.

Make sure the GoPro is configured for time-lapse photos with the photo settings you like. This is easy to achieve using the GoPro App, but can also be done on the GoPro itself. I chose 5MP, Medium, Narrow to get a view I felt was somewhat similar to what I would see with my phone, with a focal length of approximately 20mm compared to a 35mm camera. The GoPro defaults to very wide angle views with a lot of distortion, so I avoided that for this drive. In another blog I plan to describe an alternative scenario where I got mapillary photos from a wide angle 4K video stream. That was more complex, so we'll skip it for now. With time lapse I selected a 1s time gap to give about 1 photo every 10-20m for speeds between 40 and 80 km/h. The mapillary default of one photo every 2s is more suitable for cycling, and I was certainly planning to drive faster than I cycle!
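As a quick sanity check on that choice of interval, here is a small helper (my own, not part of any tool) computing the metres of road covered per frame at a given speed:

```ruby
# Metres travelled per time-lapse frame at a given speed (km/h) and
# frame interval (seconds). 40-80 km/h at a 1s interval gives roughly
# 11-22m between photos, matching the 10-20m target above.
def metres_per_frame(kmh, interval_s = 1.0)
  (kmh / 3.6) * interval_s
end
```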

Start the GPS tracking app. In my case I used geopaparazzi, which has a button for starting a recording. I clicked that button and accepted the suggested file name for storing the track. OK, now we are all ready to go. Just start the camera recording and drive!

When you have finished the drive, stop the recording on both the camera and the GPS. Now the real work starts. We need to perform a geolocation before we can upload to Mapillary. In Geopaparazzi, I exported to GPX, then used a file manager app to find my GPX file and email it to myself. For the GoPro I simply pulled out the SD card, plugged it into the side of my laptop and copied the photos over to a local directory.

The first thing I wanted to do was see what the drive looked like, so I ran:
geotag -g väla_to_billesholm.gpx \
       -o väla_to_billesholm.png \
       -D 1 -s 2048x2048

This produced a nice high res view of the entire map. Note the use of the -D option to put a big gap between the marker labels. This is necessary because the geotag default is set for much smaller tracks, like a quick bicycle ride. From the image produced, we can see the times and locations for key landmarks on the drive. We will want to zoom in on a few distinctive areas we can use to manually check that the camera and GPS clocks are synchronized, or determine exactly the error between them. We can correct for this error when correlating, so it is important to get it right.

I ran the command:
geotag -R 20140505T12:39:00+02-20140505T12:41:00+02 \
       -g väla_to_billesholm.gpx \
       -o krop.png -D 0.1

which produced the image above. I can look for a photo when going under the highway bridge and confirm the times.

The EXIF data for this photo shows it was taken at 12:39:11. When looking at the map, we can see that we passed under the tunnel at 12:39:14. So we have a 3s error. We can use this in the geolocation process, but let's double check with another photo.
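The offset calculation itself is trivial, but it is easy to get the sign backwards. A little helper (my own naming) makes the convention explicit: GPX time minus EXIF time gives the number of seconds the camera clock lags the GPS clock.

```ruby
require 'time'

# Seconds to add to camera timestamps to match the GPS track:
# positive when the camera clock is behind the GPS clock.
def camera_clock_offset(exif_time, gpx_time)
  (Time.parse(gpx_time) - Time.parse(exif_time)).to_i
end
```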

I created a map of the track through Mörarp, because I could identify features like buildings and intersections. It is a bad idea to use any intersection that you might have paused at, like I did at the right turn above, but look for landmarks along the drive where you were sure to be moving. I looked at the first road to the right, near the top of the map, and found the following photo at 12:46:10.

I've taken note of things I know to be at the specified location, the speed warning sign in the middle, the white marker on the right for the intersection, and the tree and lamp-post on the side.

One interesting double-check you can also do, if you happen to be in an area covered by Google Street View, is visually compare those images.

In Google Street View we can see the speed warning, the intersection marker, and the tree and lamp-post, but notice that the fence is not there and fir trees can be seen instead. Clearly someone has been building since the Google car passed! Google said the image was captured in September 2011, which is about 2.5 years ago, so things can certainly change.

On the GPX track of Mörarp, we see this intersection was passed at 12:46:13, which is 3s after the camera image timestamp. Again we have a 3s offset. This is good news. It appears the entire track was off by the same amount. We can go ahead and correlate all 1500 photos using the command:

geotag -g väla_to_billesholm.gpx \
       20140505_TimeLapse/*JPG -t 3 -v

I set the time offset with the '-t 3' option, and used '-v' to watch the progress. Since the script actually wraps the command-line tool 'exif_file', a new process is started to edit each file, and this can take a while, but in the end your images will have the GPS positions from the GPX track written into their EXIF headers.

Once the images are geolocated, you can upload them to Mapillary. Simply log in, click on your name, choose 'upload images', then click the green 'choose files' button. Once the files are all listed, scroll to the bottom and click the 'Start Uploading' button. It goes a slightly paler green, so it is not that obvious that the uploads have started. Just scroll to the top again and you should see thin red progress bars below each image as it is uploaded.

And finally, once the upload is completed, click your name and select 'my uploads', and you should see a new image for your track.

Click your most recent upload to browse the photos in mapillary!

It might take time for the photos to be completely processed, so don't worry if they are not immediately available. Just come back and look a little later. In the meantime there is something else really fun to play with!

Time Lapse Video

The GoPro settings for taking these photos were called 'Time Lapse' for a reason. They can be used to make a video. Since we recorded one frame a second, if we make a video at 25fps, we will have a 25x speedup. This is pretty cool! See what I made below....

This video was made using the following process:
  • Rename all photos to have names with numbers starting at 0000. In my case I used foo-0000.jpeg. To make life easy, I wrote a Ruby script to create hard links with the appropriate names. Then you can use the ffmpeg command to make a video like this:
ffmpeg -f image2 -i foo-%04d.jpeg \
       -r 25 -s 1280x960 ../myvid.mp4
  • This command compresses the 7MP 4:3 images to 960p HD video.
  • Then I used OpenShot video editor to trim to 16:9, add audio, add the map, and compress to 720p medium quality for faster web uploads.
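The renaming step above can be sketched in Ruby roughly like the script I used. The directory layout, file pattern and 'foo' prefix here are placeholders, not the exact names from my drive:

```ruby
require 'fileutils'

# Hard-link each photo to a sequentially numbered name (foo-0000.jpeg,
# foo-0001.jpeg, ...) so ffmpeg can read them with -i foo-%04d.jpeg.
# Hard links avoid copying gigabytes of images.
def link_sequence(src_dir, dest_dir, prefix = "foo")
  FileUtils.mkdir_p(dest_dir)
  Dir.glob(File.join(src_dir, "*.{jpg,JPG,jpeg}")).sort.each_with_index do |path, i|
    File.link(path, File.join(dest_dir, format("%s-%04d.jpeg", prefix, i)))
  end
end
```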

Installing gpx2exif

In this blog we made extensive use of the geotag command. This command is included in the 'gpx2exif' Ruby Gem. For this blog we made use of features available in version 0.3.6. However, at the time of writing, the published gem was at version 0.3.1. So I will explain both the easy way to install (assuming we get 0.3.6 published before you try this) and the slightly harder way (if 0.3.6 is not published, or you wish to make changes to the script yourself).

Installing Ruby on Ubuntu

The first thing you need is Ruby. How to install it depends on the OS. I use Ubuntu 14.04 and RVM, so I recommend you refer to the Ruby and RVM installation guides for advice on your own platform. Here I'll give instructions for my OS, so you will need to make some adjustments to suit your platform.
sudo apt-get install curl
curl -sSL | sudo bash -s stable --ruby
sudo usermod -g rvm craig
# logout and login to get RVM group
source /etc/profile.d/

The code that creates the png images above makes use of ImageMagick. On Ubuntu at least this means you need to install a few dependencies first:
sudo apt-get install imagemagick imagemagick-doc libmagickwand-dev
gem install rmagick # Requires libmagickwand-dev to compile

Installing from rubygems.org

Once ruby is installed, simply install the gem:
gem install gpx2exif

Then list the installed gems with 'gem list' to see which version you got. If it is not 0.3.6 or later, then use the instructions below.

Installing from github

Install git, and then run the command:
git clone
cd gpx2exif
bundle install
rake build
gem install pkg/gpx2exif-0.3.6.gem

If all goes well, you will have built and installed the latest Ruby Gem.

Have fun!

Apr 8, 2014

Paging huge cypher requests

Recently I typed a simple cypher command into the Neo4j browser that effectively 'killed' the server. By this I mean the server process went to 100% CPU usage and remained there. It became unusable. In retrospect I should have expected this, since the command I typed was going to hit about 80% of a 4.5 million record database - all in one transaction!
MATCH (c:GeoptimaEvent)
  SET c :ConfigCheck
  RETURN count(c)
This command finds all nodes labeled GeoptimaEvent and adds the label ConfigCheck. My goal was to change all node labels, first by adding the new one and then by removing the old one. But what happened instead was that the server started struggling to allocate memory to hold the entire 4M change transaction. No other requests to the server could be handled. Luckily Neo4j is resilient against failure. I simply needed to restart the server to get back to my previous state.

So, how do we do this properly?

I was given a great suggestion by Jacob Hansson to split the work into blocks, and Ian Robinson pointed out that the SKIP and LIMIT statements can apply to the WITH clause. This led to the following command:
MATCH (c:GeoptimaEvent)
  WHERE NOT (c:ConfigCheck)
  WITH c LIMIT 1000
  SET c :ConfigCheck
  RETURN count(c)
See how this command will only change 1000 nodes, and only those that have not already been changed. This is achieved by first streaming (c:GeoptimaEvent) nodes through the filter WHERE NOT (c:ConfigCheck), taking only the first 1000 off the top of the stream using WITH c LIMIT 1000, and then applying the SET c :ConfigCheck label to the nodes. We return the number of nodes changed. By repeating the command until the result returned is zero, we can change all the nodes.

This command took only a few hundred milliseconds on our standard server (16GB RAM Quad Core i5-4670K, Neo4j 2.0.1 default installation in Ubuntu 12.04 LTS). However, we would have to repeat this command about four thousand times to change the entire database, so let's figure out how big we can go with our transaction size before performance becomes a problem.

By trying the same command with different LIMIT settings, we can see that the performance scales nicely and linearly up to around 400,000 records. After this, things get noticeably slower, and after 600,000 nodes it gets really bad. What is happening here is that GC is kicking in. And if you go for large enough transactions you could exceed the maximum heap size.

We only needed to repeat the command ten times with a transaction size of 400,000 in order to change all 4,000,000 nodes. I was happy to do this in the Neo4j browser. If I needed to repeat the command many more times, I would have written a Ruby script and used Neography.
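Had scripting been needed, the repeat-until-zero loop might have looked something like this sketch. The block stands in for whatever executes the LIMITed Cypher command (via Neography or the REST API) and returns its count(c):

```ruby
# Repeat a batched update until one round reports zero changed nodes.
# The caller's block should run a single LIMITed Cypher command and
# return the number of nodes it changed. Returns the grand total.
def update_in_batches(max_rounds = 100)
  total = 0
  max_rounds.times do
    changed = yield
    break if changed.zero?
    total += changed
  end
  total
end
```

With a transaction size of 400,000 against 4,000,000 nodes, the block runs ten times with work and an eleventh time to observe the zero.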

Now that we've added the new labels, we can remove the old ones by repeating the following command until it returns 0 changes:
MATCH (c:GeoptimaEvent)
  WITH c LIMIT 400000
  REMOVE c :GeoptimaEvent
  RETURN count(c)

A more complex example

The above case was pretty simple really. However, since then I've been faced with a much more challenging example, bulk copying properties from one part of the graph to another.

Let's start with a little background to the problem. The database above is actually a tree structure, with the leaf nodes representing the requests made to a web service. We wanted to data-mine the Apache2 logfiles, and in order to calculate high performance statistics we built a tree structure with parent nodes representing the kinds of aggregations we would like to make. We imported the data using the csvtreeloader, leading to a tree structure like:
(d:Device)
  -[:versions]-> (v:GeoptimaVersion)
    -[:days]-> (x:EventDay)
      -[:checks]-> (c:ConfigCheck)
We can ask queries like "How many service checks were there per day during March?":
MATCH (x:EventDay)-[r:checks]->()
  WHERE x.day >= "2014-03-01"
    AND x.day < "2014-04-01"
  RETURN x.day as Day, count(r) as Checks
This command returns very quickly and is used in dynamic websites providing instant statistics on the activity of the service.

The problem I faced was that some critical information about the service check, the PLMN code of the device making the check, was saved at the (c:ConfigCheck) level. Considering that we had about 10,000 devices and 4,000,000 config checks, any query on PLMN would hit 400 times as many nodes as needed. We needed to move this critical information up the tree. However, this is not trivial, because the most obvious command to do this will read all 4,000,000 ConfigCheck nodes and copy repeatedly the same information:
MATCH (v:GeoptimaVersion)-->(x:EventDay)-->(c:ConfigCheck)
  WHERE NOT has(v.plmn)
  SET v.plmn = c.mcc+'-'+c.mnc
  RETURN count(v)
This command has two problems:
  • It will read all 4,000,000 ConfigCheck nodes in one transaction (same problem as before)
  • It will set the same PLMN code over and over on the GeoptimaVersion node (wasted effort)
We can fix both problems with the following interesting command:
MATCH (d:Device)-->(v:GeoptimaVersion)
  WHERE NOT has(v.plmn)
  WITH d, v LIMIT 1000
  MATCH (v)-[:days]->()-[:checks]->(c)
  WITH d, v,
    filter(
      x in collect(
        distinct replace(c.mcc+'-'+c.mnc,'"','')
      )
      WHERE NOT x =~ '.*null.*'
    ) as plmn
  LIMIT 1000
  SET v.plmn = plmn
  RETURN v.version_name, v.plmn, plmn
The key points in the above command are:
  • We search first for the Device and Version, filtering for ones without the PLMN and using blocks of 1000. Since there are 10,000 devices, this command only needs to be run about 10 times. Each of these will hit about 10% of the database.
  • We search for all ConfigCheck events belonging to each Device, and use collect() with filter() to combine their PLMN codes into a single deduplicated collection for that device.
  • We finally set the value to the requisite parent node and return the results.
These commands each took about 20s to run on the same server. Considering how much is going on here, this is quite impressive performance, I think.

One part of the above command deserves a little more explanation. The construction of the PLMN. We called the following function:
filter(
  x in collect(
    distinct replace(c.mcc+'-'+c.mnc,'"','')
  )
  WHERE NOT x =~ '.*null.*'
) as plmn
What this does is:
  • Combine the properties c.mcc + '-' + c.mnc.
  • This string contained double quotes; for example my own device has '"240"-"06"' and I expect to see '240-06' for Telenor, Sweden. We use the replace() function to remove the double quotes.
  • Then we reduce the set to distinct values only, using the distinct keyword.
  • And use collect() to make a single set of all matching results.
  • And filter() to remove entries with the value 'null' in them.
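The same cleanup is easy to express in Ruby, which is a handy way to check what the Cypher expression should produce. The hash keys follow the property names in the example above:

```ruby
# Ruby equivalent of the collect/replace/filter expression: combine
# mcc/mnc, strip double quotes, drop 'null' entries, deduplicate.
def clean_plmns(checks)
  checks.map { |c| "#{c[:mcc]}-#{c[:mnc]}".delete('"') }
        .reject { |plmn| plmn =~ /null/ }
        .uniq
end
```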
This was a complex example, indeed. Took a while to figure out. The fun part of it, though, was that this could be done through a little trial and error in the Neo4j Browser. Once or twice I typed in a command that hit too much of the database, but a quick restart later and I was back on track. Once it worked, it worked well, and I was able to migrate the entire database quite quickly.

Jan 16, 2013

Unwanted side effects

Last week the performance of Neo4j Spatial improved by 100 times! What the..? Can this be real? Does it mean Neo4j Spatial was always 100 times too slow? Of course, the truth is more subtle than that. Firstly, it was only the performance of the 'add geometry' function that changed, and only for the newer IndexProvider API.

Take a look at the chart above. The yellow and green show performance starting at around 300 geometries/second improving quickly to over 1000 nodes/second on my laptop. This was after applying Axel Morgner's bug-fix. Before the fix, the blue and red lines show the performance dropping quickly to as low as 10 geometries/second once the index contains around 20000 geometries. That is way too slow. In fact the performance was only acceptable for the first 1000 geometries or so, only good for small cases, perhaps the test cases. And that is the first hint as to why this was not noticed until recently.

The idea that a bug fix can break something else is not new. And code that works reliably in one place can fail miserably in another. Sometimes the results can be quite dramatic, as we see above. So what really happened here? It all started with a bit of old code that chained all geometry nodes together in a long chain. This code was not really needed, except by one obscure test case, and one unused API method. It was originally coded as a trivial example of a domain model, since Neo4j Spatial allows you to add an index to any graph structure you wish. But as time went by this model became the default case for whenever users did not have their own domain model, which is all of the time when using the higher level API's. And, of course, no-one bothered to remove it, because it had no obvious harmful side effects. Yet!

And then two things happened that are related to this code. Firstly, in 2011 Peter Neubauer developed a new API on Neo4j Spatial enabling the use of the Spatial Index through the standard Neo4j index API. This new code wanted to check against adding the same node twice to the same index. The easy solution to this was to use the old code, the old long chain of connected geometry nodes. This was tried and tested code, a tried and tested data structure, so it should be good, right? Wrong!

Then in September 2012 Axel Morgner noted that under some circumstances the old code caused NullPointerExceptions. This occurred if there was an error during the transaction that added the geometry. This error could occur in any application level code, and would simply roll back the transaction, which should be fine, but not with that old code. There was a static reference to the end of the geometry chain, a variable called previousGeomNode, that would end up pointing to a Node that had been rolled back. The simple fix was to not keep this reference at all. But that meant that each addition would take a little longer than the previous, because it would have to traverse the chain to find the end before adding to it. You can see quickly that this will escalate to a serious problem. After some upgrades Axel found he could not reproduce the bug, so it was never fixed because the obvious fix would cause a more serious bug.

And then the real problem came. Not from the NPE, but from the shift of users from the old API to the new one. It turned out that the new API was using a version of the long-chain code that did not have the NPE bug. No static reference. Nice clean code. But that meant it should have the expected performance problem. And it sure did. As more users started loading larger datasets, the performance of Neo4j Spatial fell through the floor.

In November 2012, Jonathan Winterflood noticed that performance dropped dramatically as more geometries were added through the IndexProvider API, and submitted bug report 72. The problem was discussed on the forums and others noticed this problem too. Rune Sörensen said "this is a show stopper for spatial data". And then the community came to the rescue. Jonathan submitted a pull request for a good test case demonstrating the problem. Then Axel Morgner submitted a series of pull requests fixing the problem. Finally Craig merged in the fixes and did some tests on the old and new code, resulting in the dramatic chart above.

But this was not all. I looked deeper at the problem and felt that the old 'long chain' code was the real problem here, and while Axel's fixes removed its use from the new API, it was still lying there tempting others to see what bugs they could write in future. So I removed this code almost completely. All standard code, test cases and formal API's no longer use this code. The consequence is that geometries are no longer ordered by insert order, something no-one has requested (yet). But just in case they do, or they already rely on this behavior in the old code, a 'new' class now exists called OrderedEditableLayer that uses this old code, in a small, self-contained way. Developers have to explicitly make use of that code if they want this behavior. And right now the only code I know that still uses this is the ShapefileImporter, and then only if you pass a special flag to enable geometry ordering.

OK, so all is well in spatial again. But before ending, I wanted to give a quick overview of the various API's mentioned above. Let's move away from the abstract discussions on side effects and get into some actual code. Here are some examples, each using a different one of the various API's mentioned above.

IndexProvider: compatible with other Neo4j indexes

Let's start with the newer API because it was the one with the bug. This API does not give you access to the full feature set of Neo4j-Spatial, but is convenient because of its compatibility with other Neo4j indexes. In the example below we show how to create an index, add geometries to it and then query for whether they are inside a polygon:

// Create the index
Map<String, String> config = SpatialIndexProvider.SIMPLE_POINT_CONFIG;
Index<Node> index = db.index().forNodes("layer1", config);

// Add the geometry in a transaction
Transaction tx = db.beginTx();
try {
    Node n1 = db.createNode();
    n1.setProperty("lat", (double) 56.2);
    n1.setProperty("lon", (double) 15.3);
    index.add(n1, "dummy", "value");
    tx.success();
} finally {
    tx.finish();
}

// Query the index
IndexHits<Node> hits = index.query(
    "POLYGON ((15 56, 15 57, 16 57, 16 56, 15 56))");

This example was extracted from the code in IndexProviderTest, which contains many more examples for how to interact with the index.

SpatialDatabaseService: the original API

Here you can access the RTree in a very flexible way, and can also plugin your own GeometryEncoders and do all kinds of wild things with Neo4j Spatial. For the purpose of comparison, I'll focus only on that part in common with the other API, adding and querying simple point geometries.

// Create SpatialDatabaseService on existing database
SpatialDatabaseService geo = new SpatialDatabaseService(db);
EditableLayer layer = (EditableLayer) geo.createSimplePointLayer("test", "lon", "lat");

// Add the geometry in a transaction
Transaction tx = db.beginTx();
try {
    Node n1 = db.createNode();
    n1.setProperty("lat", (double) 56.2);
    n1.setProperty("lon", (double) 15.3);
    layer.add(n1);    // Compare this to the index.add(node,key,value) method above
    tx.success();
} finally {
    tx.finish();
}

// Use GeoPipeline to find geometries around point
List<SpatialDatabaseRecord> results = GeoPipeline
    .startNearestNeighborLatLonSearch(layer, new Coordinate(15.3, 56.2), 1.0)
    .toSpatialDatabaseRecordList();

This code uses the GeoPipeline API to perform the search. This is a useful approach because GeoPipeline can chain many operations together, resulting in quite complex GIS queries performed in a streaming, and efficient, way. Look at the GeoPipeline test cases in the Neo4j Spatial source for point data and for more complex examples.

Oct 29, 2009

One liner multi-image cropping

A simple entry for a simple way to solve a common problem.

I often have a lot of scanned images, documents usually, that have unnecessary white-space, and other artifacts around the edges due to limitations in the scanner software or issues with the scanner itself. Normally I use the gimp, and can correct orientation problems, colour issues and cropping very easily. But this is too much mouse work in my opinion when you have 50 images and all need a similar simple cropping. So I fiddled around in the shell a little and came up with this one-liner:
mkdir -p trim crop   # the one-liner assumes these output directories exist
for file in *.jpg ; do echo "Cropping $file" ; \
convert -trim $file trim/$file ; \
convert -crop 1692x2190+4+4 trim/$file crop/$file ; done

OK. Not really one line, but I swear I did write it on one line :-)

What it does is three things:
  • loop over all image files, and for each:
  • trim all white-space off the edges, bringing all images to the same basic size with the top left of the real data at the top left of the image
  • crop them down to a specified size, in this case a little smaller than US letter size. I cut a few pixels off each edge just to remove some noise commonly found on the edges of scanned images.
And here is a simple set of thumbnails of the results:

original image with many artifacts

first trim removes continuous whitespace

final cropped image at letter size with edges removed

It took about 10 minutes to figure out this script, 20 seconds to run it on all 50 images, and then 20 minutes to write the blog!

Oct 15, 2009

Complexity threshold

I've been planning to write a proper article about this for years, but inspired today by Dan Pink's excellent presentation on the surprising science of motivation, I thought I'd write a mini-version of this, as an introduction and a reminder to myself to write the real one.

Back in 2006 I wrote a blog titled 'The perception of control', dealing with uninformed decision making and the illusion of efficiency, commonly found in control-based companies. Allan Kelly gave a detailed and informative response, including the statement "The fundamental problem facing developers is Complexity. Attempting reuse simply increases complexity."

This got me thinking, and I remembered a long series of interesting experiences I had regarding attempts to scale up development teams, and the very mixed results I got. These days most people know about 'The Mythical Man-Month', but I had experienced many occasions where I needed to argue this point with people that did not. I'm embarrassed to say I was one such person for a while, and needed to learn the hard way myself. One advantage of learning the hard way is that you are forced to try to explain what happened, to yourself and others. So I thought about it, and noticed a number of interesting correlations:
  • Adding people to projects works to some extent for very small teams, but can quickly get out of hand. It seems as if a threshold is crossed after which the project becomes a death march.
  • Assigning tasks of varying levels of complexity to a specific developer has a remarkably similar threshold. Their performance is consistent until they pass some level of complexity, after which the task goes _pear shaped_.
  • New developers on old code can perform well if the code is relatively simple, and perform extremely badly if it is not. Most critically, code that is twice as complex does not simply halve performance. The effects can be much worse. Again there seems to be a threshold.
How do we find, measure or manage this threshold? It is not simple, but there are a few factors to look at to help us analyse the situation. These involve the characteristics of the developer, the team and the project:
  • Each developer has their own threshold, after which the complexity becomes hard to manage. Obviously this threshold is not fixed, and depends on many factors, including their state of mind. Dan Pink's presentation demonstrates an interesting effect where increasing extrinsic rewards decreases performance on creative tasks. He says increased focus can decrease problem-solving skills. I agree, and see a correlation, since decreased problem-solving skills mean the developer will have a much lower _complexity threshold_.
  • The team has a threshold, very strongly correlated to the level of communication in the team, and most of the Agile software development approaches have a lot to say about how to deal with this problem. Usually they try to increase communication and couple that to catching problems early.
  • The project can obviously range from simple to very complex. Many approaches to dealing with this complexity are to split the project into modules and build teams on each, but in many cases this merely replaces one type of complexity with another. The more modern solutions to this involve prototyping, re-factoring, TDD and BDD.
In projects that I manage, I believe we need to deal with all three areas. However, I have a personal bias towards focusing on the first, largely because I'm particularly interested in individual behaviours. I think this helps me also deal with the second issue, because if you have a feeling for the individuals, you can use that to help improve communication in the team. On the third point, I have a solid track record of prototyping and refactoring. I am also a believer in TDD and BDD, but must confess I can do with some improvement in those areas.

Since this blog is merely an introduction to this idea, I will end off with a short piece on how I think we can help individuals keep on the _good side_ of their personal complexity thresholds. I have normally focused on two approaches, but Dan Pink has shown me a really nice new take on this. Let's start with my classic view:
  • Identify the threshold and avoid it. Ok, this sounds like a cop-out, but what I'm talking about is assigning tasks to team members based on their natural skills, and so reducing the risk of crossing the threshold. Of course, one of the best ways to do this is to get the developers to pick the tasks themselves, as is common in several Agile approaches, like Scrum. This is a good start, but does not always work, because while some developers pick the easy tasks and perform well, others are suckers for a challenge. So sometimes you need to advise them. An external opinion can make all the difference. If you are the developer yourself, learning to identify when you're in trouble is a valuable skill, but you will also get help to do so in an Agile team, especially if you use pair programming.
  • Adjust the developer's threshold. Ok, this one sounds hard. Not really, when you realize that this might be as simple as additional training. But one area that really matters is helping the developers identify the thresholds themselves and respond to them. It can be very hard, when stuck in a complex problem, to see outside the box. Again, working in an agile team really helps here. Getting a second opinion on something, even if you might not realize you need it, can make all the difference.
Dan Pink's presentation really builds on this second option. Instead of trying to improve performance with classic rewards, like bonuses, we focus on intrinsic motivators. Dan describes three:
  • Autonomy - the freedom to define your own life. In my world that can mean something as simple as allowing the developers to pick their own tasks. I also see this in respecting their opinions in design meetings. In fact anything that tries to break the control-based management I described in 'the perception of control' is a good start.
  • Mastery - the desire to get better and better. Obviously this can mean training, but coupled to the first point it can mean the option to develop one's career in the direction that one has a passion for. For more on the use of the word 'passion' in this context, I advise reading 'The Hacker Ethic and the Spirit of the Information Age'.
  • Purpose - the yearning to do what we do in the service of something larger than ourselves. Ok, being a fan of open source, this sounds great. Of course it is both more subtle and more applicable than that. People want to do something good, or something of note. In a software project this can be as simple as assigning key features to developers. Someone constantly getting the boring stuff is bound to perform worse.
I found Dan's presentation very interesting. But is this really related to the problem of complexity? I believe it is. If low performance can be attributed to crossing the complexity barrier, then these techniques can be used to move that barrier.

Dan finishes with an introduction to ROWE - the 'results only work environment'. I've been playing with a few ideas in this area, and I'd love to report on them, but need to do that in another blog :-)

Jul 31, 2009

Boolean behaviour

I recently spent a good half hour debugging some grails unit test code, only to track the problem down to Groovy's boolean behaviour. As a Ruby programmer I've become spoiled by the clean and simple predictability of Ruby booleans, and since Groovy is visually so very 'Ruby-esque' I was only too easily deceived.

So, I've made a little summary of different programming languages' boolean behaviour. IMHO there are only two modern languages with simple boolean rules, Ruby and Java, and the rest are unreasonably complex:
  • Java - no coercion, only Boolean/true/false are valid
  • Ruby - no coercion, only nil and false are false, all else is true
  • Python - no coercion, a ton of rules for deciding what is true/false
  • Groovy - coercion to true/false, with a ton of rules for deciding which way to go
  • Perl - no coercion, a ton of rules for deciding what is true/false
  • C - no coercion, 0 is false, all else is true
What a mess! Every single one is different. But this is my opinion; I judge these on the basis of a few criteria:
  • Simple rules, in which case Java, Ruby and C rank high
  • Simple syntax, in which Ruby and C rank high
  • Enables expressive code, in which Ruby ranks high, and Python, Groovy and Perl do quite well (and Java does very badly)
OK, you guessed it, I'm a Ruby fan. But let me justify my opinion with one simple meaningful example. I'll show a series of statements that do the same thing:
a = (a == nil) ? 'default' : a
a = a || 'default'
a ||= 'default'
AFAIK, this is possible as a direct result of the simple boolean logic, and in particular the fact that everything is true except false and nil.
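To make the difference concrete, here is a small sketch of Ruby's truthiness in action. Because only nil and false are falsy, values like 0 and the empty string survive a ||= default assignment:

```ruby
# In Ruby only nil and false are falsy, so ||= assigns the
# default only when the variable is nil (or false).
a = 0
a ||= 'default'     # a stays 0: zero is truthy in Ruby

b = ''
b ||= 'default'     # b stays '': an empty string is truthy in Ruby

c = nil
c ||= 'default'     # c becomes 'default': nil is falsy
```

This is exactly the behaviour the Groovy comparison below trips over.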

The Groovy book I read claims similar behaviour, but in fact it is not true. The book claimed the following equivalent statements:
a = (a == null) ? 'default' : a
a = a ?: 'default'
And if you have a=nil (a=null in Groovy), both the Ruby and Groovy code behave the same. But just try passing in a='' (empty string), or a=0. Ruby will keep the value passed in, while Groovy will re-assign to 'default', which is not what you expected, and not what the Groovy book claimed.

The problem here is that Groovy, Perl and apparently Python decided to try to make developers' lives easier with some convenience rules for booleans, notably that integer 0, empty strings and empty collections are all seen as false. And honestly, there are many scenarios where that is useful, and I've used that fact for years in Perl. And when I first started Ruby coding, I balked at the idea that 0 and '' were true. But it did not take long to see the light. And then I began to remember the pain I had debugging Perl code where the bug was due to an unexpected false when an operation returned a numerical zero or an empty string.

Sorry, I'm convinced. Ruby got it right!

And the remaining question is: since Groovy clearly copied a lot of Ruby syntax, why did they not do it the Ruby way with booleans? Actually I think the answer is obvious once you think about it. Groovy is actually Java inside. Groovy tries to bring nice modern dynamic scripting capabilities to the Java language. Quite a paradox that Java, with the most rigid, predictable boolean behaviour, and the easiest debugging of the lot, should end up with this kind of scripted boolean. What I believe happened is that Groovy decided, quite naturally, to go with the coercion approach to scripting Java. Deep down it is all Java with strict types and strict booleans, but in between Groovy is coercing and converting all types automatically. This approach has been used all over the place.

And once you're on the coercion band-wagon, I think the end result is exactly what Groovy has.