Aug 12, 2014

Neo4j read-only transactions still need success()

Since Neo4j 2.0 it has been necessary to wrap read-only database access in transactions, just as you needed to do for write access in 1.x. This has meant refactoring a lot of legacy code, adding transactions around code that previously did not require them. It turns out that there is a subtlety to this: a case where it is easy to make a mistake that does not necessarily cause any immediate problems.

Let's first explain a common pattern that leads to this issue. Consider a method with multiple return statements:
public String getName(Node node) {
    if(node.hasProperty("name")) {
        return (String) node.getProperty("name");
    } else {
        return null;
    }
}
This code is a bit contrived, but it does represent a common pattern: multiple return statements in code that performs read-only access to the database. I've seen much more complex cases, but this one suffices to demonstrate the problem. So, what happens if we add a transaction? The easiest approach is to wrap the entire method body:
public String getName(Node node) {
    try (Transaction tx = node.getGraphDatabase().beginTx()) {
        if(node.hasProperty("name")) {
            return (String) node.getProperty("name");
        } else {
            return null;
        }
    }
}
You might immediately notice I left out the tx.success() call. I did this for three reasons:
  • I would need to add multiple calls, one before each return.
  • If I follow the pattern of calling tx.success() after all database access, I would have to create a new local variable for the result of the getProperty() call, making the code more complex.
  • It turns out that this code runs fine anyway, so why bother with the tx.success() call?
This last point is the subtle catch. While this code seems to run fine, there are cases when it really does not run fine. Before explaining those cases, let's just look at how the code would look if we were strict with the tx.success() call.
public String getName(Node node) {
    try (Transaction tx = node.getGraphDatabase().beginTx()) {
        if(node.hasProperty("name")) {
            String name = (String) node.getProperty("name");
            tx.success();
            return name;
        } else {
            tx.success();
            return null;
        }
    }
}
Clearly this is noticeably more verbose than the previous code. And since the previous code worked, it is extremely tempting to write code like that. And I did. And all hell broke loose, and it took a while to figure out why.

It turns out that while a single transaction like this does run fine without the tx.success() call, it does not work with nested transactions. If other code calls your method, and that code also has a transaction which does call tx.success(), then an exception will be raised. The inner transaction, even though it is only a nested dummy transaction, marks the whole transaction as not having completed successfully when it is closed without success(). If the outer transaction then marks success, the database rejects this inconsistency and throws a:
TransactionFailureException
 - Unable to commit transaction
with cause:
RollbackException
 - Failed to commit, transaction rolled back
and no further information.

When I first saw this, I could not understand where it was coming from. I was in the process of updating legacy code by adding more read-only transactions, and successfully fixing several unit tests. However, I found that after fixing some unit tests, other apparently unrelated unit tests started failing with this exception. This happened because the first fixes added transactions deeper in the call stack, in the same methods used by other tests that had already added transactions higher in the stack. Normally it does not matter if you nest transactions; it is a common pattern. So it was not obvious, at first, that the nesting would cause the problem, but it sure did.

To clarify the cases when this problem occurs I wrote a little test code:
private void readNamesWithNestedTransaction(boolean outer, boolean inner) {
    try (Transaction tx_outer = graphDb.beginTx()) {
        try (Transaction tx_inner = graphDb.beginTx()) {
            readNames();
            if (inner) {
                tx_inner.success();
            }
        }
        if (outer) {
            tx_outer.success();
        }
    }
}
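Here readNames() is just some arbitrary read-only work; its exact contents do not matter for the test. A minimal sketch of what it might look like (the body below is my own assumption, not taken from the original test, and it assumes the GlobalGraphOperations helper from the 2.x tooling package):
private void readNames() {
    // Plain read-only access with no transaction handling of its own;
    // the callers above are expected to provide the transactions.
    for (Node node : GlobalGraphOperations.at(graphDb).getAllNodes()) {
        if (node.hasProperty("name")) {
            String name = (String) node.getProperty("name");
            // a real test might collect or assert on the names here
        }
    }
}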
The readNamesWithNestedTransaction() method takes two booleans, which control whether tx.success() is called in the inner transaction, the outer transaction, both, or neither. It turns out only one of the four possible combinations causes an exception (tested against Neo4j 2.1.2):
                     Outer: success     Outer: no success
Inner: success       OK                 OK
Inner: no success    Exception          OK

In the cases where the inner and outer are the same, no exception is thrown because there is no inconsistency. If both call success(), the transaction commits cleanly. If both omit the success call, the entire transaction rolls back, but since it is a read-only transaction, nothing is actually undone. If the inner calls success() and the outer does not, the outer ignores the success of the inner transaction and decides on a full rollback, which again does nothing. However, in the specific case where the inner omits success() but the outer calls it, the database marks this as an invalid case and throws the exception above, despite the transaction being read-only.

My conclusion? While it is not really wrong to skip the success() on read-only transactions, it is certainly a very bad idea. Perhaps you can get away with it on the outer-most main loops of an app, when you are really certain you will not be called by something else, but be very, very wary of forgetting the success call in any methods you expect to be called from elsewhere. Either have no transaction at all (rely on the calling code to add transactions), or accept the higher verbosity that comes with writing the code right.
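If you want to keep calling success() without duplicating it before every return, one option is to push the transaction handling into a small helper and pass the read-only work in as a lambda. This is only a sketch of the idea, assuming Java 8 and the Neo4j 2.x embedded API; the readOnly() helper name is my own:
private static <T> T readOnly(GraphDatabaseService graphDb, Supplier<T> work) {
    // requires: import java.util.function.Supplier;
    try (Transaction tx = graphDb.beginTx()) {
        T result = work.get();
        tx.success(); // always mark success for read-only work
        return result;
    }
}

public String getName(Node node) {
    // the multiple-return logic moves into the lambda
    return readOnly(node.getGraphDatabase(), () ->
        node.hasProperty("name") ? (String) node.getProperty("name") : null);
}
This keeps the success() discipline in one place, so methods called from inside other transactions never silently mark the whole transaction for rollback.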

May 10, 2014

Using GoPro Time-Lapse Photos for Mapillary


Over the last half year Mapillary has grown in popularity, with over half a million photos uploaded, more than 100k in the last ten days alone! And this is despite the fact that at first glance it feels like you have to use a normal hand-held smartphone to do the job. What if you could use a mounted action camera, like a GoPro, to take photos for Mapillary? Mounted on a bicycle, a helmet, even a car? It turns out you can, and last week I did a little drive and collected about 1700 photos in half an hour using time-lapse mode, consuming only 3.4GB of my 32GB SD card. I could have driven for over four hours and taken around 15k photos before filling my card!

However, it is not entirely trivial to do this, so I thought I'd explain a bit about how it is done. The main problem is that the GoPro has no GPS. How then can the map know where to place the photos? It cannot. We have to tell it, and we have to use a separate GPS to do so. In my case I've been using my HTC One smartphone, which I also previously used for normal Mapillary photos. In brief the procedure is:
  • Mount the GoPro somewhere, on a bicycle or car using the various mounting accessories available.
  • Start a GPS tracking app. I used geopaparazzi, but there are many others that will work just as well.
  • Configure the GoPro to take time-lapse photos. I chose 5MP, medium, narrow, 1s.
  • Start the GoPro and go for a drive or cycle tour.
  • Upload the photos and GPX track to a computer.
  • Geolocate the photos by time-correlating them with the GPX track using gpx2exif.
  • Upload the photos to Mapillary.
OK, that doesn't sound too difficult, right? So let's describe these steps in more detail, using the drive I made last week.


I mounted the GoPro onto the front left fender of my Mazda RX8 to get a nice view of the center of the road.


Then I mounted my HTC One into a dedicated HTC mount inside the car. Using the 'GoPro App', I could connect to the camera and get a preview of what the GoPro can see. This is convenient, but not really necessary, as the preview disappears as soon as the recording starts. It is sufficient to just have the phone lying in the car somewhere, as long as it gets a good GPS signal. If you do use the GoPro app, take the opportunity to go to the settings and set the camera time to minimize the time difference between the two devices. As you'll see later, you will need to correct for differences in order to perform a correlation, and the smaller the correction the easier the job.


Make sure the GoPro is configured for time-lapse photos with the photo settings you like. This is easy to do using the GoPro App, but can also be done on the GoPro itself. I chose 5MP, Medium, Narrow to get a view I felt was somewhat similar to what I would see with my phone, with a focal length equivalent to approximately 20mm on a 35mm camera. The GoPro defaults to very wide angle views with a lot of distortion, so I avoided that for this drive. In another blog I plan to describe an alternative scenario where I got Mapillary photos from a wide angle 4K video stream. That was more complex, so we'll skip it for now. For the time lapse I selected a 1s interval to give about one photo every 10-20m at speeds between 40 and 80 km/h. The Mapillary default of one photo every 2s is more suitable for cycling, and I was certainly planning to drive faster than I cycle!

Start the GPS tracking app. In my case I used geopaparazzi. In this app there is a button for starting a recording. I clicked that button and accepted the suggested file name for storing the track. OK, now we are all ready to go. Just start the camera recording and drive!


When you have finished the drive, stop the recording on both the camera and the GPS. Now the real work starts. We need to geolocate the photos before we can upload them to Mapillary. In Geopaparazzi, I exported to GPX, then used a file manager app to find my GPX file and email it to myself. For the GoPro I simply pulled out the SD card, plugged it into the side of my laptop and copied the photos over to a local directory.

The first thing I wanted to do was see what the drive looked like, so I ran:
geotag -g väla_to_billesholm.gpx \
       -o väla_to_billesholm.png \
       -D 1 -s 2048x2048

This produced a nice high res view of the entire map. Note the use of the -D option to put a big gap between the marker labels. This is necessary because the geotag default is set for much smaller tracks, like a quick bicycle ride. From the image produced, we can see the times and locations for key landmarks on the drive. We will want to zoom in on a few distinctive areas we can use to manually check that the camera and GPS clocks are synchronized, or determine exactly the error between them. We can correct for this error when correlating, so it is important to get it right.


I ran the command:
geotag -R 20140505T12:39:00+02-20140505T12:41:00+02 \
       -g väla_to_billesholm.gpx \
       -o krop.png -D 0.1

which produced the image above. I can look for a photo taken when going under the highway bridge and confirm the times.


The EXIF data for this photo shows it was taken at 12:39:11. When looking at the map, we can see that we passed under the tunnel at 12:39:14. So we have a 3s error. We can use this in the geolocation process, but let's double check with another photo.


I created a map of the track through Mörarp, because I could identify features like buildings and intersections. It is a bad idea to use any intersection that you might have paused at, like I did at the right turn above, but look for landmarks along the drive where you were sure to be moving. I looked at the first road to the right, near the top of the map, and found the following photo at 12:46:10.


I've taken note of things I know to be at the specified location, the speed warning sign in the middle, the white marker on the right for the intersection, and the tree and lamp-post on the side.

One interesting double-check you can do, if you happen to be in an area covered by Google Street View, is to visually compare those images.


In Google Street View we can see the speed warning, the intersection marker and the tree and lamp-post, but notice that the fence is not there and fir trees can be seen instead. Clearly someone has been building since the Google car passed! Google said the image was captured in September 2011, which is about 2.5 years ago, so things can certainly change.

On the GPX track of Mörarp, we see this intersection was passed at 12:46:13, which is 3s after the camera image timestamp. Again we have a 3s offset. This is good news. It appears the entire track was off by the same amount. We can go ahead and correlate all 1500 photos using the command:

geotag -g väla_to_billesholm.gpx \
       20140505_TimeLapse/*JPG -t 3 -v

I set the time offset with the '-t 3' option, and used '-v' to watch the progress. Since the script actually wraps the command-line tool 'exif_file', a new process is started to edit each file, and this can take a while, but in the end your images will have the GPS positions from the GPX track written into their EXIF data.


Once the images are geolocated, then you can upload them to mapillary.com. Simply login, then click on your name, choose 'upload images', then click the 'choose files' green button. Once the files are all listed, scroll to the bottom and click the 'Start Uploading' button. It will go a slightly paler green, so it is not that obvious that the uploads have started. Just scroll to the top again and you should see thin red progress bars below each image as they are uploaded.


And finally, once the upload is completed, click your name and select 'my uploads', and you should see a new image for your track.


Click your most recent upload to browse the photos in Mapillary!


It might take time for the photos to be completely processed, so don't worry if they are not immediately available. Just come back and look a little later. In the meantime there is something else really fun to play with!

Time Lapse Video

The GoPro settings for taking these photos were called 'Time Lapse' for a reason. They can be used to make a video. Since we recorded one frame a second, if we make a video at 25fps, we will have a 25x speedup. This is pretty cool! See what I made below....


This video was made using the following process:
  • Rename all photos to have names with sequential numbers starting at 0000. In my case I used foo-0000.jpeg. To make life easy, I wrote a Ruby script to create hard links with the appropriate names (a rough sketch of this step appears after this list). Then you can use the ffmpeg command to make a video like this:
ffmpeg -f image2 -i foo-%04d.jpeg \
       -r 25 -s 1280x960 ../myvid.mp4
  • This command compresses the 7MP 4:3 images to 960p HD video.
  • Then I used OpenShot video editor to trim to 16:9, add audio, add the map, and compress to 720p medium quality for faster web uploads.
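For reference, here is a rough sketch of the renaming step mentioned in the list above. The original used a Ruby script; this Java version is only an illustration of the same idea (sequentially numbered hard links that ffmpeg can read as frames), and the directory and file names are assumptions:
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class LinkFrames {
    public static void main(String[] args) throws IOException {
        Path source = Paths.get("20140505_TimeLapse"); // directory of GoPro photos (assumed)
        Path target = Paths.get("frames");             // where the numbered links go
        Files.createDirectories(target);
        List<Path> photos = new ArrayList<>();
        try (DirectoryStream<Path> dir = Files.newDirectoryStream(source, "*.JPG")) {
            for (Path photo : dir) {
                photos.add(photo);
            }
        }
        Collections.sort(photos); // GoPro file names sort in shooting order
        int frame = 0;
        for (Path photo : photos) {
            Path link = target.resolve(String.format("foo-%04d.jpeg", frame++));
            Files.createLink(link, photo); // hard link, so no extra disk space is used
        }
    }
}
Hard links avoid copying several gigabytes of photos just to get ffmpeg-friendly names; point the ffmpeg command at the frames directory, or adjust the paths to taste.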

Installing gpx2exif

In this blog we made extensive use of the geotag command. This command is included in the 'gpx2exif' Ruby Gem. For this blog we made use of features available in version 0.3.6. However, at the time of writing, the published gem was at version 0.3.1. So I will explain both the easy way to install (assuming we get 0.3.6 published before you try this) and the slightly harder way (if 0.3.6 is not published, or you wish to make changes to the script yourself).

Installing Ruby on Ubuntu

The first thing you need is Ruby. This depends on the OS. I use Ubuntu 14.04 and RVM, so I recommend you refer to ruby-lang.org and rvm.io for advice on your own platform. Here I'll give instructions for my OS, so you will need to make some adjustments to suit your platform.
sudo apt-get install curl
curl -sSL https://get.rvm.io | sudo bash -s stable --ruby
sudo usermod -g rvm craig
# logout and login to get RVM group
source /etc/profile.d/rvm.sh

The code that creates the png images above makes use of ImageMagick. On Ubuntu at least this means you need to install a few dependencies first:
sudo apt-get install imagemagick imagemagick-doc libmagickwand-dev
gem install rmagick # Requires libmagickwand-dev to compile

Installing from rubygems.org

Once ruby is installed, simply install the gem:
gem install gpx2exif

Then list the installed gems with 'gem list' to see which version you got. If it is not 0.3.6 or later, then use the instructions below.

Installing from github

Install git, and then run the command:
git clone https://github.com/craigtaverner/gpx2exif
cd gpx2exif
bundle install
rake build
gem install pkg/gpx2exif-0.3.6.gem

If all goes well, you will have built and installed the latest Ruby Gem.

Have fun!

Apr 8, 2014

Paging huge cypher requests

Recently I typed a simple Cypher command into the Neo4j browser that effectively 'killed' the server. By this I mean the server process went to 100% CPU usage and stayed there. It became unusable. In retrospect I should have expected this, since the command I typed was going to hit about 80% of a 4.5 million node database - all in one transaction!
MATCH (c:GeoptimaEvent)
  SET c :ConfigCheck
  RETURN count(c)
This command finds all nodes labeled GeoptimaEvent and adds the label ConfigCheck. My goal was to change all the node labels, first by adding the new one and then by removing the old one. But what happened instead was that the server started struggling to allocate enough memory to hold the entire 4 million node change in a single transaction. No other requests to the server could be handled. Luckily Neo4j is resilient against this kind of failure: I simply needed to restart the server to get back to my previous state.

So, how do we do this properly?

I was given a great suggestion by Jacob Hansson to split the work into blocks, and by Ian Robinson, who pointed out that the SKIP and LIMIT clauses can be applied to the WITH clause. This led to the following command:
MATCH (c:GeoptimaEvent)
  WHERE NOT (c:ConfigCheck)
  WITH c LIMIT 1000
  SET c :ConfigCheck
  RETURN count(c)
See how this command will only change 1000 nodes, and only those that have not already been changed. This is achieved by first streaming (c:GeoptimaEvent) nodes through the filter WHERE NOT (c:ConfigCheck), taking only the first 1000 off the top of the stream using WITH c LIMIT 1000, and then applying the SET c :ConfigCheck label to the nodes. We return the number of nodes changed. By repeating the command until the result returned is zero, we can change all the nodes.

This command took only a few hundred milliseconds on our standard server (16GB RAM Quad Core i5-4670K, Neo4j 2.0.1 default installation in Ubuntu 12.04 LTS). However, we would have to repeat this command about four thousand times to change the entire database, so let's figure out how big we can go with our transaction size before performance becomes a problem.


By trying the same command with different LIMIT settings, we can see that the performance scales nicely and linearly up to around 400,000 records. After this, things get noticeably slower, and after 600,000 nodes it gets really bad. What is happening here is that garbage collection is kicking in, and if you go for large enough transactions you could exceed the maximum heap size.

We only needed to repeat the command ten times with a transaction size of 400,000 in order to change all 4,000,000 nodes. I was happy to do this in the Neo4j browser. If I needed to repeat the command many more times, I would have written a Ruby script and used Neography.
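If you do want to script the repetition, a small loop around the Cypher statement is all it takes. Here is a rough sketch in Java against the embedded API (the Ruby-plus-Neography route mentioned above would work just as well; this version is my own illustration and assumes the org.neo4j.cypher.javacompat.ExecutionEngine API from the 2.0/2.1 releases):
ExecutionEngine engine = new ExecutionEngine(graphDb);
String statement =
    "MATCH (c:GeoptimaEvent) " +
    "WHERE NOT (c:ConfigCheck) " +
    "WITH c LIMIT 400000 " +
    "SET c :ConfigCheck " +
    "RETURN count(c) AS changed";
long changed;
do {
    // each execute() call handles at most 400,000 changes in its own transaction
    ExecutionResult result = engine.execute(statement);
    changed = (Long) result.columnAs("changed").next();
    System.out.println("Changed " + changed + " nodes");
} while (changed > 0);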

Now that we've added the new labels, we can remove the old ones by repeating the following command until it returns 0 changes:
MATCH (c:GeoptimaEvent)
  WITH c LIMIT 400000
  REMOVE c :GeoptimaEvent
  RETURN count(c)

A more complex example

The above case was pretty simple really. However, since then I've been faced with a much more challenging example, bulk copying properties from one part of the graph to another.

Let's start with a little background to the problem. The database above is actually a tree structure, with the leaf nodes representing the requests made to a web service. We wanted to data mine the Apache2 logfiles, and in order to calculate high-performance statistics we built a tree structure with parent nodes representing the kinds of aggregations we would like to make. We imported the data using the csvtreeloader at https://github.com/craigtaverner/csvtreeloader, leading to a tree structure like:
(d:Device)
  -[:versions]-> (v:GeoptimaVersion)
    -[:days]-> (x:EventDay)
      -[:checks]-> (c:ConfigCheck)
We can ask queries like "How many service checks were there per day during March?":
MATCH (x:EventDay)-[r:checks]->()
  WHERE x.day >= "2014-03-01"
    AND x.day < "2014-04-01"
  RETURN x.day as Day, count(r) as Checks
  ORDER BY x.day DESC
This command returns very quickly and is used in dynamic websites providing instant statistics on the activity of the service.

The problem I faced was that some critical information about the service check, the PLMN code of the device making the check, was saved at the (c:ConfigCheck) level. Considering that we had about 10,000 devices and 4,000,000 config checks, any query on PLMN would hit 400 times as many nodes as needed. We needed to move this critical information up the tree. However, this is not trivial, because the most obvious command to do it will read all 4,000,000 ConfigCheck nodes and repeatedly copy the same information:
MATCH (v:GeoptimaVersion)-->(x:EventDay)-->(c:ConfigCheck)
  WHERE NOT has(v.plmn)
  SET v.plmn = c.mcc+'-'+c.mnc
  RETURN count(v)
This command has two problems:
  • It will read all 4,000,000 ConfigCheck nodes in one transaction (same problem as before)
  • It will set the same PLMN code over and over on the GeoptimaVersion node (wasted effort)
We can fix both problems with the following interesting command:
MATCH (d:Device)-->(v:GeoptimaVersion)
  WHERE NOT has(v.plmn)
  WITH d, v LIMIT 1000
  MATCH (v)-[:days]->()-[:checks]->(c)
  WITH d, v,
    filter(
      x in collect(
        distinct replace(c.mcc+'-'+c.mnc,'"','')
      )
      WHERE NOT x =~ '.*null.*'
    ) as plmn
    LIMIT 1000
  SET v.plmn = plmn
  RETURN v.version_name, v.plmn, plmn
The key points in the above command are:
  • We search first for the Device and Version, filtering for ones without the PLMN and using blocks of 1000. Since there are 10,000 devices, this command only needs to be run about 10 times. Each of these will hit about 10% of the database.
  • We search for all ConfigCheck events that each Device has and for each we apply the filter() method to combine them all into a single value for that device.
  • We finally set the value to the requisite parent node and return the results.
These commands each took about 20s to run on the same server. Considering how much is going on here, this is quite impressive performance, I think.

One part of the above command deserves a little more explanation: the construction of the PLMN value. We used the following expression:
filter(
  x in collect(
    distinct replace(c.mcc+'-'+c.mnc,'"','')
  )
  WHERE NOT x =~ '.*null.*'
) as plmn
What this does is:
  • Combine the properties c.mcc + '-' + c.mnc.
  • This string contains double quotes; for example my own device gives '"240"-"06"' where I expect to see '240-06' for Telenor, Sweden. We use the replace() function to remove the double quotes.
  • Then we reduce the results to distinct values only, using the distinct keyword.
  • And use collect() to gather all matching results into a single collection.
  • And filter() to remove entries with the value 'null' in them.
This was a complex example indeed, and it took a while to figure out. The fun part of it, though, was that this could be done through a little trial and error in the Neo4j Browser. Once or twice I typed in a command that hit too much of the database, but a quick restart later I was back on track. Once it worked, it worked well, and I was able to migrate the entire database quite quickly.