Jan 16, 2013

Unwanted side effects

Last week the performance of Neo4j Spatial improved by 100 times! What the..? Can this be real? Does it mean Neo4j Spatial was always 100 times too slow? Of course, the truth is more subtle than that. Firstly, it was only the performance of the 'add geometry' function that changed, and only for the newer IndexProvider API.

Take a look at the chart above. The yellow and green show performance starting at around 300 geometries/second improving quickly to over 1000 nodes/second on my laptop. This was after applying Axel Morgner's bug-fix. Before the fix, the blue and red lines show the performance dropping quickly to as low as 10 geometries/second once the index contains around 20000 geometries. That is way too slow. In fact the performance was only acceptable for the first 1000 geometries or so, only good for small cases, perhaps the test cases. And that is the first hint as to why this was not noticed until recently.

The idea that a bug fix can break something else is not new. And code that works reliably in one place can fail miserably in another. Sometimes the results can be quite dramatic, as we see above. So what really happened here? It all started with a bit of old code that chained all geometry nodes together in a long chain. This code was not really needed, except by one obscure test case, and one unused API method. It was originally coded as a trivial example of a domain model, since Neo4j Spatial allows you to add an index to any graph structure you wish. But as time went by this model became the default case for whenever users did not have their own domain model, which is all of the time when using the higher level API's. And, of course, no-one bothered to remove it, because it had no obvious harmful side effects. Yet!

And then two things happened that are related to this code. Firstly, in 2011 Peter Neubauer developed a new API on Neo4j Spatial enabling the use of the Spatial Index through the standard Neo4j index API. This new code wanted to check against adding the same node twice to the same index. The easy solution to this was to use the old code, the old long chain of connected geometry nodes. This was tried and tested code, a tried and tested data structure, so it should be good, right? Wrong!

Then in September 2012 Axel Morgner noted that under some circumstances the old code caused NullPointerExceptions. This occurred if there was an error during the transaction that added the geometry. This error could occur in any application level code, and would simply roll back the transaction, which should be fine, but not with that old code. There was a static reference to the end of the geometry chain, a variable called previousGeomNode, that would end up pointing to a Node that had been rolled back. The simple fix was to not keep this reference at all. But that meant that each addition would take a little longer than the previous, because it would have to traverse the chain to find the end before adding to it. You can see quickly that this will escalate to a serious problem. After some upgrades Axel found he could not reproduce the bug, so it was never fixed because the obvious fix would cause a more serious bug.

And then the real problem came. Not from the NPE, but from the shift of users from the old API to the new one. It turned out that the new API was using a version of the long-chain code that did not have the NPE bug. No static reference. Nice clean code. But that meant it should have the expected performance problem. And it sure did. As more users started loading larger datasets, the performance of Neo4j Spatial fell through the floor.

In November 2012, Jonathan Winterflood noticed that performance dropped dramatically as more geometries were added through the IndexProvider API, and submitted bug report 72. The problem was discussed on the forums and others noticed this problem too. Rune Sörensen said "this is a show stopper for spatial data". And then the community came to the rescue. Jonathan submitted a pull request for a good test case demonstrating the problem. Then Axel Morgner submitted a series of pull requests fixing the problem. Finally Craig merged in the fixes and did some tests on the old and new code, resulting in the dramatic chart above.

But this was not all. I looked deeper at the problem and felt that the old 'long chain' code was the real problem here, and while Axel's fixes removed its used from the new API, it was still lying there tempting others to see what bugs they could write in future. So I removed this code almost completely. All standard code, test cases and formal API's no longer use this code. The consequence is the geometries are no longer ordered by insert order, something no-one has requested (yet). But just in case they do, or they already rely on this behavior in the old code, a 'new' class now exists called OrderedEditableLayer that uses this old code, in a small, self-contained way. Developers have to explicitly make use of that code if they want this behavior. And right now the only code I know that still uses this is the ShapefileImporter, and then only if you pass a special flag to enable geometry ordering.

OK, so all is well in spatial again. But before ending, I wanted to give a quick overview of the various API's mentioned above. Let's move away from the abstract discussions on side effects and get into some actual code. Here are some examples, each using a different one of the various API's mentioned above.

IndexProvider: compatible with other Neo4j indexes

Let's start with the newer API because it was the one with the bug. This API does not give you access to the full feature set of Neo4j-Spatial, but is convenient because of its compatibility with other Neo4j indexes. In the example below we show how to create an index, add geometries to it and then query for whether they are inside a polygon:

// Create the index
Map<String, String> config = SpatialIndexProvider.SIMPLE_POINT_CONFIG;
Index<Node> index = db.index().forNodes("layer1", config);

// Add the geometry
Transaction tx = db.beginTx();
Node n1 = db.createNode();
n1.setProperty("lat", (double) 56.2);
n1.setProperty("lon", (double) 15.3);
index.add(n1, "dummy", "value");

// Query the index
IndexHits<Node> hits = index.query(
    "POLYGON ((15 56, 15 57, 16 57, 16 56, 15 56))");

This example was extracted from the code in IndexProviderTest, which contains many more examples for how to interact with the index.

SpatialDatabaseService: the original API

Here you can access the RTree in a very flexible way, and can also plugin your own GeometryEncoders and do all kinds of wild things with Neo4j Spatial. For the purpose of comparison, I'll focus only on that part in common with the other API, adding and querying simple point geometries.

// Create SpatialDatabaseService on existing database
SpatialDatabaseService geo = new SpatialDatabaseService(db);
EditableLayer layer = (EditableLayer) geo.createSimplePointLayer("test", "lon", "lat");

// Add the geometry
Transaction tx = db.beginTx();
Node n1 = db.createNode();
n1.setProperty("lat", (double) 56.2);
n1.setProperty("lon", (double) 15.3);
layer.add(n1);    // Compare this to the index.add(node,key,value) method above

// Use GeoPipeline to find geometries around point
List<SpatialDatabaseRecord> results = GeoPipeline
  .startNearestNeighborLatLonSearch(layer, new Coordinate(15.3, 56.2), 1.0)

This code is using the GeoPipeline API to perform the search. This is a useful approach because GeoPipeline can actually chain many operations together, resulting in quite complex GIS queries performed in a streaming, and effecient way. Look at the examples in TestSimplePointLayer.java (for point data) and GeoPipesTest.java for more complex examples.