Apr 8, 2014

Paging huge cypher requests

Recently I typed a simple cypher command into the Neo4j browser that effectively 'killed' the server. By this I mean the server process went to 100% CPU usage and remained there. It became unusable. In retrospect I should have expected this, since the command I typed was going to hit about 80% of a 4.5 million record database - all in one transaction!
MATCH (c:GeoptimaEvent)
  SET c :ConfigCheck
  RETURN count(c)
This command finds all nodes labeled GeoptimaEvent and adds the label ConfigCheck. My goal was to change all node labels, first by adding the new one and then by removing the old one. But what happened instead was that the server started struggling to allocate memory to hold the entire 4M change transaction. No other requests to the server could be handled. Luckily Neo4j is resilient against failure. I simply needed to restart the server to get back to my previous state.

So, how do we do this properly?

I was given a great suggestion by Jacob Hansson to split the work into blocks, and Ian Robinson pointed out that SKIP and LIMIT can be applied to the WITH clause. This led to the following command:
MATCH (c:GeoptimaEvent)
  WHERE NOT (c:ConfigCheck)
  WITH c LIMIT 1000
  SET c :ConfigCheck
  RETURN count(c)
See how this command will only change 1000 nodes, and only those that have not already been changed. This is achieved by first streaming (c:GeoptimaEvent) nodes through the filter WHERE NOT (c:ConfigCheck), taking only the first 1000 off the top of the stream using WITH c LIMIT 1000, and then applying the SET c :ConfigCheck label to the nodes. We return the number of nodes changed. By repeating the command until the result returned is zero, we can change all the nodes.
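The repeat-until-zero loop can also be scripted rather than typed by hand. The following is a minimal sketch in Python, assuming a hypothetical `run_query` callable that sends one Cypher statement to the server and returns the single count it reports. Both that callable and the in-memory `FakeServer` stand-in are illustrative assumptions, not part of Neo4j or any driver.

```python
from typing import Callable

# The batched command from above, with the LIMIT left as a placeholder.
BATCH_QUERY = """
MATCH (c:GeoptimaEvent)
  WHERE NOT (c:ConfigCheck)
  WITH c LIMIT {limit}
  SET c :ConfigCheck
  RETURN count(c)
"""


def relabel_in_batches(run_query: Callable[[str], int], limit: int = 1000) -> int:
    """Run the batched query until it reports zero changes; return the total."""
    query = BATCH_QUERY.format(limit=limit)
    total = 0
    while True:
        changed = run_query(query)  # assumed: one transaction per call
        if changed == 0:
            return total
        total += changed


class FakeServer:
    """Hypothetical stand-in for a real server, so the sketch runs on its own."""

    def __init__(self, unlabeled: int, limit: int) -> None:
        self.unlabeled = unlabeled
        self.limit = limit
        self.calls = 0

    def run(self, query: str) -> int:
        self.calls += 1
        changed = min(self.unlabeled, self.limit)
        self.unlabeled -= changed
        return changed


if __name__ == "__main__":
    server = FakeServer(unlabeled=2500, limit=1000)
    print(relabel_in_batches(server.run, limit=1000), server.calls)
```

Note that the loop always makes one extra call at the end, the one that returns zero and tells us there is nothing left to change.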

This command took only a few hundred milliseconds on our standard server (16GB RAM Quad Core i5-4670K, Neo4j 2.0.1 default installation in Ubuntu 12.04 LTS). However, we would have to repeat this command about four thousand times to change the entire database, so let's figure out how big we can go with our transaction size before performance becomes a problem.

By trying the same command with different LIMIT settings, we can see that the performance scales linearly up to around 400000 records. After this, things get noticeably slower, and after 600000 nodes it gets really bad. What is happening here is that garbage collection (GC) is kicking in. And if you go for large enough transactions you could exceed the maximum heap size.
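One way to run this kind of experiment is to time a single batch per LIMIT value and compare the cost per node. This is only a sketch: `run_batch` is a hypothetical callable that runs the batched command with the given LIMIT and returns the count, and the fake version in the demo exists purely so the code runs on its own. It does not reproduce the measurements described above.

```python
import time
from typing import Callable, Iterable, List, Tuple


def time_batch_sizes(
    run_batch: Callable[[int], int], limits: Iterable[int]
) -> List[Tuple[int, int, float]]:
    """Run one batch per LIMIT value and record (limit, changed, seconds)."""
    results = []
    for limit in limits:
        start = time.perf_counter()
        changed = run_batch(limit)  # hypothetical: runs the query with this LIMIT
        elapsed = time.perf_counter() - start
        results.append((limit, changed, elapsed))
    return results


def seconds_per_node(results: List[Tuple[int, int, float]]) -> List[Tuple[int, float]]:
    """Normalise each timing by the nodes changed, skipping empty batches."""
    return [(limit, elapsed / changed) for limit, changed, elapsed in results if changed]


if __name__ == "__main__":
    remaining = [1_000_000]  # pretend node count, for illustration only

    def fake_run_batch(limit: int) -> int:
        changed = min(limit, remaining[0])
        remaining[0] -= changed
        return changed

    timings = time_batch_sizes(fake_run_batch, [1000, 10000, 100000])
    for limit, cost in seconds_per_node(timings):
        print(limit, cost)
```

If the cost per node stays roughly flat, the batch size is still in the linear range. A sharp rise suggests the transaction has grown large enough for memory pressure to matter.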

We only needed to repeat the command ten times with a transaction size of 400,000 in order to change all 4,000,000 nodes. I was happy to do this in the Neo4j browser. If I needed to repeat the command many more times, I would have written a Ruby script and used Neography.
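The number of repetitions follows directly from the node count and the batch size. As a quick check of the arithmetic, using the rounded figure of 4,000,000 nodes from the text (the function name is just for illustration):

```python
def runs_needed(total_nodes: int, batch_size: int) -> int:
    """Number of batches that change at least one node (ceiling division)."""
    return -(-total_nodes // batch_size)


print(runs_needed(4_000_000, 1_000))    # 4000
print(runs_needed(4_000_000, 400_000))  # 10
```

One further run that returns zero confirms that nothing is left to change.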

Now that we've added the new labels, we can remove the old ones by repeating the following command until it returns 0 changes:
MATCH (c:GeoptimaEvent)
  WITH c LIMIT 400000
  REMOVE c :GeoptimaEvent
  RETURN count(c)

A more complex example

The above case was pretty simple really. However, since then I've been faced with a much more challenging example: bulk copying properties from one part of the graph to another.

Let's start with a little background to the problem. The database above is actually a tree structure, with the leaf nodes representing the requests made to a web service. We wanted to data mine the Apache2 log files, and in order to calculate statistics with high performance we built a tree structure with parent nodes representing the kinds of aggregations we would like to make. We imported the data using the csvtreeloader at https://github.com/craigtaverner/csvtreeloader, leading to a tree structure like:
(d:Device)
  -[:versions]-> (v:GeoptimaVersion)
    -[:days]-> (x:EventDay)
      -[:checks]-> (c:ConfigCheck)
We can ask queries like "How many service checks were there per day during March?":
MATCH (x:EventDay)-[r:checks]->()
  WHERE x.day >= "2014-03-01"
    AND x.day < "2014-04-01"
  RETURN x.day as Day, count(r) as Checks
This command returns very quickly and is used in dynamic websites providing instant statistics on the activity of the service.

The problem I faced was that some critical information about the service check, the PLMN code of the device making the check, was saved at the (c:ConfigCheck) level. Considering that we had about 10,000 devices and 4,000,000 config checks, any query on PLMN would hit 400 times as many nodes as needed. We needed to move this critical information up the tree. However, this is not trivial, because the most obvious command to do this will read all 4,000,000 ConfigCheck nodes and repeatedly copy the same information:
MATCH (v:GeoptimaVersion)-->(x:EventDay)-->(c:ConfigCheck)
  WHERE NOT has(v.plmn)
  SET v.plmn = c.mcc+'-'+c.mnc
  RETURN count(v)
This command has two problems:
  • It will read all 4,000,000 ConfigCheck nodes in one transaction (same problem as before)
  • It will set the same PLMN code over and over on the GeoptimaVersion node (wasted effort)
We can fix both problems with the following interesting command:
MATCH (d:Device)-->(v:GeoptimaVersion)
  WHERE NOT has(v.plmn)
  WITH d, v LIMIT 1000
  MATCH (v)-[:days]->()-[:checks]->(c)
  WITH d, v,
    filter(
      x in collect(
        distinct replace(c.mcc+'-'+c.mnc,'"','')
      )
      WHERE NOT x =~ '.*null.*'
    ) as plmn
    LIMIT 1000
  SET v.plmn = plmn
  RETURN v.version_name, v.plmn, plmn
The key points in the above command are:
  • We search first for the Device and Version, filtering for ones without the PLMN and using blocks of 1000. Since there are 10,000 devices, this command only needs to be run about 10 times. Each of these will hit about 10% of the database.
  • We search for all ConfigCheck events that each Device has, and for each device we use collect() and the filter() function to combine them all into a single value for that device.
  • We finally set the value to the requisite parent node and return the results.
These commands each took about 20s to run on the same server. Considering how much is going on here, this is quite impressive performance, I think.

One part of the above command deserves a little more explanation: the construction of the PLMN. We used the following expression:
filter(
  x in collect(
    distinct replace(c.mcc+'-'+c.mnc,'"','')
  )
  WHERE NOT x =~ '.*null.*'
) as plmn
What this does is:
  • Combine the properties c.mcc + '-' + c.mnc
  • This string contained double quotes. For example, my own device has '"240"-"06"' and I expect to see '240-06' for Telenor, Sweden. We use the replace() function to remove the double quotes.
  • Then we reduce the set to distinct values only, using the DISTINCT keyword.
  • And use collect() to make a single set of all matching results.
  • And filter() to remove entries with the value 'null' in them.
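To make the order of these steps concrete, here is a rough Python equivalent of the same pipeline. It is a sketch under assumptions: the mcc and mnc values are taken to be strings that may carry double quotes, with missing values stored as the literal text 'null'. The function name and the sample data are invented for illustration. Python keeps first-seen order here, which Cypher does not promise.

```python
import re
from typing import Iterable, List, Tuple

# Same pattern as the Cypher regex; =~ matches the whole string, so use fullmatch.
NULL_PATTERN = re.compile(r".*null.*")


def collect_plmns(checks: Iterable[Tuple[str, str]]) -> List[str]:
    """Mimic filter(x in collect(distinct replace(...)) WHERE NOT x =~ ...)."""
    combined = (mcc + "-" + mnc for mcc, mnc in checks)      # c.mcc+'-'+c.mnc
    unquoted = (value.replace('"', "") for value in combined)  # replace(..., '"', '')
    distinct = list(dict.fromkeys(unquoted))                   # collect(distinct ...)
    return [x for x in distinct if not NULL_PATTERN.fullmatch(x)]  # filter(...)


if __name__ == "__main__":
    sample = [('"240"', '"06"'), ('"240"', '"06"'), ("null", "null"), ('"240"', '"08"')]
    print(collect_plmns(sample))  # ['240-06', '240-08']
```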
This was a complex example, indeed, and it took a while to figure out. The fun part of it, though, was that this could be done through a little trial and error in the Neo4j Browser. Once or twice I typed in a command that hit too much of the database, but a quick restart later and I was back on track. Once it worked, it worked well, and I was able to migrate the entire database quite quickly.