Performance in Neo4j (“Obligatory Neo4j” Part 4)

This is my fourth post on the popular open-source graph database Neo4j, and I’ll be taking a look at some ideas about how performance can best be managed.  To see the full series, you can check out the Neo4j tag on this blog – alexdgarland.com/category/neo4j/.

It’s been about a month since my last post.  In the meantime I’ve been busy playing around with C/C++, Hadoop and Linux – but it’s definitely time to get back to looking at Neo4j.  This is the final part of my “Obligatory Neo4j” series – although almost certainly not my last word on the topic of Neo4j – and we’re now at the point where we should be able to picture some of the functional aspects of a system built on graph technology.  The question is, can we make it run quickly and efficiently?

As ever, there are already some good resources available from Neo Technology.  These include not only the Graph Databases book, but also a video on Cypher query optimisation from Wes Freeman and Neo’s Mark Needham.

One thing that comes out of that video very clearly for me is that the query optimiser for Neo4j – while already useful – is going through a process of evolution that the equivalent functionality in relational databases such as SQL Server and Oracle more-or-less completed years ago.  When it comes to things like the ordering of WHERE clauses and the placement of filters in the MATCH or WHERE parts of the query, developers using Neo4j need to handle tuning in quite a manual way.

This differs from, say, SQL Server, where the cost-based optimiser will take a declarative query, evaluate various logically equivalent ways of running it and typically reach a “good enough” execution plan without detailed input from the user.  The statement in the video that “soon Cypher will do a lot of this for you” suggests that Neo4j will be heading in the same direction in future versions.
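To make that concrete, here’s a minimal sketch of the kind of choice involved (the Person label, name property and KNOWS relationship are hypothetical examples, not anything prescribed by Neo4j).  The two queries are logically equivalent, but while the optimiser is still maturing it’s worth checking whether the engine treats them differently:

  // Filter expressed inline as part of the MATCH pattern
  MATCH (p:Person {name: "Alice"})-[:KNOWS]->(friend)
  RETURN friend.name

  // The same filter expressed in a WHERE clause
  MATCH (p:Person)-[:KNOWS]->(friend)
  WHERE p.name = "Alice"
  RETURN friend.name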

I’d tend to look on the bright side regarding this stage in Neo4j’s development.  It’s an interesting time to start looking at the product, and most performance-tuning cases should not be too difficult even now.  As the optimiser inevitably becomes more capable, tuning will only get easier – while anyone who has optimised queries by hand will be left with a deeper insight into how the database engine works.

At this point, I’ll note a few principles that are very familiar from working with SQL Server:

  • Queries can typically be made to perform better by allowing execution plans to be reused – in particular by parameterising variables in submitted Cypher calls, which also protects against query injection (there’s a sketch of this after the list).
  • More RAM is better – while it’s not a silver bullet, adding more memory to a server will almost always help performance, allowing better caching of plans and more head-room for processing and storing results.
  • Queries that perform scans of entire data structures (relational tables, graphs, labels etc.) are generally to be avoided if the results you want don’t include the entire data set.  Some form of selective seek or lookup is likely to be quicker and less resource-intensive.
  • These selective actions can be made easier by following a smart indexing strategy, as sketched below.
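To illustrate the parameterisation and indexing points together – this is only a sketch, with a hypothetical Person label and name property, and using the curly-brace parameter syntax of current (2.x) Cypher:

  // Create an index so that lookups on Person.name can seek rather than scan
  CREATE INDEX ON :Person(name)

  // A parameterised query - the value for {name} is supplied separately
  // alongside the call, so the plan can be reused and injection is avoided
  MATCH (p:Person {name: {name}})
  RETURN p

Submitting the same query text with different parameter values lets the engine recognise and reuse a cached execution plan, rather than compiling a new one for each literal value.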

Diving deeper though, the differing nature of graph storage starts to come into play.  There are specific characteristics to the work Neo4j has to do to answer certain queries, and these can be advantageous in some circumstances.

The key differences come from a divergence in the way data is structured on disk and in memory.  A relational database is built around tables (“relations”) as fundamental structures and, while indexes may help optimise particular paths to access the data, they are still typically “global” across an entire table.  Relationships between tables are navigated via join columns, which are typically included in one or more of these global indexes.  As the size of the tables grows, the time and cost to navigate them is likely to increase significantly, even with a good indexing strategy in place.

Neo4j takes a different approach, structuring data in such a way that relationships between nodes are completely first-class entities and aiming to achieve “index-free adjacency”.   The implementation basically takes the form of pointer arithmetic over contiguous arrays of fixed-size primary records, allowing constant-time random access to nodes and relationships and then linking to other resources such as properties as required.  A lot more detail on this model and the advantages it can have is available in Chapter 6 of the Graph Databases book.
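To give a rough sense of what that means in practice (using the record sizes quoted in the book, which may vary between Neo4j versions): node records are a fixed nine bytes each, so the record for the node with ID 100 sits at byte offset 100 × 9 = 900 in the node store – one multiplication and one read, however many nodes the store holds.  Relationship records are likewise fixed-size (33 bytes in the book’s description), which is what makes the constant-time pointer-chasing possible.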

Setting aside the fine detail, it all leads to one thing: once you have found an entry point to the graph, the cost of accessing related nodes is proportional not to the total size of the graph database (which can be very large) but to the size of the part of the graph that actually contains the data you’re interested in.  The potential advantages of this approach should be obvious.

Nothing comes for free though.  To reliably get this kind of fast, highly scalable performance, you need to design and query your Neo4j database in a way that works well with the underlying graph model and its associated data structures.

The resources provided by Neo Technology give some good hints on how to achieve this, but the following are definitely worth considering:

  • Make sure you understand how you can limit which parts of the graph will be used and searched.
  • Make sure you have explicitly created the index or indexes you need to support this restriction.
  • Check that executed Cypher queries actually behave in the way you expect; it sounds like the “profile” command could be helpful with this.
  • Understand how different parts of your query work in isolation and consider piping only the required results between them using the WITH keyword.
  • Minimise the use of variable-length paths in pattern matches – e.g. “(node1)-[:RELATIONSHIP*1..5]-(node2)”.  While these make queries a lot more flexible and robust in the face of change, the database engine is forced to check more possible paths and hence do more work in executing the query.
  • If you want to have a variant of a relationship – e.g. home vs work address – there are two ways to do it:
    • adding a property to each relationship – “-[:ADDRESS {type: "Home"}]-”
    • adding two different relationship types – “-[:HOME_ADDRESS]-” and “-[:WORK_ADDRESS]-”

Because the primary relationship type is stored directly in the structure representing the graph, whereas properties are stored at one remove, the second option will typically perform better.
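As a sketch of how the two options look in practice (the Person label and address node are hypothetical), these two queries return the same addresses – and prefixing either with the profile command mentioned above is a good way to compare the work the engine actually does:

  // Option one: the engine must follow each ADDRESS relationship out to its
  // property records in order to test the type value
  MATCH (p:Person)-[:ADDRESS {type: "Home"}]->(a)
  RETURN a

  // Option two: the type is checked directly on the fixed-size relationship
  // record, with no property lookup needed
  MATCH (p:Person)-[:HOME_ADDRESS]->(a)
  RETURN a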

  • Add additional relationships.  Complex and multiple interactions between nodes are much easier to add in Neo4j than in a relational database, so in the above example we could theoretically have both sets of relationships, as long as we’re sure they’re useful and we don’t mind the extra overhead of coding and maintenance.

As a specific performance optimisation, Neo Technology actually recommend adding direct relationships between nodes which duplicate the information implicit in less direct relationships.  For example, if two people both worked at the same company, rather than having to go via the company node every time, we can add a direct relationship – “-[:WORKED_WITH]-”.  These secondary sets of relationships can be maintained asynchronously (as a frequently scheduled update job) to minimise the impact on initial write times when a node is created – a sketch follows below.
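Here’s a sketch of what such a scheduled job might look like in Cypher (the Person and Company labels and the WORKED_AT relationship are hypothetical; MERGE is available from Neo4j 2.0).  It also shows the WITH keyword from the earlier list being used to pipe only the required pairs into the write stage:

  // Find pairs of people who worked at the same company
  MATCH (a:Person)-[:WORKED_AT]->(c:Company)<-[:WORKED_AT]-(b:Person)
  WHERE id(a) < id(b) // consider each pair only once
  WITH DISTINCT a, b
  // Add the shortcut relationship only where it doesn't already exist
  MERGE (a)-[:WORKED_WITH]-(b)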

Beyond this list for working at a query level, there are other, lower-level options for performance improvement.  As mentioned in my last post, using the Traversal, Core or Kernel API rather than Cypher should allow progressively more focussed tuning of specific processes.

There are also some architectural and hardware-based options.  The commercially licensed version of Neo4j offers master-slave replication for horizontal read-scaling, and a technique called cache sharding can be used to increase the chance of the data a query needs already being held in main memory.  There are more details on that here.

What has to be remembered when scaling out, or when considering performance more generally, is that Neo4j is – in terms of NoSQL and the CAP theorem – a well-performing but ultimately consistency-driven (roughly speaking, “CP”) system.  Read-scaling is one thing; high-end write scalability is an inherently hard problem which may better suit an availability-driven (“AP”) system using a key-value, document or column-oriented model.  These kinds of systems explicitly accept a looser (“eventual”) definition of consistency in return for the ability to run across a very large number of distributed servers, and make choices about how to deal with the complexity and uncertainty this creates.  Where that kind of extreme distribution of data storage (combined with high availability) is not required, Neo4j offers many other benefits in terms of greater expressivity, reliability and relative ease of use – and in most cases it should be able to perform very well if appropriately managed and tuned.
