Graph Data Modelling (“Obligatory Neo4j” Part 2)

First of all – a quick “thank you” to the team at Neo Technology and everyone else who commented on or retweeted my last post. It feels like there’s a friendly community around Neo4j and – let’s be honest – that’s as important as the software itself.

Secondly, I have further posts planned – this one is on data modelling, then I’m aiming to look at APIs and drivers, followed by internals and performance. All will be posted to my Twitter feed as they are published. Throughout, I’ll likely refer to the e-book “Graph Databases”, available for free here. While it does focus heavily on Neo4j in parts, it also provides good general coverage of graph theory, graph databases and the wider NoSQL movement – all in only 200 pages.

[Image: cover of the “Graph Databases” e-book]

So, to get on-topic – one of the major claims the Neo4j team make in the book is that their product has greater “Domain Affinity” than relational databases. That is to say, thinking about the real-world problem and thinking about the database you’ll use to model it become much more like the same thing.

There are a couple of ways they go at this.  The first is to talk about “accidental complexity” in relational modelling – things that are necessarily part of the database but not the domain.  Think foreign keys and (especially) separate tables representing many-to-many relationships.  Working with relational technology every day, we may forget how counter-intuitive these techniques can be on first exposure.  We blur the line between relationships and properties, rather than treating the relationships as first-class concepts in their own right.

They also argue that performance considerations push us towards denormalisation of relational databases, which makes the data model less flexible in the face of change.  Given the challenging nature of relational schema changes in production, this can cause big problems.  Neo4j, conversely, prides itself not just on facilitating better data models, but on allowing them to change in a more agile, iterative way.

While the abstract concept of “accidental complexity” certainly interests me, the more straightforward way they explain it is that you can go straight from whiteboard sketches of your business domain to the “ASCII art” structure of the Cypher query language, without really having to rethink or refactor what you’ve already worked out. That, I thought, sounds like a challenge!

Here, then, is a small, quick-and-dirty domain model I drew up, representing some sort of generic retail data set:

[Figure: Neo4j shopping data model 1 – the sketched domain model]

I was then able to convert it into the following code:

// Nodes: customers, the store, and the city/county address hierarchy
CREATE (bob:Customer {name:"Bob"}), (jim:Customer {name:"Jim"}),
(leedsstore:Store {branch:"Leeds"}),
(york:City {name:"York"}), (manchester:City {name:"Manchester"}),
(yorkshire:County {name:"Yorkshire"}), (greatermanchester:County {name:"Greater Manchester"}),
// Relationships: where each customer shops and where the store is located
(bob)-[:SHOPS_AT]->(leedsstore), (jim)-[:SHOPS_AT]->(leedsstore),
(leedsstore)-[:LOCATED_IN]->(yorkshire),
// Relationships: where each customer lives, and which county each city is in
(bob)-[:LIVES_IN]->(york), (york)-[:LOCATED_IN]->(yorkshire),
(jim)-[:LIVES_IN]->(manchester), (manchester)-[:LOCATED_IN]->(greatermanchester);

The resulting graph displays in the Neo4j browser like so:

[Figure: Neo4j shopping data model 2 – the graph as displayed in the Neo4j browser]

I would have to agree that in the process of doing this, the challenge was to get the “real-world” model right, not to code it into Neo4j.  The step I often have to go through when working with SQL Server – of “this is how it works, now what are the required tables?” – was a non-issue – my only question was whether I had the right data in there for my business queries.

In fact, I noticed while looking back at the model that I could easily have put more detail into the address hierarchy; this is something that should be relatively easy to fix in Neo4j, and indeed they talk about “cross-domain” modelling where you can have (for example) a detailed set of geographical data nodes/ relationships which interact with other data elements as required.
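As a rough sketch of what such a fix might look like, the statement below slots a city in between the Leeds store and its county. The "Leeds" City node is my own assumption for illustration and is not part of the model drawn above.

```cypher
// Hypothetical refinement: place the Leeds store in a City node
// instead of linking it directly to the county.
MATCH (s:Store {branch: "Leeds"})-[old:LOCATED_IN]->(c:County {name: "Yorkshire"})
CREATE (leeds:City {name: "Leeds"}),
       (s)-[:LOCATED_IN]->(leeds),
       (leeds)-[:LOCATED_IN]->(c)
// Remove the original direct store-to-county link
DELETE old;
```

No schema migration is involved: the existing nodes stay exactly as they were, and only one relationship is swapped for a two-step path.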

Taking the model a bit further, suppose I want to look at customers who shop at the Leeds store but don’t live in the same county. Maybe it’s my only store in the North of England and has been doing well, so I want to open another branch and need ideas on where to locate it. This again results in some fairly intuitive Cypher code, which will return customer “Jim” and city “Manchester”:

MATCH (cust:Customer)-[:SHOPS_AT]->(:Store {branch:"Leeds"})-[:LOCATED_IN*1..4]->(c_str:County),
(cust)-[:LIVES_IN]->()-[:LOCATED_IN*0..3]->(city:City)-[:LOCATED_IN]->(c_cust:County)
WHERE c_str <> c_cust
RETURN cust, city;

Note the variable-length paths (e.g. “[:LOCATED_IN*1..4]”) – these mean that if I do later add more address details to the model, the query will keep working as long as all the connections between the store/customer and the county are of the same type and the number of hops stays within the bounds given. Variable-length paths often don’t perform especially well, however, so I’ve put a low upper bound on the variability and am only using them because I know they may be needed in this case.
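To make the notation concrete, here is a small sketch of what a bounded pattern stands for. A pattern such as [:LOCATED_IN*1..2] matches the same paths as the one-hop and two-hop fixed-length patterns taken together, which the query below spells out with UNION. The narrower 1..2 bound is chosen only to keep the illustration short.

```cypher
// One hop: the store is linked directly to its county
MATCH (:Store {branch: "Leeds"})-[:LOCATED_IN]->(c:County)
RETURN c.name
UNION
// Two hops: the store sits inside some intermediate node, which is in the county
MATCH (:Store {branch: "Leeds"})-[:LOCATED_IN]->()-[:LOCATED_IN]->(c:County)
RETURN c.name;
```

Against the sample data created earlier, only the one-hop branch finds anything, and the result is the single county name "Yorkshire". Each extra hop allowed by the upper bound adds another branch of this kind for the engine to explore, which is one way to see why a low bound is worth having.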

I can also write a more useful reporting-style query: a list of cities with the number of matching customers in each, sorted in descending order and limited to the top 10:

MATCH (cust:Customer)-[:SHOPS_AT]->(:Store {branch:"Leeds"})-[:LOCATED_IN*1..4]->(c_str:County),
(cust)-[:LIVES_IN]->()-[:LOCATED_IN*0..3]->(city:City)-[:LOCATED_IN]->(c_cust:County)
WHERE c_str <> c_cust
WITH city, COUNT(*) AS CustomerCount ORDER BY CustomerCount DESC
RETURN city, CustomerCount LIMIT 10;

With the tiny example data set I’ve created, this simply tells me there is one matching customer in Manchester – but with realistic production data, it could give me actionable information pretty fast.

We can see here though that the query code is already getting slightly more complicated – to get a top 10 by customer count, I’m having to use the WITH clause, which pipes results through for further querying.  There are examples significantly more complex than this in the e-book (notably on page 132), and doubtless even more so in some production codebases.  While I love the steps Neo Technology are taking to create a language in which data modelling can be more intuitive, I’d be a bit wary of assuming Cypher queries can remain completely “whiteboard-friendly” under situations of increasing complexity.

As with SQL – which, in the main, is also a fairly intuitive, “plain-English” kind of language – a deeper understanding seems to be needed as detailed, accurate querying and good performance become more important. I’ll maybe look at this in more detail in my post on performance and internals, but there are definitely cases where your implementation data model needs to be somewhat guided by knowledge of the underlying platform. For example, if you have variations on the same relationship type (e.g. addresses for home and business), it’s more efficient to model separate -[:HOME_ADDRESS]-> and -[:BUSINESS_ADDRESS]-> relationships than to use a property on a single type, i.e. -[:ADDRESS {type:"Home"}]->, because of the way the underlying storage works. This is completely manageable (the book has more details – pages 65-66) but certainly needs taking into consideration, and does have knock-on effects on what a good data model needs to look like.
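The two address styles can be sketched side by side as follows. The Address label and its line1 property are hypothetical names of my own, added only for illustration; they are not part of the retail model above.

```cypher
// Hypothetical Address nodes attached to Bob with one relationship type per kind of address
MATCH (bob:Customer {name: "Bob"})
CREATE (bob)-[:HOME_ADDRESS]->(:Address {line1: "1 Example Street"}),
       (bob)-[:BUSINESS_ADDRESS]->(:Address {line1: "2 Example Road"});

// Typed relationship: the relationship type alone picks out the home address
MATCH (:Customer {name: "Bob"})-[:HOME_ADDRESS]->(a:Address)
RETURN a.line1;

// Property-based alternative: a single ADDRESS type, filtered on a relationship property.
// This matches nothing in the data created above; it is shown only for comparison.
MATCH (:Customer {name: "Bob"})-[:ADDRESS {type: "Home"}]->(a:Address)
RETURN a.line1;
```

Both query styles read naturally, so the choice between them is driven by the storage and performance considerations mentioned above rather than by readability.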

Neo4j themselves are also refreshingly up-front about the importance of avoiding “lossy” data models – see the e-mail example on pages 50-61.

Before I wrap up – just one more constructive comment around the otherwise smooth process of Neo4j modelling and querying.  As far as I’ve been able to tell, the database engine doesn’t have any real native support for dates, which leaves you with three options:

  • Integer datestamp properties (fast, but opaque at the database level, although they should be easily convertible by any client code)
  • ISO 8601 string representations as properties (clear, but bulkier and slower)
  • A domain of nodes representing a time hierarchy (somewhat complex to set up and long-winded to query but apparently does offer some modelling and/or performance benefits – see pages 70-71 of the e-book)
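The first two options can be sketched as below; the time-hierarchy option is left to the book. The Order label and its ref, placedOn and placedOnIso properties are hypothetical names of my own, and I am assuming a yyyymmdd layout for the integer datestamp (an epoch timestamp would work in the same way).

```cypher
// Hypothetical Order nodes carrying the same date in both representations
CREATE (:Order {ref: "A1", placedOn: 20140315, placedOnIso: "2014-03-15"}),
       (:Order {ref: "A2", placedOn: 20131201, placedOnIso: "2013-12-01"});

// Option 1: integer datestamp (yyyymmdd); a date range is a plain numeric comparison
MATCH (o:Order)
WHERE o.placedOn >= 20140101 AND o.placedOn < 20150101
RETURN o.ref;

// Option 2: ISO 8601 strings; values in the same format sort chronologically as text
MATCH (o:Order)
WHERE o.placedOnIso >= "2014-01-01" AND o.placedOnIso < "2015-01-01"
RETURN o.ref;
```

Both queries return only order "A1", the one placed in 2014. In either case the database sees just a number or a string, so anything beyond simple range filters, such as working out the day of the week, has to happen in client code.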

I don’t think the question of how to handle dates should put anyone off using Neo4j but – especially given the potential for graph modelling to be appreciated by analysts and business users as well as software specialists – it would be great if it could become even easier to use dates directly in Cypher at some point in the future.

Overall, Neo4j is clearly set up to exploit some very natural ways of modelling data, and to build upon the mature and well-understood field of Graph Theory.  I haven’t even mentioned the fascinating concept of “Triadic Closure” – which you can read about in Chapter 7 or see in this video by Neo’s Jim Webber – or the built-in support for shortest-path algorithms such as Dijkstra and A*.  As per my last post, I can only really sign off with a strong recommendation to check Neo4j out and a hope that – in Jim’s words – the roadmap for the future is “more awesomeness”.
