Graph Data Modelling (“Obligatory Neo4j” Part 2)

First of all – a quick “thank you” to the guys at Neo Technology and anyone else who commented on/ retweeted my last post.  It feels like there’s a friendly community around Neo4j and – let’s be honest – that’s as important as the software itself.

Secondly, I have further posts planned – this one is on data modelling, then I’m aiming to look at APIs and drivers, followed by internals and performance.  All will be posted to my Twitter feed as they are published.  Throughout, I’ll likely refer to the e-book “Graph Databases”, available for free here.  While it does focus heavily on Neo4j in parts, it also provides good general coverage of graph theory, graph databases and the wider NOSQL movement – all in only 200 pages.

cropped-graphdatabases_cover390x5121

So, to get on-topic – one of the major claims Neo4j make in the book is that their product has greater “Domain Affinity” than relational databases.  That is to say, it makes thinking about the real-world problem and thinking about the database you’ll use to model it much more like the same thing.

There are a couple of ways they go at this.  The first is to talk about “accidental complexity” in relational modelling – things that are necessarily part of the database but not the domain.  Think foreign keys and (especially) separate tables representing many-to-many relationships.  Working with relational technology every day, we may forget how counter-intuitive these techniques can be on first exposure.  We blur the line between relationships and properties, rather than treating the relationships as first-class concepts in their own right.

They also argue that performance considerations push us towards denormalisation of relational databases, which makes the data model less flexible in the face of change.  Given the challenging nature of relational schema changes in production, this can cause big problems.  Neo4j, conversely, prides itself not just on facilitating better data models, but on allowing them to change in a more agile, iterative way.

While the abstract concept of “accidental complexity” certainly interests me, the more straight-forward way they explain it is that you can go straight from white-board sketches of your business domain to the “ASCII art” structure of the Cypher query language, without really having to re-think or re-factor what you’ve already worked out.  That, I thought, sounds like a challenge!

Here, then, is a small, quick-and-dirty domain model I drew up, representing some sort of generic retail data set:

Neo4j shopping data model 1

I was then able to convert it into the following code:

CREATE (bob:Customer {name:”Bob”}), (jim:Customer {name:”Jim”}),
(leedsstore:Store {branch:”Leeds”}),
(york:City {name:”York”}), (manchester:City {name:”Manchester”}),
(yorkshire:County {name:”Yorkshire”}), (greatermanchester:County {name:”Greater Manchester”}),
(bob)-[:SHOPS_AT]->(leedsstore), (jim)-[:SHOPS_AT]->(leedsstore),
(leedsstore)-[:LOCATED_IN]->(yorkshire),
(bob)-[:LIVES_IN]->(york), (york)-[:LOCATED_IN]->(yorkshire),
(jim)-[:LIVES_IN]->(manchester), (manchester)-[:LOCATED_IN]->(greatermanchester);

The resulting graph displays in the Neo4j browser like so:

Neo4j shopping data model 2

I would have to agree that in the process of doing this, the challenge was to get the “real-world” model right, not to code it into Neo4j.  The step I often have to go through when working with SQL Server – of “this is how it works, now what are the required tables?” – was a non-issue – my only question was whether I had the right data in there for my business queries.

In fact, I noticed while looking back at the model that I could easily have put more detail into the address hierarchy; this is something that should be relatively easy to fix in Neo4j, and indeed they talk about “cross-domain” modelling where you can have (for example) a detailed set of geographical data nodes/ relationships which interact with other data elements as required.

Taking the model a bit further, suppose I want to look at customers who shop at the Leeds store but don’t live in the same county.  Maybe it’s my only store in the North of England and has been doing well, I want to open another branch and need ideas on where to locate it.  This again results in some fairly intuitive Cypher code, which will return customer “Jim” and city “Manchester”:

MATCH (cust:Customer)-[:SHOPS_AT]->(:Store{branch:”Leeds”})-[:LOCATED_IN*1..4]->(c_str:County),
(cust)-[:LIVES_IN]->()-[:LOCATED_IN*0..3]->(city:City)-[:LOCATED_IN]->(c_cust:County)
WHERE c_str <> c_cust
RETURN cust, city;

Note the variable-length paths (e.g. “[:LOCATED_IN*1..4]”) – these mean that if I do later add more address details to the model, the query will work fine as long as all the connections between the store/ customer and the county are of the same type.  Variable-length paths often don’t perform especially well, however, so I’ve put a low upper bound on the variability and am only using them because I know they may be needed in this case.

I can also do a more useful reporting-style query to produce a list of cities, with numbers of customers as an aggregate and used to give me a top 10 in descending order:

MATCH (cust:Customer)-[:SHOPS_AT]->(:Store{branch:”Leeds”})-[:LOCATED_IN*1..4]->(c_str:County),
(cust)-[:LIVES_IN]->()-[:LOCATED_IN*0..3]->(city:City)-[:LOCATED_IN]->(c_cust:County)
WHERE c_str <> c_cust
WITH city, COUNT(*) AS CustomerCount ORDER BY COUNT(*) DESC
RETURN city, CustomerCount LIMIT 10;

With the tiny example data set I’ve created, this simply tells me there is one matching customer in Manchester – but with realistic production data, it could give me actionable information pretty fast.

We can see here though that the query code is already getting slightly more complicated – to get a top 10 by customer count, I’m having to use the WITH clause, which pipes results through for further querying.  There are examples significantly more complex than this in the e-book (notably on page 132), and doubtless even more so in some production codebases.  While I love the steps Neo Technology are taking to create a language in which data modelling can be more intuitive, I’d be a bit wary of assuming Cypher queries can remain completely “whiteboard-friendly” under situations of increasing complexity.

As with SQL – which, in the main, is also a fairly intuitive, “plain-English” kind of language – a deeper understanding seems to be needed as detailed, accurate querying and good performance become more important.  I’ll maybe look at this in more detail in my post on performance and internals, but there are definitely cases where your implementation data model needs to be somewhat guided by knowledge of the underlying platform.  For example, if you have variations on the same relationship type (i.e. addresses for home and business), it’s more efficient to model separate -[:HOME_ADDRESS]-> and -[:BUSINESS_ADDRESS]-> relationships than to use properties – i.e. -[:ADDRESS {type:”Home”}]->, because of the way the underlying storage works.  This is completely manageable (the book has more details – pages 65-66) but certainly needs taking into consideration, and does have knock-on effects on what a good data model needs to look like.

Neo4j themselves are also refreshingly up-front about the importance of avoiding “lossy” data models – see the e-mail example on pages 50-61.

Before I wrap up – just one more constructive comment around the otherwise smooth process of Neo4j modelling and querying.  As far as I’ve been able to tell, the database engine doesn’t have any real native support for dates, which leaves you with three options:

  • Integer datestamp properties (fast performance but opaque at the database level, although should be easily convertible by any client code)
  • ISO8601 string representations as properties (clear but bulkier and perform less well)
  • A domain of nodes representing a time hierarchy (somewhat complex to set up and long-winded to query but apparently does offer some modelling and/or performance benefits – see pages 70-71 of the e-book)

I don’t think the question of how to handle dates should put anyone off using Neo4j but – especially given the potential for graph modelling to be appreciated by analysts and business users as well as software specialists – it would be great if it could become even easier to use dates directly in Cypher at some point in the future.

Overall, Neo4j is clearly set up to exploit some very natural ways of modelling data, and to build upon the mature and well-understood field of Graph Theory.  I haven’t even mentioned the fascinating concept of “Triadic Closure” – which you can read about in Chapter 7 or see in this video by Neo’s Jim Webber – or the built-in support for shortest-path algorithms such as Dijkstra and A*.  As per my last post, I can only really sign off with a strong recommendation to check Neo4j out and a hope that – in Jim’s words – the roadmap for the future is “more awesomeness”.

The Obligatory “I’m Trying Neo4j” Post (Part 1)

I remember reading someone (more-or-less-jokingly) suggesting that all blog software should come with a default first post entitled “I’m Trying Linux”.  That was a few years ago though, and while open-source operating systems are still a topic of interest, the big buzz amongst cutting-edge nerds in the last couple of years has of course been around NoSQL databases.

There are numerous relatively straightforward document/ key-value store options – which to simplify grossly, can be said to do the same sort of thing as relational databases, only with greater scalability, less impedance with object-oriented programming models and typically less strict guarantees on ACID properties.  But then there is also the one area which seems to really grab the attention of the old-school database crowd, because it offers not just to solve some of the more obvious issues with the relational model when applied to the web and OOP, but also to open up new possibilities in terms of data modelling.  That is to say, graph databases.

The clear leader in terms of mentions and mind-share in the graph space at the moment seems to be Neo4j.  Just in the last couple of days, I’ve had a couple of colleagues express an interest and noticed a post on the topic on one of the sites I read regularly, so now seemed like as good a time as any to finally install the free community edition and run it through its paces.

Just to be clear, I’m not going to provide an exhaustive survey of Neo4j or give instructions to get up and running – there’s a ton of stuff out there for that, a lot of which can be found simply by going to the Neo4j website.  Rather, I’m going to try and comment on what I think is interesting and/or useful about it as a tool. As per the title, I consider this very much the first part of many, so watch this space if you want to see more.

Suffice it to say, installation is extremely quick and easy and once running, you can connect by simply pointing your browser at localhost:7474, which gets you to the following:

Neo4j-Screenshot-1

See that tiny white bar near the top of the screen with the “$” sign to the left?  That’s actually a query window, and through it you can interface directly with Neo4j using their own declarative query language, Cypher.

The install comes with a set of training presentations (which you can also see the links to in the above screenshot), but having run through the basic examples I decided to do my own thing rather than dive into their “Movie Graph” code.  Just a matter of different learning styles I guess, but I find that pushing myself to think up something to program can often (not always) help to get a firmer grasp of a language’s core features faster.

I was watching the excellent “Margin Call”, which is set in a large Wall Street firm at the beginning of the 2008 financial crisis, but focused on a handful of key employees as information about the coming crisis makes its way up the organisational hierarchy.  Employee networks?  Political alliances?  That sounds basically like graph-node stuff, the kind of thing that you’d end up doing as some kind of bastardised adjacency list structure in any standard relational system – yep, that’ll do.

So first of all let’s create some employees, with a Cypher clause called (unsurprisingly) “CREATE”:

CREATE (Seth : Employee {name : ‘Seth Bregman’, title : ‘Junior Risk Analyst’}),
(Peter : Employee {name : ‘Peter Sullivan’, title : ‘Senior Risk Analyst’}),
(Eric : Employee {name : ‘Eric Dale’, title : ‘Head of Risk Management’}),
(Will : Employee {name : ‘Will Emerson’, title : ‘Head of Trading’}),
(Sam : Employee {name : ‘Sam Rogers’, title : ‘Head of Sales and Trading’}),
(Jared : Employee {name : ‘Jared Cohen’, title : ‘Head of Capital Markets’}),
(Sarah : Employee {name : ‘Sarah Robertson’, title : ‘Chief Risk Management Officer’}),
(John : Employee {name : ‘John Tuld’, title : ‘Chief Executive Officer’}),
(Seth)-[:REPORTS_TO]->(Eric),
(Peter)-[:REPORTS_TO]->(Eric),
(Eric)-[:REPORTS_TO]->(Will),
(Will)-[:REPORTS_TO]->(Sam),
(Sam)-[:REPORTS_TO]->(Jared),
(Jared)-[:REPORTS_TO]->(John),
(Sarah)-[:REPORTS_TO]->(John);

With some combination of intuition and RTFM, it should be pretty clear what’s happening there – we’re creating a node variable for each employee, with some key-value pairs to set up properties (name, title), and then adding a “reports to” relationship where appropriate.  Note at this point that relationships can also have properties and be assigned to variables using similar syntax as for nodes, although I haven’t taken advantage of it at this point.  Now to fire it off:

Neo4j-Screenshot-2

So okay – that’s definitely done something – but I thought Neo4j was supposed to give me a visualisation of my graph? Looks like I need to use the MATCH clause to find nodes (and note the use of RETURN – we have to do something with what the MATCH gives us, and in this case we want to get data back):

MATCH (ee:Employee)
RETURN ee;

Neo4j-Screenshot-3

Ooookay…. that’s a bit more like it. The nodes kind of bounce around until you drag them to where you want them on the screen and clicking on each one brings up a property list.  You can see that there’s also a set of “Person” nodes defined in the legend, which exist in the database because of some previous training code I’ve run.  If I’d run “MATCH (ee) RETURN ee;” then they would have shown up as a separate network, but because I specified that I want nodes of type “Employee”, I only get the relevant entries.

Basically that’s pretty cool, although I suspect bringing back a much larger network would start to make it less useful.  In fact, that seems like a good prompt to talk about filtering the graph, which can be done by using (at least) a couple of different techniques around the MATCH clause. Say we want to see employees who report to the CEO only, we can do it in a kind of SQL style with a WHERE clause:

MATCH (LineReports:Employee)-[:REPORTS_TO]->(boss:Employee)
WHERE boss.name = ‘John Tuld’
RETURN LineReports;

or in a way that feels a bit truer to the pattern-matching aspect of Cypher, like so:

MATCH (LineReports:Employee)-[:REPORTS_TO]->(boss:Employee{name:’John Tuld’})
RETURN LineReports;

This gives us the same kind of visualisation (but thinned out) and in all cases there seems to be the option to view as a data table and/or export as JSON:

Neo4j-Screenshot-4

Neo4j-Screenshot-5-edited

Trust me when I say I’ve only scratched the surface of what Cypher can do there, even in terms of what I’ve managed to experiment with in the last 48 hours.  However, this post seems to be getting quite long enough already, and I hope you can already get the sense that I do, that this is potentially quite a nice language to work in.  It’s got a kind of cleanness to it, and a kind of functional-meets-imperative-meets-query feel that reminds me somewhat of LINQ (especially in the context of LINQPad). I’m definitely keen to keep trying out different features and to see how well it deals with more complex queries and a slightly greater volume of data.

One thing to notice from a SQL Server point of view (my home turf) is that there does seem to be a similar concept of cached execution plans, with the similarities extending as far as the recommended use of parameterised queries to maximise plan reuse – and it looks like there’s also a command-line profiler tool available to wring the best performance out of your queries.  More details of some of that here.  So as with a lot else in Neo4j, I would say there’s enough familiar hooks to get you started, and enough new functionality and potential to keep things interesting.

Overall – I would definitely recommend taking a look at what Neo4j can offer.  Getting a grasp of the basics doesn’t require a massive investment of time and the tools are pretty decent, although doubtless they will mature with time, and there’s already a good range of client libraries/ drivers available.  As and when I get a chance, I’ll probably take a look at some of these and also at the crucial question of how well it handles fully automated data imports from other sources.