Performance in Neo4j (“Obligatory Neo4j” Part 4)

This is my fourth post on the popular open-source graph database Neo4j, and I’ll be taking a look at some ideas about how performance can best be managed.  To see the full series, you can check out the Neo4j tag on this blog - alexdgarland.com/category/neo4j/.

It’s been about a month since my last post.  In the meantime I’ve been busy, playing around with C/C++, Hadoop & Linux – but it’s definitely time to get back to looking at Neo4j.   This is the final part of my “Obligatory Neo4j” series – although almost certainly not my last word on the topic of Neo4j – and we’re now at the point where we should be able to picture some of the functional aspects of a system built on graph technology.  The question is, can we make it run quickly and efficiently?

As ever, there are already some good resources available from Neo Technology.  This includes not only the Graph Databases book, but also this video on Cypher query optimisation from Wes Freeman and Neo’s Mark Needham:

One thing that comes out of that video very clearly for me is that the query optimiser for Neo4j – while already useful – is going through a process of evolution that the equivalent functionality in relational databases such as SQL Server and Oracle more-or-less completed years ago.  When it comes to things like the ordering of WHERE clauses and the placement of filters in the MATCH or WHERE parts of the query, developers using Neo4j need to handle tuning in quite a manual way.

This differs from, say, SQL Server, where the cost-based optimiser will take a declarative query, evaluate various logically equivalent ways of running it and typically reach a “good enough” execution plan without detailed input from the user.  The statement in the video that “soon Cypher will do a lot of this for you” suggests that Neo4j will be heading in the same direction in future versions.

I’d tend to look on the bright side regarding this stage in Neo4j’s development.  It’s an interesting time to start looking at the product, and most performance tuning cases should not be too difficult even now.  As the optimiser inevitably becomes more capable, tuning will only get easier, and anyone who has manually optimised queries will be left with a deeper insight into how the database engine works.

At this point, I’ll note a few principles that are very familiar from working with SQL Server:

  • Queries can typically be made to perform better by reusing execution plans (including parameterisation of variables in submitted Cypher calls, which also protects against query injection).
  • More RAM is better – while it’s not a silver bullet, adding more memory to a server will almost always help performance, allowing better caching of plans and more head-room for processing and storing results.
  • Queries that perform scans of entire data structures (relational tables, graphs, labels etc.) are generally to be avoided if the results you want don’t include the entire data set.  Some form of selective seek or lookup is likely to be quicker and less resource-intensive.
  • These selective actions can be made easier by following a smart indexing strategy.
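To make the first of those principles concrete, here is a minimal Python sketch of why parameterised queries help with plan reuse.  The ToyPlanCache class is purely illustrative and is not how Neo4j implements its cache; it simply assumes that plans are cached against the exact query text.  The {name} placeholder is the Cypher parameter syntax of this era (later versions use $name), and the Customer label is borrowed from the shopping example elsewhere in this series.

```python
class ToyPlanCache:
    """Illustrative only: caches one 'plan' per distinct query string."""

    def __init__(self):
        self.plans = {}
        self.compilations = 0

    def run(self, query, params=None):
        # A plan is "compiled" only when the exact query text is new.
        if query not in self.plans:
            self.compilations += 1
            self.plans[query] = "plan #%d" % self.compilations
        return self.plans[query]


names = ["Bob", "Jim", "Ann"]

# Literal values baked into the text: every query string is different.
literal_cache = ToyPlanCache()
for name in names:
    literal_cache.run("MATCH (c:Customer {name:'%s'}) RETURN c" % name)

# One parameterised text, values supplied separately: one plan, reused.
param_cache = ToyPlanCache()
for name in names:
    param_cache.run("MATCH (c:Customer {name:{name}}) RETURN c", {"name": name})

print(literal_cache.compilations, param_cache.compilations)
```

Keeping values out of the query text is also what gives the protection against injection, since user input never becomes part of the Cypher itself.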

Diving deeper though, the differing nature of graph storage starts to come into play.  There are specific characteristics to the work that Neo4j has to do to answer certain queries, which can be advantageous in some circumstances.

The key differences come from a divergence in the way data is structured on disk and in memory.  A relational database is built around tables (“relations”) as fundamental structures and, while indexes may help optimise particular paths to access the data, they are still typically “global” across an entire table.  Relationships between tables are navigated via join columns, which are typically included in one or more of these global indexes.  As the size of the tables grows, the time and cost to navigate them is likely to increase significantly, even with a good indexing strategy in place.

Neo4j takes a different approach, structuring data in such a way that relationships between nodes are completely first-class entities and aiming to achieve “index-free adjacency”.   The implementation basically takes the form of pointer arithmetic over contiguous arrays of fixed-size primary records, allowing constant-time random access to nodes and relationships and then linking to other resources such as properties as required.  A lot more detail on this model and the advantages it can have is available in Chapter 6 of the Graph Databases book.

Setting aside the fine detail, it all leads towards one thing.  That is to say, once one has found an entry point to the graph, the cost of accessing related nodes is proportional not to the total size of the graph database (which can be very large) but to the size of the area of the graph containing the data you’re actually interested in.  The potential advantages of this approach should be obvious.
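As a rough illustration of the idea, the Python sketch below stores nodes and relationships as fixed-size records and finds a node’s neighbours by following record ids.  The record layout is a made-up simplification, not Neo4j’s real store format; the point is only that each lookup is an offset calculation and that the work done depends on the node’s own relationships, not on the size of the whole store.

```python
import struct

# Hypothetical layout, NOT Neo4j's real record format:
#   node record         = (first_rel_id,)
#   relationship record = (start_node, end_node, next_rel_for_start_node)
# An id of -1 means "none".
NODE = struct.Struct("<i")
REL = struct.Struct("<iii")


def read(store, fmt, record_id):
    # Constant-time lookup: the offset is simply id * record size.
    return fmt.unpack_from(store, record_id * fmt.size)


def neighbours(node_store, rel_store, node_id):
    """Follow the node's relationship chain; cost depends on its degree only."""
    (rel_id,) = read(node_store, NODE, node_id)
    found = []
    while rel_id != -1:
        _start, end, next_rel = read(rel_store, REL, rel_id)
        found.append(end)
        rel_id = next_rel
    return found


# Node 0 has two outgoing relationships (to nodes 1 and 2); nodes 1 and 2 have none.
node_store = b"".join(NODE.pack(first) for first in (0, -1, -1))
rel_store = b"".join(REL.pack(*rec) for rec in ((0, 1, 1), (0, 2, -1)))

print(neighbours(node_store, rel_store, 0))
```

Adding a million unrelated nodes to the store would not change the number of records read for node 0, which is the property the paragraph above describes.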

Nothing comes for free though.  To reliably get this kind of fast, highly scalable performance, it is necessary to design and query your Neo4j database in a way which works well with the underlying graph model and the associated data structures.

The resources provided by Neo Technology give some good hints on how to achieve this, but the following are definitely worth considering:

  • Make sure you understand the limits you’re able to place on the parts of the graph that will be used and searched.
  • Make sure you have explicitly created the index or indexes you need to support this restriction.
  • Check that executed Cypher queries actually behave in the way you expect; it sounds like the “profile” command could be helpful with this.
  • Understand how different parts of your query work in isolation and consider piping only the required results between them using the WITH keyword.
  • Minimise the use of variable-length paths in pattern matches – e.g. “(node1)-[:RELATIONSHIP*1..5]-(node2)”.  While these make queries a lot more flexible and robust in the face of change, the database engine is forced to check more possible paths and hence do more work in executing the query.
  • If you want to have a variant of a relationship – e.g. home vs business address – there are two ways to do it:
    • adding a property to each relationship – “-[:ADDRESS {type:"Home"}]-”
    • adding two different relationship types – “-[:HOME_ADDRESS]-” and “-[:WORK_ADDRESS]-”

Because the primary relationship type is stored directly in the structure representing the graph, whereas properties are stored at one remove, the second option will typically perform better.

  • Add additional relationships.  Complex and multiple interactions between nodes are much easier to add in Neo4j than in a relational database, so in the above example we could theoretically have both sets of relationships, as long as we’re sure they’re useful and we don’t mind the extra overhead of coding and maintenance.

As a specific performance optimisation, Neo Technology actually recommend adding direct relationships between nodes which duplicate the information implicit in less direct relationships.  For example, if two people both worked at the same company, rather than having to go via the company node every time we can add a direct relationship “-[:WORKED_WITH]-”.  These secondary sets of  relationships can be maintained asynchronously (as a frequently scheduled update job) in order to minimise the impact on initial write times when a node is created.
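A scheduled job of that kind could be sketched as follows.  This is a minimal Python illustration under stated assumptions: the input pairs, the Person and Company labels and the WORKED_AT relationship type are all hypothetical, and the Cypher string is an untested example of what such a job might submit, not something taken from the Neo Technology material.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input: (person, company) pairs, as if read from WORKED_AT relationships.
worked_at = [("Ann", "Acme"), ("Bob", "Acme"), ("Cat", "Acme"), ("Dan", "Initech")]


def derive_worked_with(pairs):
    """Return the set of unordered person pairs who share at least one company."""
    staff = defaultdict(set)
    for person, company in pairs:
        staff[company].add(person)
    shortcuts = set()
    for people in staff.values():
        # sorted() gives each pair a stable order, so (a, b) is never stored twice.
        shortcuts.update(combinations(sorted(people), 2))
    return shortcuts


# A Cypher statement a scheduled job might run to the same effect
# (labels and relationship types are assumptions for illustration).
REFRESH_WORKED_WITH = (
    "MATCH (a:Person)-[:WORKED_AT]->(:Company)<-[:WORKED_AT]-(b:Person) "
    "WHERE id(a) < id(b) "
    "MERGE (a)-[:WORKED_WITH]-(b)"
)

print(sorted(derive_worked_with(worked_at)))
```

Note how quickly the number of shortcut relationships grows with company size (three people give three pairs, a hundred give 4,950), which is part of the maintenance overhead to weigh up.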

Beyond this list for working at a query level, there are other, lower-level options for performance improvement.  As mentioned in my last post, using the Traversal, Core or Kernel API rather than Cypher should allow progressively more focussed tuning of specific processes.

There are also some architectural and hardware-based options.  The commercially licensed version of Neo4j offers the possibility of master-slave replication for horizontal read-scaling, and a technique called cache-sharding can be used which increases the chance of required data being queued up in main memory.  There are more details on that here.

What has to be remembered when scaling out or considering performance more generally is that Neo4j is – in terms of NoSQL and the CAP Theorem – a well-performing but ultimately consistency-driven (roughly speaking, “CP”) system.  Read-scaling is one thing; high-end write scalability is an inherently hard problem which may better suit an availability-driven (“AP”) system using a key-value, document or column-oriented model.  These kinds of systems explicitly accept a looser (“eventual”) definition of consistency in return for the ability to run across a very large number of distributed servers, and make choices about how to deal with the complexity and uncertainty that that creates.  Where this kind of extreme distribution of data storage (combined with high availability) is not required, Neo4j offers many other benefits in terms of greater expressivity, reliability and relative ease of use – and in most cases it seems like it should be able to perform very well if appropriately managed and tuned.

Talking To The Graph (“Obligatory Neo4j” Part 3)

This is my third post on the popular open-source graph database Neo4j, and I’ll be taking a look at APIs.  To see the full series, you can check out the Neo4j tag on this blog - alexdgarland.com/category/neo4j/.

First of all, a quick reminder of our “set text”, the Graph Databases e-book from O’Reilly and Neo Technology.  Free, concise, full of useful info – what’s not to like?  Download it here.

To recap then… we’ve had a look at Neo4j and the Cypher query language.  We like what we see.  Expressive language, great possibilities for data modelling and a browser interface that draws us pretty pictures on demand:

[Images: Neo4j browser graph view; Neo4j shopping data model]

Awesome!  So all we need to do now is install Neo4j on a server somewhere and give our business users a big text file full of common Cypher queries.  That can’t possibly go wrong!

Sorry, what’s that?  We need to integrate the database with existing in-house systems built in some kind of statically typed, object-oriented language like Java or C#?  And the web team are saying they’ll need to connect from their back-end servers using something lightweight and dynamic like Ruby or Python.  Suddenly things are looking slightly more complicated…

Thankfully, this is not a problem as Neo4j offers a variety of ways to connect from other languages and systems.  The main route in is via the REST API, which (as you would expect) can be called from any language that can piece together the requisite lump of JSON and fire it across an HTTP connection.

There are actually at least a couple of different ways to structure REST calls, but the best way is typically going to be to send actual Cypher queries as part of the JSON payload.  This is because Cypher and a declarative approach to data interaction are powerful, well-supported and seen as a big part of the roadmap for future Neo4j development.  The queries sent can be parameterised for better performance and can be submitted either via the standard Cypher endpoint or via the Transactional endpoint, which allows an RDBMS-like atomic transaction to be scoped across multiple consecutive HTTP requests.
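To give a feel for what “piecing together the requisite lump of JSON” involves, here is a small Python sketch that builds the request body for a single parameterised statement.  The endpoint path is my assumption of the transactional “commit” URL on a 2.x server, and the query and Customer label are just examples; nothing is actually sent over the network here.

```python
import json

# Assumption: the transactional endpoint path on a Neo4j 2.x server.
TRANSACTION_COMMIT_URL = "http://localhost:7474/db/data/transaction/commit"


def build_payload(statement, parameters):
    """Wrap one parameterised Cypher statement in a JSON request body."""
    return json.dumps(
        {"statements": [{"statement": statement, "parameters": parameters}]}
    )


body = build_payload(
    "MATCH (c:Customer {name:{name}}) RETURN c",
    {"name": "Bob"},
)
print(body)
# Sending it would be an HTTP POST of `body` to TRANSACTION_COMMIT_URL with a
# Content-Type of application/json (for example via urllib.request).
```

Because the statements value is a list, several statements can travel in one request, which fits the idea of scoping a transaction across more than one piece of work.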

Naturally, users of various languages have decided that they want to consolidate their use of the REST API into libraries, and many of these are freely available, with links collected on the Neo4j website.  A lot of common languages and frameworks – Ruby/Rails, Python/Django, .NET, PHP – get at least a couple of options each, and there’s even one for Haskell if you’re feeling particularly masochistic.


I was actually planning to look quickly at options for a couple of different languages that I’m somewhat familiar with – a .NET one using C#, and one of the two available for Python.  The idea was that I could do different graph models which would have some relevance to the way the languages are perceived – something “Enterprisey” for .NET, and for Python probably some terrible joke about Spam.

It turned out though that once I got started coding in C#, I spent longer than I expected messing around with associating nodes with classes and trying to work out how I might use it in a real-world situation.  Always the way!  So the Python will have to wait, and in the meantime, courtesy of my new GitHub account, here is some code for an ** extremely ** basic implementation of an “Enterprise Widget Manager”, persisted in Neo4j.

So first of all, the library itself.  Simply because it was nearer the top of the list, I chose Neo4jClient over CypherNet for my first attempt.  Neo4jClient is the product of Tatham Oddie and Romiko Derbynew of .NET consultancy Readify.  It installs straightforwardly via NuGet (always a bonus) and should work with any other language that can run on the .NET CLR – it has certainly been tried out with F#.

Overall, it seemed to work out pretty well for me.  I managed to come out with some code which – while not especially elegant or sophisticated in terms of my own contribution – did run the Cypher queries I wanted to against my install of Neo4j on localhost:7474.  The basic idea was to create a simple parts diagram (components used in multiple assemblies) – the sort of thing which, when scaled up, probably does lend itself better to graph modelling than to a relational database.

[Image: the WidgetManager graph in the Neo4j browser]

The program runs as a simple console app, creating the graph pictured above (obviously I had to connect through the browser to get the visual version), then querying for and displaying a list of components used to make both “Widgets”:

[Image: console output of the Widget Manager app]

I found the documentation was pretty helpful for getting up and running fast – it makes it very clear that you should read everything through properly before getting into any serious coding, but also gave me enough examples and hints to make a quick start on trying it out.  I’m sure a lot of the other third-party connection options will also be really good and may have a look at them in future.

Going Deeper

Some of you who are really observant may have noticed that, apart from one throwaway reference, I haven’t mentioned the possibility of connecting from Java.  This isn’t because it isn’t possible – quite the opposite.  Neo4j is implemented in Java, and in fact was originally designed as an embedded database for applications running on the JVM.  So if you want to get closer to the underlying database engine – whether for performance reasons or to tweak some functional aspect of the system that Cypher and the default REST API don’t give you access to – you’ll need to be comfortable working with the Java language, or at least something like Clojure which compiles to JVM Bytecode and can communicate directly with Neo4j in embedded mode.

There are three additional Neo4j APIs that Java provides access to:

  • The Traversal API
  • The Core API
  • The Kernel API

Moving down that list, each API gets further from the expressive, declarative modelling approach exemplified by Cypher, but in return allows you to work closer to the metal and permits a greater degree of fine-tuning and performance tweaking.

You can use these APIs when running the database in embedded mode, or there’s also the option to write custom “server extensions” to the REST API, using the Java APIs to redefine behaviour in response to specific REST calls.

The final thing you can do with Java is hack the code base of Neo4j itself.  It’s open source and Neo’s own Max De Marzi provides a great example of how to take advantage of that here.

Now With GitHub + Pointers

I’ve just set up an account on GitHub – not much to see on there other than one very basic C source file, from one of the exercises in Chapter 1 of Kernighan & Ritchie – but watch this space.

Why GitHub?

Pretty normal reasons I guess:

  • The amount of (strictly non-work-related) code snippets I’ve been emailing back and forth between home and work was getting a bit silly.
  • If I ever do put anything decent on there it acts as a portfolio of sorts, and it’s nice to be able to discuss code by linking to fully versioned examples.
  • I use TFS at work and it seems like a good idea to see what else is out there.  I’m by no means a Microsoft hater but it’s not hard to suppose they don’t get everything right.

Why C/C++?

That’s a slightly more difficult one, or at least more a matter of personal preference.  I of course have a ton of cool languages and technologies I want to get to grips with, so with the modern trend being towards higher-level, more expressive idioms, why am I starting to learn a language where you don’t even get a boolean data type out of the box?

Well, to keep it short and sweet, I really like to understand how things work in detail.  I have to admit to being somewhat influenced by Joel Spolsky’s vintage posts on leaky abstractions and the perils of JavaSchools – macho nerd-elitism aside, a vast amount of the software we use today is in some way built on top of C/C++.  Relational databases, operating systems, frameworks and VMs, IDEs – all the things that are really fundamental to what we do as programmers tend to need at least some access to the low-level power and control that direct allocation of bytes provides.

Just using these tools day-to-day, I’m sure I can benefit from understanding their implementation a little better, and even in today’s declarative, cheap-hardware world, I suspect there are times when JIT-compiled or interpreted languages just aren’t going to cut it.

And even if there aren’t, it’s a good intellectual challenge, which is kind of what we’re all after anyway, isn’t it?

Graph Data Modelling (“Obligatory Neo4j” Part 2)

First of all – a quick “thank you” to the guys at Neo Technology and anyone else who commented on/ retweeted my last post.  It feels like there’s a friendly community around Neo4j and – let’s be honest – that’s as important as the software itself.

Secondly, I have further posts planned – this one is on data modelling, then I’m aiming to look at APIs and drivers, followed by internals and performance.  All will be posted to my Twitter feed as they are published.  Throughout, I’ll likely refer to the e-book “Graph Databases”, available for free here.  While it does focus heavily on Neo4j in parts, it also provides good general coverage of graph theory, graph databases and the wider NOSQL movement – all in only 200 pages.

[Image: cover of the Graph Databases e-book]

So, to get on-topic – one of the major claims Neo4j make in the book is that their product has greater “Domain Affinity” than relational databases.  That is to say, it makes thinking about the real-world problem and thinking about the database you’ll use to model it much more like the same thing.

There are a couple of ways they go at this.  The first is to talk about “accidental complexity” in relational modelling – things that are necessarily part of the database but not the domain.  Think foreign keys and (especially) separate tables representing many-to-many relationships.  Working with relational technology every day, we may forget how counter-intuitive these techniques can be on first exposure.  We blur the line between relationships and properties, rather than treating the relationships as first-class concepts in their own right.

They also argue that performance considerations push us towards denormalisation of relational databases, which makes the data model less flexible in the face of change.  Given the challenging nature of relational schema changes in production, this can cause big problems.  Neo4j, conversely, prides itself not just on facilitating better data models, but on allowing them to change in a more agile, iterative way.

While the abstract concept of “accidental complexity” certainly interests me, the more straight-forward way they explain it is that you can go straight from white-board sketches of your business domain to the “ASCII art” structure of the Cypher query language, without really having to re-think or re-factor what you’ve already worked out.  That, I thought, sounds like a challenge!

Here, then, is a small, quick-and-dirty domain model I drew up, representing some sort of generic retail data set:

[Image: sketch of the shopping data model]

I was then able to convert it into the following code:

CREATE (bob:Customer {name:"Bob"}), (jim:Customer {name:"Jim"}),
(leedsstore:Store {branch:"Leeds"}),
(york:City {name:"York"}), (manchester:City {name:"Manchester"}),
(yorkshire:County {name:"Yorkshire"}), (greatermanchester:County {name:"Greater Manchester"}),
(bob)-[:SHOPS_AT]->(leedsstore), (jim)-[:SHOPS_AT]->(leedsstore),
(leedsstore)-[:LOCATED_IN]->(yorkshire),
(bob)-[:LIVES_IN]->(york), (york)-[:LOCATED_IN]->(yorkshire),
(jim)-[:LIVES_IN]->(manchester), (manchester)-[:LOCATED_IN]->(greatermanchester);

The resulting graph displays in the Neo4j browser like so:

[Image: the shopping data model in the Neo4j browser]

I would have to agree that in the process of doing this, the challenge was to get the “real-world” model right, not to code it into Neo4j.  The step I often have to go through when working with SQL Server – of “this is how it works, now what are the required tables?” – was a non-issue – my only question was whether I had the right data in there for my business queries.

In fact, I noticed while looking back at the model that I could easily have put more detail into the address hierarchy; this is something that should be relatively easy to fix in Neo4j, and indeed they talk about “cross-domain” modelling where you can have (for example) a detailed set of geographical data nodes/ relationships which interact with other data elements as required.

Taking the model a bit further, suppose I want to look at customers who shop at the Leeds store but don’t live in the same county.  Maybe it’s my only store in the North of England and has been doing well, I want to open another branch and need ideas on where to locate it.  This again results in some fairly intuitive Cypher code, which will return customer “Jim” and city “Manchester”:

MATCH (cust:Customer)-[:SHOPS_AT]->(:Store {branch:"Leeds"})-[:LOCATED_IN*1..4]->(c_str:County),
(cust)-[:LIVES_IN]->()-[:LOCATED_IN*0..3]->(city:City)-[:LOCATED_IN]->(c_cust:County)
WHERE c_str <> c_cust
RETURN cust, city;

Note the variable-length paths (e.g. “[:LOCATED_IN*1..4]”) - these mean that if I do later add more address details to the model, the query will work fine as long as all the connections between the store/ customer and the county are of the same type.  Variable-length paths often don’t perform especially well, however, so I’ve put a low upper bound on the variability and am only using them because I know they may be needed in this case.
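To show what the engine is being asked to do, here is a rough Python re-enactment of that query over the same tiny data set.  It is only a sketch: the reach function is my own stand-in for a bounded variable-length match, it ignores Cypher’s rule about not reusing a relationship within a path (harmless here, as the data has no cycles), and it says nothing about how Neo4j actually plans the query.

```python
# The example data set, as (start, relationship type, end) triples.
EDGES = [
    ("bob", "SHOPS_AT", "leedsstore"), ("jim", "SHOPS_AT", "leedsstore"),
    ("leedsstore", "LOCATED_IN", "yorkshire"),
    ("bob", "LIVES_IN", "york"), ("york", "LOCATED_IN", "yorkshire"),
    ("jim", "LIVES_IN", "manchester"),
    ("manchester", "LOCATED_IN", "greatermanchester"),
]
CITIES = {"york", "manchester"}
COUNTIES = {"yorkshire", "greatermanchester"}


def step(node, rel_type):
    """Nodes one hop away from `node` along relationships of `rel_type`."""
    return [end for start, rel, end in EDGES if start == node and rel == rel_type]


def reach(start, rel_type, min_hops, max_hops):
    """Nodes reachable via min_hops..max_hops relationships of one type."""
    results, frontier = [], [start]
    for hops in range(max_hops + 1):
        if hops >= min_hops:
            results.extend(frontier)
        # Each extra permitted hop means another round of expansion work.
        frontier = [nxt for node in frontier for nxt in step(node, rel_type)]
    return results


def out_of_county_customers(store):
    store_counties = {n for n in reach(store, "LOCATED_IN", 1, 4) if n in COUNTIES}
    matches = []
    for cust in [s for s, rel, e in EDGES if rel == "SHOPS_AT" and e == store]:
        for home in step(cust, "LIVES_IN"):
            for city in reach(home, "LOCATED_IN", 0, 3):
                if city not in CITIES:
                    continue
                for county in step(city, "LOCATED_IN"):
                    if county in COUNTIES and county not in store_counties:
                        matches.append((cust, city))
    return matches


print(out_of_county_customers("leedsstore"))
```

The loop inside reach runs once per permitted hop, so raising the upper bound directly increases the amount of expansion attempted, which is the reason for keeping the bounds tight.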

I can also do a more useful reporting-style query to produce a list of cities, with numbers of customers as an aggregate and used to give me a top 10 in descending order:

MATCH (cust:Customer)-[:SHOPS_AT]->(:Store {branch:"Leeds"})-[:LOCATED_IN*1..4]->(c_str:County),
(cust)-[:LIVES_IN]->()-[:LOCATED_IN*0..3]->(city:City)-[:LOCATED_IN]->(c_cust:County)
WHERE c_str <> c_cust
WITH city, COUNT(*) AS CustomerCount ORDER BY CustomerCount DESC
RETURN city, CustomerCount LIMIT 10;

With the tiny example data set I’ve created, this simply tells me there is one matching customer in Manchester – but with realistic production data, it could give me actionable information pretty fast.
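The aggregation step on its own is easy to picture outside the database.  The Python sketch below does the equivalent of the count, descending sort and top-N limit over a handful of made-up rows; the customer and city names are hypothetical stand-ins for whatever the MATCH and WHERE stages would return on real data.

```python
from collections import Counter

# Hypothetical (customer, city) rows, standing in for the output of the
# MATCH ... WHERE stage of the query.
rows = [("jim", "Manchester"), ("amy", "Manchester"), ("raj", "Liverpool")]


def top_cities(matched_rows, limit=10):
    """Count customers per city, then keep the `limit` largest counts."""
    counts = Counter(city for _customer, city in matched_rows)
    # most_common sorts by count descending, like ORDER BY ... DESC LIMIT n.
    return counts.most_common(limit)


print(top_cities(rows))
```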

We can see here though that the query code is already getting slightly more complicated – to get a top 10 by customer count, I’m having to use the WITH clause, which pipes results through for further querying.  There are examples significantly more complex than this in the e-book (notably on page 132), and doubtless even more so in some production codebases.  While I love the steps Neo Technology are taking to create a language in which data modelling can be more intuitive, I’d be a bit wary of assuming Cypher queries can remain completely “whiteboard-friendly” under situations of increasing complexity.

As with SQL – which, in the main, is also a fairly intuitive, “plain-English” kind of language – a deeper understanding seems to be needed as detailed, accurate querying and good performance become more important.  I’ll maybe look at this in more detail in my post on performance and internals, but there are definitely cases where your implementation data model needs to be somewhat guided by knowledge of the underlying platform.  For example, if you have variations on the same relationship type (e.g. addresses for home and business), it’s more efficient to model separate -[:HOME_ADDRESS]-> and -[:BUSINESS_ADDRESS]-> relationships than to use properties – i.e. -[:ADDRESS {type:"Home"}]->, because of the way the underlying storage works.  This is completely manageable (the book has more details – pages 65-66) but certainly needs taking into consideration, and does have knock-on effects on what a good data model needs to look like.

Neo4j themselves are also refreshingly up-front about the importance of avoiding “lossy” data models – see the e-mail example on pages 50-61.

Before I wrap up – just one more constructive comment around the otherwise smooth process of Neo4j modelling and querying.  As far as I’ve been able to tell, the database engine doesn’t have any real native support for dates, which leaves you with three options:

  • Integer datestamp properties (fast performance but opaque at the database level, although should be easily convertible by any client code)
  • ISO8601 string representations as properties (clear but bulkier and perform less well)
  • A domain of nodes representing a time hierarchy (somewhat complex to set up and long-winded to query but apparently does offer some modelling and/or performance benefits – see pages 70-71 of the e-book)
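For the first two options, the conversion work falls to the client code.  As a small Python sketch (assuming all times are held in UTC, and with function names of my own choosing), it might look like this:

```python
from datetime import datetime, timezone


def to_epoch_seconds(moment):
    """Option 1: an integer datestamp (seconds since 1970-01-01 UTC)."""
    return int(moment.timestamp())


def from_epoch_seconds(seconds):
    """Turn a stored integer datestamp back into a timezone-aware datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def to_iso8601(moment):
    """Option 2: an ISO 8601 string (assumes `moment` is already in UTC)."""
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


order_time = datetime(2014, 3, 1, 12, 30, 0, tzinfo=timezone.utc)
stamp = to_epoch_seconds(order_time)
text = to_iso8601(order_time)
print(stamp, text)
```

Either value can be stored as an ordinary node or relationship property; the integer form compares numerically, while strings in this fixed UTC format also sort correctly as plain text.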

I don’t think the question of how to handle dates should put anyone off using Neo4j but – especially given the potential for graph modelling to be appreciated by analysts and business users as well as software specialists – it would be great if it could become even easier to use dates directly in Cypher at some point in the future.

Overall, Neo4j is clearly set up to exploit some very natural ways of modelling data, and to build upon the mature and well-understood field of Graph Theory.  I haven’t even mentioned the fascinating concept of “Triadic Closure” – which you can read about in Chapter 7 or see in this video by Neo’s Jim Webber - or the built-in support for shortest-path algorithms such as Dijkstra and A*.  As per my last post, I can only really sign off with a strong recommendation to check Neo4j out and a hope that – in Jim’s words – the roadmap for the future is “more awesomeness”.

The Obligatory “I’m Trying Neo4j” Post (Part 1)

I remember reading someone (more-or-less-jokingly) suggesting that all blog software should come with a default first post entitled “I’m Trying Linux”.  That was a few years ago though, and while open-source operating systems are still a topic of interest, the big buzz amongst cutting-edge nerds in the last couple of years has of course been around NoSQL databases.

There are numerous relatively straightforward document/ key-value store options – which, to simplify grossly, can be said to do the same sort of thing as relational databases, only with greater scalability, less impedance mismatch with object-oriented programming models and typically less strict guarantees on ACID properties.  But then there is also the one area which seems to really grab the attention of the old-school database crowd, because it offers not just to solve some of the more obvious issues with the relational model when applied to the web and OOP, but also to open up new possibilities in terms of data modelling.  That is to say, graph databases.

The clear leader in terms of mentions and mind-share in the graph space at the moment seems to be Neo4j.  Just in the last couple of days, I’ve had a couple of colleagues express an interest and noticed a post on the topic on one of the sites I read regularly, so now seemed like as good a time as any to finally install the free community edition and run it through its paces.

Just to be clear, I’m not going to provide an exhaustive survey of Neo4j or give instructions to get up and running – there’s a ton of stuff out there for that, a lot of which can be found simply by going to the Neo4j website.  Rather, I’m going to try and comment on what I think is interesting and/or useful about it as a tool. As per the title, I consider this very much the first part of many, so watch this space if you want to see more.

Suffice it to say, installation is extremely quick and easy and once running, you can connect by simply pointing your browser at localhost:7474, which gets you to the following:

[Screenshot: the Neo4j browser start page]

See that tiny white bar near the top of the screen with the “$” sign to the left?  That’s actually a query window, and through it you can interface directly with Neo4j using their own declarative query language, Cypher.

The install comes with a set of training presentations (which you can also see the links to in the above screenshot), but having run through the basic examples I decided to do my own thing rather than dive into their “Movie Graph” code.  Just a matter of different learning styles I guess, but I find that pushing myself to think up something to program can often (not always) help to get a firmer grasp of a language’s core features faster.

I was watching the excellent “Margin Call”, which is set in a large Wall Street firm at the beginning of the 2008 financial crisis, but focused on a handful of key employees as information about the coming crisis makes its way up the organisational hierarchy.  Employee networks?  Political alliances?  That sounds basically like graph-node stuff, the kind of thing that you’d end up doing as some kind of bastardised adjacency list structure in any standard relational system – yep, that’ll do.

So first of all let’s create some employees, with a Cypher clause called (unsurprisingly) “CREATE”:

CREATE (Seth : Employee {name : 'Seth Bregman', title : 'Junior Risk Analyst'}),
(Peter : Employee {name : 'Peter Sullivan', title : 'Senior Risk Analyst'}),
(Eric : Employee {name : 'Eric Dale', title : 'Head of Risk Management'}),
(Will : Employee {name : 'Will Emerson', title : 'Head of Trading'}),
(Sam : Employee {name : 'Sam Rogers', title : 'Head of Sales and Trading'}),
(Jared : Employee {name : 'Jared Cohen', title : 'Head of Capital Markets'}),
(Sarah : Employee {name : 'Sarah Robertson', title : 'Chief Risk Management Officer'}),
(John : Employee {name : 'John Tuld', title : 'Chief Executive Officer'}),
(Seth)-[:REPORTS_TO]->(Eric),
(Peter)-[:REPORTS_TO]->(Eric),
(Eric)-[:REPORTS_TO]->(Will),
(Will)-[:REPORTS_TO]->(Sam),
(Sam)-[:REPORTS_TO]->(Jared),
(Jared)-[:REPORTS_TO]->(John),
(Sarah)-[:REPORTS_TO]->(John);

With some combination of intuition and RTFM, it should be pretty clear what’s happening there – we’re creating a node variable for each employee, with some key-value pairs to set up properties (name, title), and then adding a “reports to” relationship where appropriate.  Note that relationships can also have properties and be assigned to variables using similar syntax to that for nodes, although I haven’t taken advantage of that here.  Now to fire it off:

[Screenshot: result of running the CREATE statement]

So okay – that’s definitely done something – but I thought Neo4j was supposed to give me a visualisation of my graph? Looks like I need to use the MATCH clause to find nodes (and note the use of RETURN – we have to do something with what the MATCH gives us, and in this case we want to get data back):

MATCH (ee:Employee)
RETURN ee;

[Screenshot: graph visualisation of the Employee nodes]

Ooookay…. that’s a bit more like it. The nodes kind of bounce around until you drag them to where you want them on the screen and clicking on each one brings up a property list.  You can see that there’s also a set of “Person” nodes defined in the legend, which exist in the database because of some previous training code I’ve run.  If I’d run “MATCH (ee) RETURN ee;” then they would have shown up as a separate network, but because I specified that I want nodes with the label “Employee”, I only get the relevant entries.

Basically that’s pretty cool, although I suspect bringing back a much larger network would start to make it less useful.  In fact, that seems like a good prompt to talk about filtering the graph, which can be done by using (at least) a couple of different techniques around the MATCH clause. Say we want to see only the employees who report directly to the CEO: we can do it in a kind of SQL style with a WHERE clause:

MATCH (LineReports:Employee)-[:REPORTS_TO]->(boss:Employee)
WHERE boss.name = 'John Tuld'
RETURN LineReports;

or in a way that feels a bit truer to the pattern-matching aspect of Cypher, like so:

MATCH (LineReports:Employee)-[:REPORTS_TO]->(boss:Employee {name: 'John Tuld'})
RETURN LineReports;

This gives us the same kind of visualisation (but thinned out) and in all cases there seems to be the option to view as a data table and/or export as JSON:

Neo4j-Screenshot-4

Neo4j-Screenshot-5-edited
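As one more illustration of filtering by pattern, here is a minimal sketch that goes a step beyond the queries above. It assumes the REPORTS_TO graph created earlier, and uses Cypher's variable-length relationship syntax (the asterisk) to follow the reporting chain to any depth rather than a single hop. The variable name "report" is just my choice for the example.

```cypher
// Everyone who reports to the CEO, directly or through any number of managers.
// The * after the relationship type matches paths of one or more hops.
MATCH (report:Employee)-[:REPORTS_TO*]->(boss:Employee {name: 'John Tuld'})
RETURN DISTINCT report.name, report.title;
```

Swapping the * for a range such as *1..2 should limit the match to at most two levels of management.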

Trust me when I say I’ve only scratched the surface of what Cypher can do there, even in terms of what I’ve managed to experiment with in the last 48 hours.  However, this post seems to be getting quite long enough already, and I hope you can already get the sense that I do, that this is potentially quite a nice language to work in.  It’s got a kind of cleanness to it, and a kind of functional-meets-imperative-meets-query feel that reminds me somewhat of LINQ (especially in the context of LINQPad). I’m definitely keen to keep trying out different features and to see how well it deals with more complex queries and a slightly greater volume of data.

One thing to notice from a SQL Server point of view (my home turf) is that there does seem to be a similar concept of cached execution plans, with the similarities extending as far as the recommended use of parameterised queries to maximise plan reuse – and it looks like there’s also a command-line profiler tool available to wring the best performance out of your queries.  More details of some of that here.  So as with a lot else in Neo4j, I would say there are enough familiar hooks to get you started, and enough new functionality and potential to keep things interesting.
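To make the point about parameterised queries a little more concrete, here is a hedged sketch of the earlier CEO filter with the name supplied as a parameter instead of a literal. The parameter name bossName is hypothetical, and the placeholder syntax depends on the Neo4j version: older releases wrote it in curly braces ({bossName}), while later ones use a dollar prefix ($bossName). The value itself is passed in separately by whichever client or driver runs the query.

```cypher
// The query text stays identical whichever boss we ask about, so a cached
// plan can be reused; only the value bound to bossName changes between runs.
// (Older Neo4j versions write the placeholder as {bossName} instead.)
MATCH (LineReports:Employee)-[:REPORTS_TO]->(boss:Employee {name: $bossName})
RETURN LineReports;
```

In more recent versions, prefixing a query like this with the PROFILE keyword should show the plan that was actually executed, which is a natural starting point for tuning.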

Overall – I would definitely recommend taking a look at what Neo4j can offer.  Getting a grasp of the basics doesn’t require a massive investment of time and the tools are pretty decent, although doubtless they will mature with time, and there’s already a good range of client libraries/ drivers available.  As and when I get a chance, I’ll probably take a look at some of these and also at the crucial question of how well it handles fully automated data imports from other sources.

Purity & Pragmatism

So okay… it’s Friday night, and while all and sundry are out making merry, I’m going to sit in and try to articulate some thoughts on the tension between theory and practice in Software Engineering. Never let it be said that I don’t know how to rock out…

I got my start in programming in a very practical and unglamorous way. I was working in a data analysis role a few years ago and the other team members had set up regular processes using VBA in Access and Excel, so I learnt how to hack my own modules together in an inelegant but more-or-less effective way. The code I produced was atrocious by any objective standard, but what it showed me was that a basic grasp of syntax could already make me far, far more productive. Oh and also that running Excel macros with screen updates turned on looks awesome.

I fairly quickly became a lot more serious about learning both languages/ tools and the underlying theory. Something just clicked with me about applying fundamental rules and approaches to solving technical problems, building up on a physical level from bits and bytes and on a logical level from predicates, transformations and carefully defined relationships that attempted to model the real world in code. I find it very rare nowadays that I come across a piece of technology for which I don’t want to know the roots and the implementation details, and I spend a lot of time trying to prioritize the many, many excellent books, articles and academic papers that demand my limited time and attention.

(Picture of lots of technical books)

(I’ve read maybe half of these, and have at least another twenty on my Amazon wish list)

So yes, the direction of travel for me has been almost exclusively towards greater theoretical understanding and in-depth knowledge. I hope I’m never arrogant enough to imply that I know it all, but I definitely find some frustration in situations where other people are not so much unable as unwilling or uninterested in following the same path. There’s a flip-side to that approach though, and developers like myself who take learning and personal development seriously need to be careful not to lose sight of the bigger picture.

I absolutely don’t want to name names, but I was just reading an article from a year or so ago where one SQL expert of some years’ standing was commenting on the work of another venerable guru of the database movement. Both men are clearly immensely smart and have contributed a lot to the field, but the tone and chosen topics of dispute left the unfortunate impression that one if not both of them were more interested in a) their own particular, very specific mode of data representation and manipulation and b) scoring points, than in making their ideas understood more widely and helping other people to work with relational databases.

I take this seriously because I recognise it in myself. Even as a working programmer, it’s incredibly easy to get sucked into this kind of intellectual one-upmanship. If I’m completely honest, some part of this attitude has its place. I’m sure the vast majority of professionals in our industry have experienced the constant pressure to deliver what the business wants as fast and as cheaply as possible, piling up and ignoring technical debt until software becomes hard to maintain and risky to change. We do have a responsibility to apply some polite-but-firm pressure in the opposite direction, insisting on standards, software craftsmanship and investment in creating a quality product that will stand the test of time. An obsessive focus, pride and attention to detail are necessary parts of this.

The point though, is as follows. We need to be working with, not against, the businesses we are a part of. We need to be responsive to their needs and understand the way in which every function of the organisation – technical and otherwise – is dependent on every other part in order to operate. It isn’t selling out if we accept that building something great, something that has a real impact, takes compromise and the maturity to understand other people’s points of view.

Whenever you came into software development, even if you’re decades into the game and were firing your way through recursion and pointer arithmetic in your early teens, you must be able to think back and remember when you had no idea how a computer worked. To all of us, born in the shadow of Alan Turing and the early greats of Computer Science, they were once just magic boxes that did amazing things. It’s important to remember that that perspective still makes sense – the things that computers can do have the greatest value because they can impact everyone, not just the smartest, most technically clued-up people in the room. By all means, get stuck into an intense debate about the relative merits of your favourite language, framework or feature (in fact, CC me or send me a link – I’m sure I can learn something from it) – but do remember why it is that anyone else takes this stuff halfway seriously in the first place.

Newness

So here we are.  A new domain, a brand spanking new WordPress blog.

I’m not going to have much to say today, but past experience of blogging and writing more generally has taught me that momentum is key.  Get something down on – well, not literally paper, but you get the picture – at all costs, and the rest will follow on.

I will of course try and make my posts here entertaining and thought-provoking, although your mileage may vary depending upon whether your interests converge with mine.  If they don’t…. what exactly was it you were doing here in the first place?

For the record, I expect to be talking mostly about programming and Computer Science, with some light relief in the form of music, culture etc. and some change-is-as-good-as-a-rest earnestness regarding politics, philosophy and other such weighty matters.

Or, you know, I could just post up a load of cat pictures – whatever’s good ;-)