Now With GitHub + Pointers

I’ve just set up an account on GitHub – not much to see on there other than one very basic C source file, from one of the exercises in Chapter 1 of Kernighan & Ritchie – but watch this space.

Why GitHub?

Pretty normal reasons I guess:

  • The number of (strictly non-work-related) code snippets I’ve been emailing back and forth between home and work was getting a bit silly.
  • If I ever do put anything decent on there it acts as a portfolio of sorts, and it’s nice to be able to discuss code by linking to fully versioned examples.
  • I use TFS at work and it seems like a good idea to see what else is out there.  I’m by no means a Microsoft hater but it’s not hard to suppose they don’t get everything right.

Why C/C++?

That’s a slightly more difficult one, or at least more a matter of personal preference.  I of course have a ton of cool languages and technologies I want to get to grips with, so with the modern trend being towards higher-level, more expressive idioms, why am I starting to learn a language where you don’t even get a boolean data type out of the box?

Well, to keep it short and sweet, I really like to understand how things work in detail.  I have to admit to being somewhat influenced by Joel Spolsky’s vintage posts on leaky abstractions and the perils of JavaSchools – macho nerd-elitism aside, a vast amount of the software we use today is in some way built on top of C/C++.  Relational databases, operating systems, frameworks and VMs, IDEs – all the things that are really fundamental to what we do as programmers tend to need at least some access to the low-level power and control that direct allocation of bytes provides.

Just using these tools day-to-day, I’m sure I can benefit from understanding their implementation a little better, and even in today’s declarative, cheap-hardware world, I suspect there are times when JIT-compiled or interpreted languages just aren’t going to cut it.

And even if there aren’t, it’s a good intellectual challenge, which is kind of what we’re all after anyway, isn’t it?

Graph Data Modelling (“Obligatory Neo4j” Part 2)

First of all – a quick “thank you” to the guys at Neo Technology and anyone else who commented on/ retweeted my last post.  It feels like there’s a friendly community around Neo4j and – let’s be honest – that’s as important as the software itself.

Secondly, I have further posts planned – this one is on data modelling, then I’m aiming to look at APIs and drivers, followed by internals and performance.  All will be posted to my Twitter feed as they are published.  Throughout, I’ll likely refer to the e-book “Graph Databases”, available for free here.  While it does focus heavily on Neo4j in parts, it also provides good general coverage of graph theory, graph databases and the wider NoSQL movement – all in only 200 pages.


So, to get on-topic – one of the major claims Neo4j make in the book is that their product has greater “Domain Affinity” than relational databases.  That is to say, it makes thinking about the real-world problem and thinking about the database you’ll use to model it much more like the same thing.

There are a couple of ways they go at this.  The first is to talk about “accidental complexity” in relational modelling – things that are necessarily part of the database but not the domain.  Think foreign keys and (especially) separate tables representing many-to-many relationships.  Working with relational technology every day, we may forget how counter-intuitive these techniques can be on first exposure.  We blur the line between relationships and properties, rather than treating the relationships as first-class concepts in their own right.

They also argue that performance considerations push us towards denormalisation of relational databases, which makes the data model less flexible in the face of change.  Given the challenging nature of relational schema changes in production, this can cause big problems.  Neo4j, conversely, prides itself not just on facilitating better data models, but on allowing them to change in a more agile, iterative way.

While the abstract concept of “accidental complexity” certainly interests me, the more straightforward way they explain it is that you can go straight from white-board sketches of your business domain to the “ASCII art” structure of the Cypher query language, without really having to re-think or re-factor what you’ve already worked out.  That, I thought, sounds like a challenge!

Here, then, is a small, quick-and-dirty domain model I drew up, representing some sort of generic retail data set:

Neo4j shopping data model 1

I was then able to convert it into the following code:

CREATE (bob:Customer {name:"Bob"}), (jim:Customer {name:"Jim"}),
(leedsstore:Store {branch:"Leeds"}),
(york:City {name:"York"}), (manchester:City {name:"Manchester"}),
(yorkshire:County {name:"Yorkshire"}), (greatermanchester:County {name:"Greater Manchester"}),
(bob)-[:SHOPS_AT]->(leedsstore), (jim)-[:SHOPS_AT]->(leedsstore),
(bob)-[:LIVES_IN]->(york), (york)-[:LOCATED_IN]->(yorkshire),
(jim)-[:LIVES_IN]->(manchester), (manchester)-[:LOCATED_IN]->(greatermanchester),
// Assumed link: the store needs a LOCATED_IN path to its county for the county comparison further down
(leedsstore)-[:LOCATED_IN]->(yorkshire);

The resulting graph displays in the Neo4j browser like so:

Neo4j shopping data model 2

I would have to agree that in the process of doing this, the challenge was to get the “real-world” model right, not to code it into Neo4j.  The step I often have to go through when working with SQL Server – of “this is how it works, now what are the required tables?” – was a non-issue – my only question was whether I had the right data in there for my business queries.

In fact, I noticed while looking back at the model that I could easily have put more detail into the address hierarchy; this is something that should be relatively easy to fix in Neo4j, and indeed they talk about “cross-domain” modelling where you can have (for example) a detailed set of geographical data nodes/ relationships which interact with other data elements as required.

Taking the model a bit further, suppose I want to look at customers who shop at the Leeds store but don’t live in the same county.  Maybe it’s my only store in the North of England and has been doing well, so I want to open another branch and need ideas on where to locate it.  This again results in some fairly intuitive Cypher code, which will return customer “Jim” and city “Manchester”:

MATCH (cust:Customer)-[:SHOPS_AT]->(:Store {branch:"Leeds"})-[:LOCATED_IN*1..4]->(c_str:County),
(cust)-[:LIVES_IN]->(city:City)-[:LOCATED_IN*1..4]->(c_cust:County)
WHERE c_str <> c_cust
RETURN cust, city;

Note the variable-length paths (e.g. “[:LOCATED_IN*1..4]”) – these mean that if I do later add more address details to the model, the query will work fine as long as all the connections between the store/ customer and the county are of the same type.  Variable-length paths often don’t perform especially well, however, so I’ve put a low upper bound on the variability and am only using them because I know they may be needed in this case.
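To make the idea of a bounded variable-length path a little more concrete, here is a rough in-memory sketch in Python. It is only an analogy for what the query asks for, not how Neo4j evaluates it: the `EDGES` list, the `follow` helper and the store-to-Yorkshire link are all assumptions made for illustration.

```python
from collections import deque

# Toy copy of the example graph as (start, relationship type, end) triples.
# The "Leeds store" -> "Yorkshire" link is assumed for this sketch.
EDGES = [
    ("Bob", "SHOPS_AT", "Leeds store"),
    ("Jim", "SHOPS_AT", "Leeds store"),
    ("Bob", "LIVES_IN", "York"),
    ("Jim", "LIVES_IN", "Manchester"),
    ("York", "LOCATED_IN", "Yorkshire"),
    ("Manchester", "LOCATED_IN", "Greater Manchester"),
    ("Leeds store", "LOCATED_IN", "Yorkshire"),
]
COUNTIES = {"Yorkshire", "Greater Manchester"}


def follow(start, rel_type, min_hops=1, max_hops=4):
    """Nodes reachable from start in min_hops..max_hops steps of rel_type."""
    found = set()
    frontier = deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops >= min_hops:
            found.add(node)
        if hops < max_hops:  # the upper bound keeps the search small
            for src, rel, dst in EDGES:
                if src == node and rel == rel_type:
                    frontier.append((dst, hops + 1))
    return found


def counties_of(node):
    """Counties reachable from node via 1..4 LOCATED_IN hops."""
    return follow(node, "LOCATED_IN") & COUNTIES


def out_of_county_customers(store):
    """(customer, city) pairs where the customer's county differs from the store's."""
    store_counties = counties_of(store)
    results = []
    for cust, rel, dst in EDGES:
        if rel == "SHOPS_AT" and dst == store:
            for src, rel2, city in EDGES:
                if src == cust and rel2 == "LIVES_IN":
                    if any(c not in store_counties for c in counties_of(city)):
                        results.append((cust, city))
    return results


if __name__ == "__main__":
    print(out_of_county_customers("Leeds store"))  # [('Jim', 'Manchester')]
```

If an extra address level is later slotted in between a city and its county, the same function still finds the county, provided every hop uses the same relationship type and the chain stays within the upper bound.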

I can also do a more useful reporting-style query that lists cities with an aggregate customer count, giving me a top 10 in descending order:

MATCH (cust:Customer)-[:SHOPS_AT]->(:Store {branch:"Leeds"})-[:LOCATED_IN*1..4]->(c_str:County),
(cust)-[:LIVES_IN]->(city:City)-[:LOCATED_IN*1..4]->(c_cust:County)
WHERE c_str <> c_cust
WITH city, COUNT(*) AS CustomerCount ORDER BY CustomerCount DESC
RETURN city, CustomerCount LIMIT 10;

With the tiny example data set I’ve created, this simply tells me there is one matching customer in Manchester – but with realistic production data, it could give me actionable information pretty fast.

We can see here though that the query code is already getting slightly more complicated – to get a top 10 by customer count, I’m having to use the WITH clause, which pipes results through for further querying.  There are examples significantly more complex than this in the e-book (notably on page 132), and doubtless even more so in some production codebases.  While I love the steps Neo Technology are taking to create a language in which data modelling can be more intuitive, I’d be a bit wary of assuming Cypher queries can remain completely “whiteboard-friendly” under situations of increasing complexity.
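For anyone who finds the WITH step in the top-10 query opaque, it amounts to a group, count, sort and limit pipeline over the matched rows. A loose Python analogy, assuming the matched rows arrive as (customer, city) pairs (the `top_cities` name and the sample rows are made up for illustration):

```python
from collections import Counter


def top_cities(rows, limit=10):
    """Group (customer, city) rows by city, count them, and keep the top `limit`."""
    counts = Counter(city for _customer, city in rows)
    # most_common sorts by count, highest first, and truncates to `limit`
    return counts.most_common(limit)


if __name__ == "__main__":
    # Hypothetical matched rows, not data from the example graph
    rows = [("Jim", "Manchester"), ("Ann", "Manchester"), ("Raj", "Bolton")]
    print(top_cities(rows))  # [('Manchester', 2), ('Bolton', 1)]
```

The grouping key here plays the role of the non-aggregated `city` column in the WITH clause, and the count plays the role of `CustomerCount`.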

As with SQL – which, in the main, is also a fairly intuitive, “plain-English” kind of language – a deeper understanding seems to be needed as detailed, accurate querying and good performance become more important.  I’ll maybe look at this in more detail in my post on performance and internals, but there are definitely cases where your implementation data model needs to be somewhat guided by knowledge of the underlying platform.  For example, if you have variations on the same relationship type (e.g. addresses for home and business), it’s more efficient to model separate -[:HOME_ADDRESS]-> and -[:BUSINESS_ADDRESS]-> relationships than to use properties, as in -[:ADDRESS {type:"Home"}]->, because of the way the underlying storage works.  This is completely manageable (the book has more details – pages 65-66) but certainly needs taking into consideration, and does have knock-on effects on what a good data model needs to look like.

Neo4j themselves are also refreshingly up-front about the importance of avoiding “lossy” data models – see the e-mail example on pages 50-61.

Before I wrap up – just one more constructive comment around the otherwise smooth process of Neo4j modelling and querying.  As far as I’ve been able to tell, the database engine doesn’t have any real native support for dates, which leaves you with three options:

  • Integer datestamp properties (fast performance but opaque at the database level, although should be easily convertible by any client code)
  • ISO8601 string representations as properties (clear but bulkier and perform less well)
  • A domain of nodes representing a time hierarchy (somewhat complex to set up and long-winded to query but apparently does offer some modelling and/or performance benefits – see pages 70-71 of the e-book)
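As a small illustration of what client code has to do under each of these options, here is a Python sketch using only the standard library. The function names are hypothetical, and the choice of whole seconds for the integer datestamp is an assumption; a real model might prefer milliseconds or a plain YYYYMMDD integer.

```python
from datetime import datetime, timezone


def to_epoch_seconds(dt):
    """Option 1: compact integer property, opaque when read in the database."""
    return int(dt.timestamp())


def from_epoch_seconds(seconds):
    """Convert the stored integer back into a UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def to_iso8601(dt):
    """Option 2: readable string property, bulkier than an integer."""
    return dt.isoformat()


def from_iso8601(text):
    """Parse a string produced by to_iso8601."""
    return datetime.fromisoformat(text)


def time_tree_path(dt):
    """Option 3: keys for a Year -> Month -> Day chain of nodes."""
    return (dt.year, dt.month, dt.day)


if __name__ == "__main__":
    ordered = datetime(2014, 3, 1, 9, 30, tzinfo=timezone.utc)
    print(to_iso8601(ordered))      # 2014-03-01T09:30:00+00:00
    print(time_tree_path(ordered))  # (2014, 3, 1)
```

Either of the first two forms converts back to a proper date in one call on the client side, which is really the point of the first two bullets: the trade-off is about readability and size in the database rather than about what the application can recover.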

I don’t think the question of how to handle dates should put anyone off using Neo4j but – especially given the potential for graph modelling to be appreciated by analysts and business users as well as software specialists – it would be great if it could become even easier to use dates directly in Cypher at some point in the future.

Overall, Neo4j is clearly set up to exploit some very natural ways of modelling data, and to build upon the mature and well-understood field of Graph Theory.  I haven’t even mentioned the fascinating concept of “Triadic Closure” – which you can read about in Chapter 7 or see in this video by Neo’s Jim Webber – or the built-in support for shortest-path algorithms such as Dijkstra and A*.  As per my last post, I can only really sign off with a strong recommendation to check Neo4j out and a hope that – in Jim’s words – the roadmap for the future is “more awesomeness”.

The Obligatory “I’m Trying Neo4j” Post (Part 1)

I remember reading someone (more-or-less-jokingly) suggesting that all blog software should come with a default first post entitled “I’m Trying Linux”.  That was a few years ago though, and while open-source operating systems are still a topic of interest, the big buzz amongst cutting-edge nerds in the last couple of years has of course been around NoSQL databases.

There are numerous relatively straightforward document/ key-value store options – which, to simplify grossly, can be said to do the same sort of thing as relational databases, only with greater scalability, less impedance mismatch with object-oriented programming models and typically less strict guarantees on ACID properties.  But then there is also the one area which seems to really grab the attention of the old-school database crowd, because it offers not just to solve some of the more obvious issues with the relational model when applied to the web and OOP, but also to open up new possibilities in terms of data modelling.  That is to say, graph databases.

The clear leader in terms of mentions and mind-share in the graph space at the moment seems to be Neo4j.  Just in the last couple of days, I’ve had a couple of colleagues express an interest and noticed a post on the topic on one of the sites I read regularly, so now seemed like as good a time as any to finally install the free community edition and run it through its paces.

Just to be clear, I’m not going to provide an exhaustive survey of Neo4j or give instructions to get up and running – there’s a ton of stuff out there for that, a lot of which can be found simply by going to the Neo4j website.  Rather, I’m going to try and comment on what I think is interesting and/or useful about it as a tool. As per the title, I consider this very much the first part of many, so watch this space if you want to see more.

Suffice it to say, installation is extremely quick and easy and once running, you can connect by simply pointing your browser at localhost:7474, which gets you to the following:


See that tiny white bar near the top of the screen with the “$” sign to the left?  That’s actually a query window, and through it you can interface directly with Neo4j using their own declarative query language, Cypher.

The install comes with a set of training presentations (which you can also see the links to in the above screenshot), but having run through the basic examples I decided to do my own thing rather than dive into their “Movie Graph” code.  Just a matter of different learning styles I guess, but I find that pushing myself to think up something to program can often (not always) help to get a firmer grasp of a language’s core features faster.

I was watching the excellent “Margin Call”, which is set in a large Wall Street firm at the beginning of the 2008 financial crisis, but focused on a handful of key employees as information about the coming crisis makes its way up the organisational hierarchy.  Employee networks?  Political alliances?  That sounds basically like graph-node stuff, the kind of thing that you’d end up doing as some kind of bastardised adjacency list structure in any standard relational system – yep, that’ll do.

So first of all let’s create some employees, with a Cypher clause called (unsurprisingly) “CREATE”:

CREATE (Seth : Employee {name : 'Seth Bregman', title : 'Junior Risk Analyst'}),
(Peter : Employee {name : 'Peter Sullivan', title : 'Senior Risk Analyst'}),
(Eric : Employee {name : 'Eric Dale', title : 'Head of Risk Management'}),
(Will : Employee {name : 'Will Emerson', title : 'Head of Trading'}),
(Sam : Employee {name : 'Sam Rogers', title : 'Head of Sales and Trading'}),
(Jared : Employee {name : 'Jared Cohen', title : 'Head of Capital Markets'}),
(Sarah : Employee {name : 'Sarah Robertson', title : 'Chief Risk Management Officer'}),
(John : Employee {name : 'John Tuld', title : 'Chief Executive Officer'}),

With some combination of intuition and RTFM, it should be pretty clear what’s happening there – we’re creating a node variable for each employee, with some key-value pairs to set up properties (name, title), and then adding a “reports to” relationship where appropriate.  Note that relationships can also have properties and be assigned to variables using syntax similar to that for nodes, although I haven’t taken advantage of that here.  Now to fire it off:


So okay – that’s definitely done something – but I thought Neo4j was supposed to give me a visualisation of my graph? Looks like I need to use the MATCH clause to find nodes (and note the use of RETURN – we have to do something with what the MATCH gives us, and in this case we want to get data back):

MATCH (ee:Employee)
RETURN ee;


Ooookay…. that’s a bit more like it. The nodes kind of bounce around until you drag them to where you want them on the screen and clicking on each one brings up a property list.  You can see that there’s also a set of “Person” nodes defined in the legend, which exist in the database because of some previous training code I’ve run.  If I’d run “MATCH (ee) RETURN ee;” then they would have shown up as a separate network, but because I specified that I want nodes of type “Employee”, I only get the relevant entries.

Basically that’s pretty cool, although I suspect bringing back a much larger network would start to make it less useful.  In fact, that seems like a good prompt to talk about filtering the graph, which can be done by using (at least) a couple of different techniques around the MATCH clause. Say we want to see only the employees who report to the CEO: we can do it in a kind of SQL style with a WHERE clause:

MATCH (LineReports:Employee)-[:REPORTS_TO]->(boss:Employee)
WHERE boss.name = 'John Tuld'
RETURN LineReports;

or in a way that feels a bit truer to the pattern-matching aspect of Cypher, like so:

MATCH (LineReports:Employee)-[:REPORTS_TO]->(boss:Employee {name:'John Tuld'})
RETURN LineReports;

This gives us the same kind of visualisation (but thinned out) and in all cases there seems to be the option to view as a data table and/or export as JSON:



Trust me when I say I’ve only scratched the surface of what Cypher can do there, even in terms of what I’ve managed to experiment with in the last 48 hours.  However, this post seems to be getting quite long enough already, and I hope you can already get the sense that I do, that this is potentially quite a nice language to work in.  It’s got a kind of cleanness to it, and a kind of functional-meets-imperative-meets-query feel that reminds me somewhat of LINQ (especially in the context of LINQPad). I’m definitely keen to keep trying out different features and to see how well it deals with more complex queries and a slightly greater volume of data.

One thing to notice from a SQL Server point of view (my home turf) is that there does seem to be a similar concept of cached execution plans, with the similarities extending as far as the recommended use of parameterised queries to maximise plan reuse – and it looks like there’s also a command-line profiler tool available to wring the best performance out of your queries.  More details of some of that here.  So as with a lot else in Neo4j, I would say there are enough familiar hooks to get you started, and enough new functionality and potential to keep things interesting.
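To show why parameterised queries help with plan reuse, here is a deliberately simplified Python model. `ToyPlanCache` is a made-up stand-in that keys one “plan” on the exact query text; it is not Neo4j’s actual cache, and no database is involved. The `$name` placeholder is the parameter syntax of newer Neo4j releases (older ones wrote it as `{name}`).

```python
class ToyPlanCache:
    """Keeps one 'plan' per distinct query text and counts compilations."""

    def __init__(self):
        self.plans = {}
        self.compilations = 0

    def run(self, query_text, params=None):
        # A new query text means a new plan has to be compiled
        if query_text not in self.plans:
            self.compilations += 1
            self.plans[query_text] = f"plan #{self.compilations}"
        return self.plans[query_text], dict(params or {})


NAMES = ["John Tuld", "Sam Rogers", "Jared Cohen"]


def run_with_literals(cache):
    """Embed each name in the query text: every call looks like a new query."""
    for name in NAMES:
        cache.run(
            "MATCH (r:Employee)-[:REPORTS_TO]->"
            f"(:Employee {{name: '{name}'}}) RETURN r"
        )


def run_with_parameters(cache):
    """Keep the text constant and pass the name separately."""
    query = "MATCH (r:Employee)-[:REPORTS_TO]->(:Employee {name: $name}) RETURN r"
    for name in NAMES:
        cache.run(query, {"name": name})


if __name__ == "__main__":
    literal_cache, param_cache = ToyPlanCache(), ToyPlanCache()
    run_with_literals(literal_cache)
    run_with_parameters(param_cache)
    print(literal_cache.compilations, param_cache.compilations)  # 3 1
```

The same reasoning applies to SQL Server: the text of the statement is what gets matched against the cache, so keeping the values out of the text keeps the number of distinct statements down.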

Overall – I would definitely recommend taking a look at what Neo4j can offer.  Getting a grasp of the basics doesn’t require a massive investment of time and the tools are pretty decent, although doubtless they will mature with time, and there’s already a good range of client libraries/ drivers available.  As and when I get a chance, I’ll probably take a look at some of these and also at the crucial question of how well it handles fully automated data imports from other sources.

Purity & Pragmatism

So okay… it’s Friday night, and while all and sundry are out making merry, I’m going to sit in and try to articulate some thoughts on the tension between theory and practice in Software Engineering. Never let it be said that I don’t know how to rock out…

I got my start in programming in a very practical and unglamorous way. I was working in a data analysis role a few years ago and the other team members had set up regular processes using VBA in Access and Excel, so I learnt how to hack my own modules together in an inelegant but more-or-less effective way. The code I produced was atrocious by any objective standard, but what it showed me was that a basic grasp of syntax could already make me far, far more productive. Oh and also that running Excel macros with screen updates turned on looks awesome.

I fairly quickly became a lot more serious about learning both languages/ tools and the underlying theory. Something just clicked with me about applying fundamental rules and approaches to solving technical problems, building up on a physical level from bits and bytes and on a logical level from predicates, transformations and carefully defined relationships that attempted to model the real world in code. I find it very rare nowadays that I come across a piece of technology for which I don’t want to know the roots and the implementation details, and I spend a lot of time trying to prioritize the many, many excellent books, articles and academic papers that demand my limited time and attention.

(Picture of lots of technical books)

(I’ve read maybe half of these, and have at least another twenty on my Amazon wish list)

So yes, the direction of travel for me has been almost exclusively towards greater theoretical understanding and in-depth knowledge. I hope I’m never arrogant enough to imply that I know it all, but I definitely find some frustration in situations where other people are not so much unable as unwilling or uninterested in following the same path. There’s a flip-side to that approach though, and developers like myself who take learning and personal development seriously need to be careful not to lose sight of the bigger picture.

I absolutely don’t want to name names, but I was just reading an article from a year or so ago where one SQL expert of some years’ standing was commenting on the work of another venerable guru of the database movement. Both men are clearly immensely smart and have contributed a lot to the field, but the tone and chosen topics of dispute left the unfortunate impression that one if not both of them were more interested in a) their own particular, very specific mode of data representation and manipulation and b) scoring points, than in making their ideas understood more widely and helping other people to work with relational databases.

I take this seriously because I recognise it in myself. Even as a working programmer, it’s incredibly easy to get sucked into this kind of intellectual one-upmanship. If I’m completely honest, some part of this attitude has its place. I’m sure the vast majority of professionals in our industry have experienced the constant pressure to deliver what the business wants as fast and as cheaply as possible, piling up and ignoring technical debt until software becomes hard to maintain and risky to change. We do have a responsibility to apply some polite-but-firm pressure in the opposite direction, insisting on standards, software craftsmanship and investment in creating a quality product that will stand the test of time. An obsessive focus, pride and attention to detail are necessary parts of this.

The point though, is as follows. We need to be working with, not against, the businesses we are a part of. We need to be responsive to their needs and understand the way in which every function of the organisation – technical and otherwise – is dependent on every other part in order to operate. It isn’t selling out if we accept that building something great, something that has a real impact, takes compromise and the maturity to understand other people’s points of view.

Whenever you came into software development, even if you’re decades into the game and were firing your way through recursion and pointer arithmetic in your early teens, you must be able to think back and remember when you had no idea how a computer worked. To all of us, born in the shadow of Alan Turing and the early greats of Computer Science, they were once just magic boxes that did amazing things. It’s important to remember that that perspective still makes sense – the things that computers can do have the greatest value because they can impact everyone, not just the smartest, most technically clued-up people in the room. By all means, get stuck in to an intense debate about the relative merits of your favourite language, framework or feature (in fact, CC me or send me a link – I’m sure I can learn something from it) – but do remember why it is that anyone else takes this stuff halfway seriously in the first place.


So here we are.  A new domain, a brand spanking new WordPress blog.

I’m not going to have much to say today but past experience of blogging and writing more generally has taught me that momentum is key.  Get something down on – well, not literally paper, but you get the picture – at all costs, the rest will follow on.

I will of course try and make my posts here entertaining and thought-provoking, although your mileage may vary depending upon whether your interests converge with mine.  If they don’t…. what exactly was it you were doing here in the first place?

For the record, I expect to be talking mostly about programming and Computer Science, with some light relief in the form of music, culture etc. and some change-is-as-good-as-a-rest earnestness regarding politics, philosophy and other such weighty matters.

Or, you know, I could just post up a load of cat pictures – whatever’s good 😉