Another post, another huge time lapse – 16 months on this occasion.  The reasons, though, have been rather different.  I’m thinking of starting to write properly again, but felt there are a few things I wanted to mention that don’t really fit into the margins of a technical post.

A little over 16 months ago – a few days into my new job with Sky Betting & Gaming in Leeds – my dad had a fall at home and was rushed to hospital.  He was unconscious and it became clear fairly quickly that it was serious and he was not guaranteed to recover.  I left work, travelled down to Cambridge to see him in hospital – following surgery, we were told that he wasn’t going to make it.  He never woke up and died within 48 hours.

This is obviously quite personal and not something I plan to rake over in any detail in public, but equally it’s had a huge impact on me and it would be weird not to at least acknowledge it.  In some ways I’m very much the same person I was two years ago, in others very much not.

More recently, I reached the decision that SkyBet is not where I want to spend the next chunk of my career, and that a decade and a half in Leeds (since I started university there in September 2000) was enough.  In August of this year I moved down to London and am now working as Technical Lead for the Acquisition Marketing Data Engineering team in Islington.

Which (at a very high level) brings us up to date.  More to follow, probably.

My Next Step

Once again, it’s been an awfully long time since I last posted on here. I’m actually perfectly happy putting up what I consider to be decent, in-depth technical posts a couple of times a year – I’ve tried the “quantity over quality” approach to blogging elsewhere in the past and found the cons outweighed the pros. However, it does seem to make sense to put something down today to explain what I’ve been up to over the last few months, if only because the amount I’ve had on might excuse not finding time to write!

The biggest bit of news is that after three years working as a Database Developer/ Senior Dev at Callcredit Information Group (mostly with MS SQL Server), I’m moving on to a new role. Those reading past entries here or knowing me more generally will be aware that I’ve been interested for a while in alternative database technologies and open-source software. I’ve been looking for the last year or so at how I could take some of the experience I’ve already had with technologies like Hadoop and build on it. I was approached a couple of months ago about a position with Sky Betting & Gaming, and it became clear pretty fast that they are ahead of the curve, having made a big recent investment in moving their data warehouse onto the Hadoop stack. This includes not just the standard HDFS/ Hive (+ Impala) combination, but also Spark for analytics and some other likely prospects for the future. So – I’ll be joining them for my first day tomorrow, working as a Senior Data Engineer. Exciting times ahead!


(Oh, and I should probably give our HR team a quick plug – they’re still actively recruiting for a large number of interesting technical roles in Leeds city centre.)

My other project for 2015 Q1 has been to put together and deliver a presentation on NoSQL and the modern database landscape for the “Leeds Sharp” .NET user group. This was a great challenge, as it forced me to investigate a lot of databases I’d previously had only a casual interest in, in a deeper and more structured way, and to really think about how to make sense of the mass of information available.

I finally presented it a few weeks ago and was really pleased both at how friendly and supportive everyone was, and at the quality of questions, comments and discussion from the audience. If anyone is thinking of doing a relevant technical talk in the area, be it something very in-depth or a quick ten minutes on a mini-topic that has excited you – Leeds Sharp is potentially a great place to do it. Good ways to contact them and/or see what they’re up to include Meetup and Twitter – the next session is Thursday May 28th and celebrates their third birthday with a presentation on F#.


My presentation also felt like a nice way to end my time at Callcredit, who have consistently been supportive of developer training and events, and now host not one, not two but three distinct regular public events (the other two being the Leeds SQL Server User Group and Leeds Code Dojo). I was responsible (along with the inimitable and very dedicated Grant Crofton) for bringing Leeds Sharp in to host there, so it was about time I did a talk!

For those that are interested, slides and full demo code are available online here and I may also look at repeating the talk elsewhere in some form later in the year.

First(ish) Impressions of Postgres

I’ve had Postgres installed at home for quite a while now but not really explored it in detail.  As I’ve recently been digging more into open source software in general, and Python in particular (more on which later, probably), I’ve become more motivated to focus on a relational database which is itself open source and sits well within the rest of that stack.


For those that don’t already know much about Postgres, you can see a bit on the general history etc. of the project here.  What originally interested me about it is that it has a reputation for being more robust, fully-featured and standards-compliant than MySQL (and/or MariaDB).

Over the last couple of weeks, I’ve started setting up Postgres as a back-end for some planned experiments with Python analytical libraries such as NumPy, SciPy and Matplotlib.  For database management and direct query execution, I’ve been using the standard GUI tool pgAdmin, and have also installed the community edition of Pentaho Data Integration (“Kettle”) to do some simple ETL.  Everything to date is here on GitHub and I’ll be trying to keep the repository up-to-date as I learn more.

I could of course spend ages talking about any part of that stack, but am really most keen to focus in on the Postgres database engine itself – and will do so for most of this post.

The only more general note I should make at this point is that I’ve been running all of the tools on Windows and, while they pretty much work, I have had one instance of memory corruption, forcing a restart of my laptop.  I can’t pin that on any one piece of software in particular, but a solid rule would be that while development and evaluation are absolutely fine on any supported OS, the highest level of stability and performance for a production instance will be achieved by putting everything on a standard Linux distro such as Red Hat.

Getting started with Postgres then – there are some things that are familiar to a developer such as myself, used to working day-to-day with MS SQL Server.  The SQL syntax is pretty close to the common implementation of the ANSI standard, although there are a couple of nice features not available in T-SQL, such as “CREATE OR REPLACE …” and “DROP IF EXISTS …” statements, which cut out some of the ugly boilerplate required in deployment scripts.  Tools like query plan analysis also seem to exist in a broadly similar form, although no doubt the actual art of performance tuning will not be identical.
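
To make that concrete, here’s a minimal sketch (the table and function names are invented purely for illustration) – in T-SQL, each of these would typically need an IF OBJECT_ID(…) existence check wrapped around it:

    -- Postgres: idempotent deployment without explicit existence checks
    DROP TABLE IF EXISTS staging_orders;

    CREATE TABLE staging_orders (
        order_id   integer PRIMARY KEY,
        placed_at  timestamp NOT NULL
    );

    -- Re-running this simply replaces the existing function body
    CREATE OR REPLACE FUNCTION order_count() RETURNS bigint AS $$
        SELECT count(*) FROM staging_orders;
    $$ LANGUAGE sql;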

Taking vanilla SQL statements as a given though, the differences are more important if one is going to be working with Postgres in any depth – and they are fairly significant.

The biggest one in my view is that Postgres by default uses Multi-Version Concurrency Control (“MVCC”), which has a fair bit in common with the non-default “Snapshot” isolation level in SQL Server.  Basically, multiple versions of rows are maintained as changes are made, meaning reads are never blocked by writes (although clearly if multiple concurrent writes occur, these will still need resolving consistently).  The downside is an increased storage and maintenance overhead which must be handled by a process called “vacuuming” (removing expired row versions) – it seems that anyone running a Postgres instance at scale would definitely need to understand the options to configure and monitor this process.
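
I haven’t had to tune any of this in anger yet, but as a rough sketch of the knobs available (reusing the hypothetical table from the earlier example):

    -- Manually vacuum one table and refresh its statistics
    VACUUM (VERBOSE, ANALYZE) staging_orders;

    -- Ask autovacuum to kick in more aggressively for a heavily-updated table
    ALTER TABLE staging_orders SET (autovacuum_vacuum_scale_factor = 0.05);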

Secondly, clustered indexes (which are absolutely fundamental to MS SQL database design) simply don’t exist in the same way in Postgres.  All indexes are essentially non-clustered.  You can achieve physical ordering of the table using the CLUSTER statement, but this ordering is not automatically maintained and any data added or updated will not necessarily be in order until CLUSTER is again run.  I’m sure there are some good reasons why this is the way it is (and it may even offer more flexibility in some cases) – but it’s absolutely something to be aware of.
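
A quick sketch of what that looks like in practice (index name invented):

    -- One-off physical reordering of the table by an existing index
    CREATE INDEX idx_orders_placed_at ON staging_orders (placed_at);
    CLUSTER staging_orders USING idx_orders_placed_at;

    -- New and updated rows are NOT kept in this order; the command has to be
    -- re-run periodically to restore the physical ordering
    CLUSTER staging_orders;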

Another difference in physical implementation is that Common Table Expressions (“CTEs”) are always materialised in a query.  You can read some more detail on that here, but basically it means that certain important optimisations may not be available to the database engine in Postgres, when they would be applied in MS SQL.  If like me you are a big fan of CTEs, that might mean a shift in behaviour from using CTEs in some cases to using “normal” subqueries (which are not materialised).  Not a massive inconvenience, but definitely a slight decrease in expressive power and something you need to be aware of to avoid unexpected performance bottlenecks.
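
As a minimal illustration of the sort of rewrite I mean (assuming an index on placed_at, as created above), the first form below acts as an optimisation fence in Postgres, while the second leaves the planner free to push the date filter down:

    -- CTE version: materialised, so the filter is applied after the fact
    WITH recent AS (
        SELECT order_id, placed_at FROM staging_orders
    )
    SELECT * FROM recent WHERE placed_at >= date '2015-01-01';

    -- Subquery version: the planner can push the filter into the subquery
    SELECT *
    FROM (SELECT order_id, placed_at FROM staging_orders) AS recent
    WHERE placed_at >= date '2015-01-01';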

My final very minor gripe with Postgres – before we get on to the good/ great/ awesome stuff – is that they haven’t implemented the ANSI SQL “MERGE” statement (see Microsoft’s version here), or any alternative form of “Upsert” syntax.  Not the end of the world, but it is very useful and would be a welcome addition.
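
For comparison, this is roughly the shape of the T-SQL statement I’m missing (schema and column names invented for illustration):

    -- SQL Server MERGE: insert or update in a single atomic statement
    MERGE INTO dbo.Customers AS target
    USING staging.Customers AS source
        ON target.CustomerId = source.CustomerId
    WHEN MATCHED THEN
        UPDATE SET target.Email = source.Email
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerId, Email)
        VALUES (source.CustomerId, source.Email);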

So what is really good about Postgres?  Well, it’s very early days, but I’m excited about the programming possibilities it offers.  The standard procedural language, PL/pgSQL, seems pretty powerful – for example, functions that modify data automatically scope themselves as a transaction and roll back on failure.  Beyond that, functions can alternatively be written in C as standard, or in Python (or Perl or Tcl) with what sounds like minimal extra effort to install a widely-used plugin/extension.
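
A minimal PL/pgSQL sketch, again using my made-up table – if anything in the body raises an error, the whole call is rolled back:

    CREATE OR REPLACE FUNCTION add_order(p_order_id integer, p_placed_at timestamp)
    RETURNS void AS $$
    BEGIN
        INSERT INTO staging_orders (order_id, placed_at)
        VALUES (p_order_id, p_placed_at);
    END;
    $$ LANGUAGE plpgsql;

    SELECT add_order(42, now()::timestamp);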

Really though, the thing I noticed this time round was the greater power of the type system compared to SQL Server.  On a strictly ANSI level, things like having distinct INTERVAL types for time calculations, and enforcing them more strongly than MS SQL does, hint at a more reliable type-based style of programming in PL/pgSQL – overloading standard arithmetic operators rather than relying on proprietary hacks like Microsoft’s DATEADD.  I’ll need to explore this further, but it seems both interesting and useful.
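
For example (dates picked at random):

    -- Subtracting two timestamps yields an INTERVAL...
    SELECT timestamp '2015-05-28 09:00' - timestamp '2015-05-01 17:30';
    -- 26 days 15:30:00

    -- ...and an interval can be added back with an ordinary operator,
    -- rather than a proprietary function like DATEADD
    SELECT timestamp '2015-05-01 17:30' + interval '3 days 2 hours';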

Widening our scope a bit, array types (e.g. int[]) are supported, including as column types, which could be argued to be slightly unorthodox in terms of First Normal Form (definitely two sides to that argument – I won’t expand further here!) but certainly gives more options.  JSON is also directly supported, and Postgres seems to be positioned now as being a convenient and tunable mid-point between NoSQL and pure relational storage (e.g. – see here).
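
A quick sketch of both features together (all names invented):

    -- A column can itself be an array or a JSON document
    CREATE TABLE products (
        product_id  serial PRIMARY KEY,
        tags        text[],
        attributes  json
    );

    INSERT INTO products (tags, attributes)
    VALUES (ARRAY['widget', 'blue'], '{"weight_kg": 1.2, "supplier": "Acme"}');

    -- Query into both structures directly
    SELECT product_id
    FROM products
    WHERE 'widget' = ANY (tags)
      AND attributes ->> 'supplier' = 'Acme';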

If I had to take a stab at drawing a common theme from all the above differences between the Microsoft and Postgres ways of evolving a relational database, it does seem to me that they are optimised towards two different markets.

MS SQL Server is unsurprisingly well-suited to a pure Windows environment, but (perhaps biased by what I typically work on professionally), it does strike me that a lot of the things SQL Server is good at lean towards traditional data-warehousing and ETL, integrating tightly with SSIS, SSRS and SSAS to form a coherent BI stack.  Features like MERGE and an optimisable implementation of CTEs are extremely useful when loading large volumes of data in a batched manner, and will have made sense for Microsoft to implement.

Postgres, on the other hand, seems very well suited to the niche of storing data for a modern web application architecture. Programmers seeking to push this kind of system – bouncing back and forth between the two layers in an open-source environment and bending the data storage layer to meet the needs of the front-end – will probably find Postgres gives them more (and more appropriate) options to do so.

Of course – I’m hoping to learn a lot more about Postgres over the next few months, and may well change my views on some or all of the above.  Nonetheless, the last couple of weeks feel like a good start – and I’ll try and write more as I go.

Hello World

So yes, it’s been about six months since I last blogged.  I didn’t intend it that way but I’ve been pretty busy.

You probably don’t want to hear me bragging on about career stuff (that’s what LinkedIn is for, right?) but the short version is – promotion, leading on a major project using Hadoop, Linux and Python, blah blah blah🙂

Things have eased up just a little now so I’m looking forward to putting some more technical stuff up on here shortly, possibly even this week.

Watch this space… (and also this one).

Performance in Neo4j (“Obligatory Neo4j” Part 4)

This is my fourth post on the popular open-source graph database Neo4j, and I’ll be taking a look at some ideas about how performance can best be managed.  To see the full series, you can check out the Neo4j tag on this blog.

It’s been about a month since my last post.  In the meantime I’ve been busy, playing around with C/C++, Hadoop & Linux – but it’s definitely time to get back to looking at Neo4j.   This is the final part of my “Obligatory Neo4j” series – although almost certainly not my last word on the topic of Neo4j – and we’re now at the point where we should be able to picture some of the functional aspects of a system built on graph technology.  The question is, can we make it run quickly and efficiently?

As ever, there are already some good resources available from Neo Technology.  This includes not only the Graph Databases book, but also a video on Cypher query optimisation from Wes Freeman and Neo’s Mark Needham.

One thing that comes out of that video very clearly for me is that the query optimiser for Neo4j – while already useful – is going through a process of evolution that the equivalent functionality in relational databases such as SQL Server and Oracle more-or-less completed years ago.  When it comes to things like the ordering of WHERE clauses and the placement of filters in the MATCH or WHERE parts of the query, developers using Neo4j need to handle tuning in quite a manual way.

This differs from, say, SQL Server, where the cost-based optimiser will take a declarative query, evaluate various logically equivalent ways of running it and typically reach a “good enough” execution plan without detailed input from the user.  The statement in the video that “soon Cypher will do a lot of this for you” suggests that Neo4j will be heading in the same direction in future versions.

I’d tend to look on the bright side regarding this stage in Neo4j’s development – it’s an interesting time to start looking at the product, and most performance tuning cases should not be too difficult even now.  As the optimiser inevitably becomes more capable, tuning will only get easier, while anyone who has manually optimised queries will be left with a deeper insight into how the database engine works.

At this point, I’ll note a few principles that are very familiar from working with SQL Server:

  • Queries can typically be made to perform better by reusing execution plans – in particular by parameterising variables in submitted Cypher calls, which also protects against query injection (a minimal example follows this list).
  • More RAM is better – while it’s not a silver bullet, adding more memory to a server will almost always help performance, allowing better caching of plans and more head-room for processing and storing results.
  • Queries that perform scans of entire data structures (relational tables, graphs, labels etc.) are generally to be avoided if the results you want don’t include the entire data set.  Some form of selective seek or lookup is likely to be quicker and less resource-intensive.
  • These selective actions can be made easier by following a smart indexing strategy.
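
As promised above, here’s a minimal parameterised Cypher sketch – the labels, properties and parameter name are all invented for illustration, and in the Cypher versions I’ve been using parameters are written in curly braces.  The same plan can be reused for any value of {email}, and the value is never spliced into the query text:

    MATCH (c:Customer {email: {email}})-[:PLACED]->(o:Order)
    RETURN o.orderId, o.total;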

Diving deeper though, the differing nature of graph storage starts to come into play.  There are specific characteristics to the work that Neo4j has to do to return certain queries, which can be advantageous in some circumstances.

The key differences come from a divergence in the way data is structured on disk and in memory.  A relational database is built around tables (“relations”) as fundamental structures and, while indexes may help optimise particular paths to access the data, they are still typically “global” across an entire table.  Relationships between tables are navigated via join columns, which are typically included in one or more of these global indexes.  As the size of the tables grows, the time and cost to navigate them is likely to increase significantly, even with a good indexing strategy in place.

Neo4j takes a different approach, structuring data in such a way that relationships between nodes are completely first-class entities and aiming to achieve “index-free adjacency”.   The implementation basically takes the form of pointer arithmetic over contiguous arrays of fixed-size primary records, allowing constant-time random access to nodes and relationships and then linking to other resources such as properties as required.  A lot more detail on this model and the advantages it can have is available in Chapter 6 of the Graph Databases book.

Setting aside the fine detail, it all leads towards one thing.  That is to say, once one has found an entry point to the graph, the cost of accessing related nodes is proportional not to the total size of the graph database (which can be very large) but to the size of the area of the graph containing the data you’re actually interested in.  The potential advantages of this approach should be obvious.

Nothing comes for free though.  To reliably get this kind of high, scalable performance, it is necessary to design and query your Neo4j database in a way which works well with the underlying graph model and the associated data structures.

The resources provided by Neo Technology give some good hints on how to achieve this, but the following are definitely worth considering:

  • Make sure you understand the limits you’re able to place on the parts of the graph that will be used and searched.
  • Make sure you have explicitly created the index or indexes you need to support this restriction.
  • Check that executed Cypher queries actually behave in the way you expect; the “PROFILE” command is designed to help with this (see the sketch after this list).
  • Understand how different parts of your query work in isolation and consider piping only the required results between them using the WITH keyword.
  • Minimise the use of variable-length paths in pattern matches – e.g. “(node1)-[:RELATIONSHIP*1..5]-(node2)”.  While these make queries a lot more flexible and robust in the face of change, the database engine is forced to check more possible paths and hence do more work in executing the query.
  • If you want to have a variant of a relationship – e.g. home vs business address – there are two ways to do it:
    • adding a property to each relationship – “-[:ADDRESS {type:”Home”}]-“
    • adding two different relationship types – “-[:HOME_ADDRESS]-” and “-[:WORK_ADDRESS]-“

Because the primary relationship type is stored directly in the structure representing the graph, whereas properties are stored at one remove, the second option will typically perform better.

  • Add additional relationships.  Complex and multiple interactions between nodes are much easier to add in Neo4j than in a relational database, so in the above example we could theoretically have both sets of relationships, as long as we’re sure they’re useful and we don’t mind the extra overhead of coding and maintenance.

As a specific performance optimisation, Neo Technology actually recommend adding direct relationships between nodes which duplicate the information implicit in less direct relationships.  For example, if two people both worked at the same company, rather than having to go via the company node every time we can add a direct relationship “-[:WORKED_WITH]-“.  These secondary sets of relationships can be maintained asynchronously (as a frequently scheduled update job) in order to minimise the impact on initial write times when a node is created.
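
To make that concrete, here’s a rough Cypher sketch of both the shortcut relationship and the PROFILE and WITH points from the list above (all labels and relationship names other than WORKED_WITH are my own inventions):

    // Derive the shortcut relationship from the longer path – this could
    // run as a scheduled job rather than at initial write time
    MATCH (a:Person)-[:WORKED_AT]->(c:Company)<-[:WORKED_AT]-(b:Person)
    WHERE a <> b
    MERGE (a)-[:WORKED_WITH]-(b);

    // PROFILE shows how much work each part of a query actually does;
    // WITH pipes only the distinct colleagues on to the final stage
    PROFILE
    MATCH (a:Person {name: {name}})-[:WORKED_WITH]-(colleague)
    WITH DISTINCT colleague
    RETURN colleague.name;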

Beyond this list for working at a query level, there are other, lower-level options for performance improvement.  As mentioned in my last post, using the Traversal, Core or Kernel API rather than Cypher should allow progressively more focussed tuning of specific processes.

There are also some architectural and hardware-based options.  The commercially licensed version of Neo4j offers the possibility of master-slave replication for horizontal read-scaling, and a technique called cache sharding can be used to increase the chance of the required data already being cached in main memory.  There are more details on that here.

What has to be remembered when scaling out or considering performance more generally is that Neo4j is – in terms of NoSQL and the CAP Theorem – a well-performing but ultimately consistency-driven (roughly speaking, “CP”) system.  Read-scaling is one thing; high-end write scalability is an inherently hard problem which may better suit an availability-driven (“AP”) system using a key-value, document or column-oriented model.  These kind of systems explicitly accept a looser (“eventual”) definition of consistency in return for the ability to run across a very large number of distributed servers, and make choices about how to deal with the complexity and uncertainty that that creates.  Where this kind of extreme distribution of data storage (combined with high availability) is not required, Neo4j offers many other benefits in terms of greater expressivity, reliability and relative ease of use – and in most cases it seems like it should be able to perform very well if appropriately managed and tuned.

Talking To The Graph (“Obligatory Neo4j” Part 3)

This is my third post on the popular open-source graph database Neo4j, and I’ll be taking a look at APIs.  To see the full series, you can check out the Neo4j tag on this blog.

First of all, a quick reminder of our “set text”, the Graph Databases e-book from O’Reilly and Neo Technology.  Free, concise, full of useful info – what’s not to like?  Download it here.

To recap then… we’ve had a look at Neo4j and the Cypher query language.  We like what we see.  Expressive language, great possibilities for data modelling and a browser interface that draws us pretty pictures on demand:

[Images: Neo4j browser screenshot and shopping data model]

Awesome!  So all we need to do now is install Neo4j on a server somewhere and give our business users a big text file full of common Cypher queries.  That can’t possibly go wrong!

Sorry, what’s that?  We need to integrate the database with existing in-house systems built in some kind of statically typed, object-oriented language like Java or C#?  And the web team are saying they’ll need to connect from their back-end servers using something lightweight and dynamic like Ruby or Python.  Suddenly things are looking slightly more complicated…

Thankfully, this is not a problem as Neo4j offers a variety of ways to connect from other languages and systems.  The main route in is via the REST API, which (as you would expect) can be called from any language that can piece together the requisite lump of JSON and fire it across an HTTP connection.

There are actually at least a couple of different ways to structure REST calls, but the best way is typically going to be to send actual Cypher queries as part of the JSON payload.  This is because Cypher and a declarative approach to data interaction are powerful, well-supported and seen as a big part of the roadmap for future Neo4j development.  The queries sent can be parameterised for better performance and can be submitted either via the standard Cypher endpoint or via the Transactional endpoint, which allows an RDBMS-like atomic transaction to be scoped across multiple consecutive HTTP requests.
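
In the version I’ve been trying, the single-request “begin and commit” form of a transactional call looks roughly like this – the URL and the data model are just for illustration, and the exact path may differ between versions:

    POST http://localhost:7474/db/data/transaction/commit
    {
      "statements": [
        {
          "statement": "MATCH (c:Customer {email: {email}})-[:PLACED]->(o:Order) RETURN o",
          "parameters": { "email": "alice@example.com" }
        }
      ]
    }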

Naturally, users of various languages have decided that they want to consolidate their use of the REST API into libraries, and many of these are freely available, with links collected on the Neo4j website.  A lot of common languages and frameworks – Ruby/Rails, Python/Django, .Net, PHP – get at least a couple of options each, and there’s even one for Haskell if you’re feeling particularly masochistic.



I was actually planning to look quickly at options for a couple of different languages that I’m somewhat familiar with – a .NET one using C#, and one of the two available for Python.  The idea was that I could do different graph models which would have some relevance to the way the languages are perceived – something “Enterprisey” for .NET, and for Python probably some terrible joke about Spam.

It turned out though that once I got started coding in C#, I spent longer than I expected messing around with associating nodes with classes and trying to work out how I might use it in a real-world situation.  Always the way!  So the Python will have to wait, and in the meantime, courtesy of my new GitHub account, here is some code for an extremely basic implementation of an “Enterprise Widget Manager”, persisted in Neo4j.

So first of all, the library itself.  Simply because it was nearer the top of the list, I chose Neo4jClient over CypherNet for my first attempt.  Neo4jClient is the product of Tatham Oddie and Romiko Derbynew of .NET consultancy Readify.  It installs straightforwardly via NuGet (always a bonus) and should work with any other language that can run on the .NET CLR – it has certainly been tried out with F#.

Overall, it seemed to work out pretty well for me.  I managed to come out with some code which – while not especially elegant or sophisticated in terms of my own contribution – did run the Cypher queries I wanted against my install of Neo4j on localhost:7474.  The basic idea was to create a simple parts diagram (components used in multiple assemblies) – the sort of thing which, when scaled up, probably does lend itself better to graph modelling than to a relational database.

[Image: Neo4j WidgetManager graph]

The program runs as a simple console app, creating the graph pictured above (obviously I had to connect through the browser to get the visual version), then querying for and displaying a list of components used to make both “Widgets”:

[Image: Neo4j Widget CMD (console output)]

I found the documentation pretty helpful for getting up and running fast – it makes it very clear that you should read everything through properly before getting into any serious coding, but it also gave me enough examples and hints to make a quick start on trying it out.  I’m sure a lot of the other third-party connection options are also really good, and I may have a look at them in future.

Going Deeper

Some of you who are really observant may have noticed that, apart from one throwaway reference, I haven’t mentioned the possibility of connecting from Java.  This isn’t because it isn’t possible – quite the opposite.  Neo4j is implemented in Java, and in fact was originally designed as an embedded database for applications running on the JVM.  So if you want to get closer to the underlying database engine – whether for performance reasons or to tweak some functional aspect of the system that Cypher and the default REST API don’t give you access to – you’ll need to be comfortable working with the Java language, or at least something like Clojure which compiles to JVM Bytecode and can communicate directly with Neo4j in embedded mode.

There are three additional Neo4j APIs that Java provides access to:

  • The Traversal API
  • The Core API
  • The Kernel API

Moving down that list, each API gets further from the expressive, declarative modelling approach exemplified by Cypher, but in return allows you to work closer to the metal and permits a greater degree of fine-tuning and performance tweaking.

You can use these APIs when running the database in embedded mode, or there’s also the option to write custom “server extensions” to the REST API, using the Java APIs to redefine behaviour in response to specific REST calls.

The final thing you can do with Java is hack the code base of Neo4j itself.  It’s open source and Neo’s own Max De Marzi provides a great example of how to take advantage of that here.

Now With GitHub + Pointers

I’ve just set up an account on GitHub – not much to see on there other than one very basic C source file, from one of the exercises in Chapter 1 of Kernighan & Ritchie – but watch this space.

Why GitHub?

Pretty normal reasons I guess:

  • The amount of (strictly non-work-related) code snippets I’ve been emailing back and forth between home and work was getting a bit silly.
  • If I ever do put anything decent on there it acts as a portfolio of sorts, and it’s nice to be able to discuss code by linking to fully versioned examples.
  • I use TFS at work and it seems like a good idea to see what else is out there.  I’m by no means a Microsoft hater but it’s not hard to suppose they don’t get everything right.

Why C/C++?

That’s a slightly more difficult one, or at least more a matter of personal preference.  I of course have a ton of cool languages and technologies I want to get to grips with, so with the modern trend being towards higher-level, more expressive idioms, why am I starting to learn a language where you don’t even get a boolean data type out of the box?

Well, to keep it short and sweet, I really like to understand how things work in detail.  I have to admit to being somewhat influenced by Joel Spolsky’s vintage posts on leaky abstractions and the perils of JavaSchools – macho nerd-elitism aside, a vast amount of the software we use today is in some way built on top of C/C++.  Relational databases, operating systems, frameworks and VMs, IDEs – all the things that are really fundamental to what we do as programmers tend to need at least some access to the low-level power and control that direct allocation of bytes provides.

Just using these tools day-to-day, I’m sure I can benefit from understanding their implementation a little better, and even in today’s declarative, cheap-hardware world, I suspect there are times when JIT-compiled or interpreted languages just aren’t going to cut it.

And even if there aren’t, it’s a good intellectual challenge, which is kind of what we’re all after anyway, isn’t it?