I’ve had Postgres installed at home for quite a while now but not really explored it in detail. As I’ve recently been digging more into open source software in general, and Python in particular (more on which later, probably), I’ve become more motivated to focus on a relational database which is itself open source and sits well within the rest of that stack.
For those that don’t know much Postgres already, you can see a bit on the general history etc. of the project here. What originally interested me about it is that it has a reputation as being more robust, fully-featured and standards-compliant than MySQL (and/or MariaDB).
Over the last couple of weeks, I’ve started setting up Postgres as a back-end for some planned experiments with Python analytical libraries such as NumPy, SciPy and MatPlotLib. For database management and direct query execution, I’ve been using the standard GUI tool PGAdmin, and have also installed the community edition of Pentaho BI (“Kettle”) to do some simple ETL. Everything to date is here on GitHub and I’ll be trying to keep the repository up-to-date as I learn more.
I could of course spend ages talking about any part of that stack, but am really most keen to focus in on the Postgres database engine itself – and will do so for most of this post.
The only more general note I should make at this point is that I’ve been running all of the tools on Windows and, while they pretty much work, I have had one instance of memory corruption, forcing a restart of my laptop. I can’t pin that on any one piece of software in particular, but a solid rule would be that while development and evaluation is absolutely fine on any supported OS, the highest level of stability and performance for a production instance will be achieved by putting everything on a standard Linux distro such as Red Hat.
Getting started with Postgres then – there are some things that are familiar to a developer such as myself, used to working day-to-day with MS SQL Server. The SQL syntax is pretty close to the common implementation of the ANSI standard, although there are a couple of nice features not available in T-SQL such as “CREATE OR REPLACE …” and ” DROP IF EXISTS …” statements, which cut out some of the ugly boilerplate required in deployment scripts. Tools like query plan analysis also seem to exist in a broadly similar form, although no doubt the actual art of performance tuning will not be identical.
Taking vanilla SQL statements as a given though, the differences are more important if one is going to be working with Postgres in any depth – and they are fairly significant.
The most major one in my view is that Postgres by default uses Multi-Version Concurrency Control (“MVCC”), which has a fair bit in common with the non-default “Snapshot” isolation level in SQL Server. Basically, multiple versions of rows are maintained as changes are made, meaning reads are never blocked by writes (although clearly if multiple concurrent writes occur, these will still need resolving consistently). The downside is an increased storage and maintenance overhead which must be handled by a process called “vaccuuming” (removing expired row versions) – it seems that anyone running a Postgres instance at scale would definitely need to understand the options to configure and monitor this process.
Secondly, clustered indexes (which are absolutely fundamental to MS SQL database design) simply don’t exist in the same way in Postgres. All indexes are essentially non-clustered. You can achieve physical ordering of the table using the CLUSTER statement, but this ordering is not automatically maintained and any data added or updated will not necessarily be in order until CLUSTER is again run. I’m sure there are some good reasons why this is the way it is (and it may even offer more flexibility in some cases) – but it’s absolutely something to be aware of.
Another difference in physical implementation is that Common Table Expressions (“CTEs”) are always materialised in a query. You can read some more detail on that here, but basically it means that certain important optimisations may not be available to the database engine in Postgres, when they would be applied in MS SQL. If like me you are a big fan or CTEs, that might mean a shift in behaviour from using CTEs in some cases to using “normal” subqueries (which are not materialised). Not a massive incovenience, but definitely a slight decrease in expressive power and something you need to be aware of to avoid unexpected performance bottlenecks.
My final very minor gripe with Postgres – before we get on to the good/ great/ awesome stuff – is that they haven’t implemented the ANSI SQL “MERGE” statement (see Microsoft’s version here), or any alternative form of “Upsert” syntax. Not the end of the world, but it is very useful and would be a welcome addition.
So what is really good about Postgres? Well, it’s very early days, but I’m excited about the programming possibilities it offers. The standard procedural language, PLPGSQL, seems pretty powerful – for example, functions that modify data automatically scope themselves as a transaction and roll back on failure. Beyond that, functions can be alternatively written in C as standard, and in Python (or Perl or TCL) with what sounds like minimal extra effort to install a widely-used plugin/extension.
Really though, the thing I noticed this time round was the greater power of the type system, when compared to SQL Server. On a strictly ANSI level, things like having distinct INTERVAL types for time calculations, and enforcing them more strongly than in MS SQL, hint at a more reliable type-based style of programming available in PLPGSQL, overloading standard arithmetic operators rather than relying on proprietary hacks like Microsoft’s DATEADD, etc. I’ll need to explore this further but it seems both interesting and useful.
Widening our scope a bit, array types (e.g. int) are supported, including as column types, which could be argued to be slightly unorthodox in terms of First Normal Form (definitely two sides to that argument – I won’t expand further here!) but certainly gives more options. JSON is also directly supported, and Postgres seems to be positioned now as being a convenient and tunable mid-point between NoSQL and pure relational storage (e.g. – see here).
If I had to take a stab at drawing a common theme from all the above differences between the Microsoft and Postgres ways of evolving a relational database, it does seem to me that they are optimised towards two different markets.
MS SQL Server is unsurprisingly well-suited to a pure Windows environment, but (perhaps biased by what I typically work on professionally), it does strike me that a lot of the things SQL Server is good at lean towards traditional data-warehousing and ETL, integrating tightly with SSIS, SSRS and SSAS to form a coherent BI stack. Features like MERGE and an optimisable implementation of CTEs are extremely useful when loading large volumes of data in a batched manner, and will have made sense for Microsoft to implement.
Postgres, on the other hand, seems very well suited to a niche taking data from a modern web application architecture. Programmers seeking to push this kind of system, bouncing back and forth between the two layers in an open-source environment and bending the data storage layer to meet the needs of the front-end, will probably find Postgres gives them more (and more appropriate) options to do so.
Of course – I’m hoping to learn a lot more about Postgres over the next few months, and may well change my views on some or all of the above. Nonetheless, the last couple of weeks feel like a good start – and I’ll try and write more as I go.