Here at Atlassian, we make good use of Open Source products. Not just because we can re-use code, but often because the Open Source products are just as good, or better, than commercial products.
One good example is our use of PostgreSQL. Prior to working at Atlassian, I typically used Microsoft SQL Server or Oracle as an enterprise database server, and Microsoft Access or even (gasp!) Microsoft Excel for quick & dirty situations. They don’t cut the mustard around Atlassian, where we prefer the power of PostgreSQL, which is an incredible enterprise-capable database. (Our co-CEO, Mike Cannon-Brookes, tells me that many startups use MySQL until they grow large enough to hit its limitations, then move across to PostgreSQL).
Okay, back to the story at hand…
We’re in the process of improving our online ordering system (hooray!) and need to migrate a lot of our business data to a new schema, which basically involves lots of data transformations and some simple business logic. Rather than writing a throw-away Java app to perform the migration, we decided to do it in the PostgreSQL native programming language, known as pl/pgsql (“programming language/PostgreSQL”).
It’s truly amazing what can be done in such a language. Database statements can be intermixed with function variables, FOR loops can iterate across the results of database queries and even table names can be used to define variable types.
Unfortunately, performance appeared to be a bit of a problem.
We needed to migrate 14,000 records and it appeared that they would take over two seconds each. That’s about 9 hours to migrate the data — not too hot.
So, I did the hacker thing… I inserted timing points and had the system write the timing to a database table (real easy, since I can just type an INSERT statement directly into the procedure!).
The results looked like this:
I had written the procedures to select one input record at a time, then transform the data. While this is fine for testing, the 0.33 seconds to select a record was clearly not going to scale. So, I replaced it with a FOR loop iterating over the input dataset, which meant that it only executed once, regardless of the number of records being migrated.
Interestingly, that ending up caching some information so process A and C suddenly worked a lot faster, getting it down to a total of 0.64 seconds — or about 2.5 hours to migrate.
I then studied Process B and found that it was re-selecting data from the input record. While this was clean in terms of writing procedural code, it was not optimal in terms of execution speed. This immediately reminded me of a recent IT Conversations Podcast, where Rasmus Lerdorf (creator of PHP) said that web applications would be rejected at Yahoo! if they required more than a couple of database queries per screen.
So, I re-wrote the procedure calls, passing a database record as a parameter instead of a primary key. This is an example of the power of programming directly within a database language — I simply provided a table name as the data type of an input variable, and the database knew exactly what it meant!.
By avoiding this re-querying, I got down to this:
This gave a total of about 1.4 hours, 15% of the original time required, which is well within the ‘acceptable’ range for our migration run. It also means I’ll get back several hours of personal time on our migration weekend, so I can play more World of Warcraft!
- Database programming languages are quite powerful
- Use timing points to focus optimization effort
- Every database query avoided is time saved
- PostgreSQL rocks!