HAMS is Atlassian’s order processing system; if you’ve ever bought an Atlassian product it’s HAMS that’s been doing the work in the back-end. HAMS has served us well, but is over 3 years old now and starting to show some wear, so we set aside August this year to attack some of the technical debt and upgrade the core engine. In a series of blog posts we’ll be describing some of the technologies and trade-offs in a financial-processing system.

OK, so now we’ve fixed our declarative transactions let’s revisit our integration test market-stall and see if things have improved:

The parable of the Merchant and the Customer…

And it came to pass that a merchant had three amphorae of oil, and he took them to market to sell.

And a customer approached the merchant wishing to buy two amphorae, and offered the merchant two shekels each for them. And the merchant agreed, saying “I could haggle, but that is a fair price and I am not a clichéd literary allusion”.

The customer was so happy with the price that he decided to buy the remaining amphora for another two shekels, but when he made to leave he realised that his donkey could not hold them. And he said to the merchant, “I must return one of these amphorae, for my donkey is an old beater and can’t hack it.”

So the merchant took one amphora back, but kept all six shekels. And the customer did cry out, saying “Verily, you have rippethed me off”

— Transactions 4:16

Someone always gets the short end…

There are two transactions happening here: the customer is giving the merchant his money, and the merchant is giving the customer his goods. In reality, however, they should be one transaction, and the reversal of one should result in the reversal of the other.

In HAMS we have the same issue. Although to our customers the purchase of a product appears as one operation, behind the scenes a number of different steps are being performed:

  1. The user’s credit-card is charged
  2. The database is updated with the user’s purchase and license details that they can retrieve via my.atlassian.com
  3. An invoice is emailed
  4. If it’s a hosted product, the service is provisioned
  5. Our accounting system is updated with the purchase details

All of these must occur, or none. However of all the operations above, only the database update follows proper transactional (commit/rollback) semantics. So somehow we need to express our non-transactional operations in terms of a transactional system. One method of doing this is to convert the operations into database commits; basically, for each operation that needs to be performed for the purchase you write a description to a table. Then you have a job that checks for these afterwards and performs the work. In the case of any problem the rollback will prevent these work-notes being written.
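The work-note idea above can be sketched in a few lines of Java. This is a minimal in-memory illustration, not HAMS code: `WorkNoteStore`, `UnitOfWork` and the note strings are hypothetical names, and two plain lists stand in for the real order table and work-note table.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a database holding both the
// order rows and a "work notes" table.
class WorkNoteStore {
    final List<String> orders = new ArrayList<>();    // committed order rows
    final List<String> workNotes = new ArrayList<>(); // committed work notes

    // The body of one transaction: it writes to staging lists only.
    interface UnitOfWork {
        void run(List<String> stagedOrders, List<String> stagedNotes);
    }

    // Runs the unit of work and publishes the staged rows only if it
    // finishes without throwing (commit). On failure the staged rows
    // are dropped (rollback), so no work note ever becomes visible.
    boolean inTransaction(UnitOfWork work) {
        List<String> stagedOrders = new ArrayList<>();
        List<String> stagedNotes = new ArrayList<>();
        try {
            work.run(stagedOrders, stagedNotes);
        } catch (RuntimeException e) {
            return false;
        }
        orders.addAll(stagedOrders);
        workNotes.addAll(stagedNotes);
        return true;
    }

    // What the background job would do: take the committed notes off
    // the table so that each one can be performed.
    List<String> drainNotes() {
        List<String> drained = new ArrayList<>(workNotes);
        workNotes.clear();
        return drained;
    }
}
```

The point of the sketch is that the note and the order row live or die together; the job that calls `drainNotes()` never sees a note for an order that was rolled back.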

However there is a simpler method; what we have just described is the outline of a queueing system, and we already have that in Java in the form of JMS, and JMS does support true transactional semantics. There are a number of Java implementations of JMS, the two main ones being ActiveMQ and HornetQ; we currently use HornetQ for reasons I’ll go into shortly. We use this to implement a simple but effective work queuing/retry system:

  • We define a number of events, with one queue per event type
  • Each queue has a listener that will receive the event and perform the work for that event (e.g. sending mails or provisioning hosted instances)
  • Each queue/listener is configured to retry the work a certain number of times (with a configurable delay) in case of problems
  • After the number of retries has been exceeded the event is sent to an ‘expired’ queue. This triggers an alert to the internal systems team to investigate the issue.
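To make the retry and expiry behaviour concrete, here is a small in-memory sketch. It does not use JMS or HornetQ at all; `RetryingEventQueue` is a hypothetical class, events are plain strings, and the redelivery delay is left out, so treat it as an illustration of the policy rather than of our configuration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

// In-memory stand-in for one event queue and its listener.
class RetryingEventQueue {
    private final Deque<String> pending = new ArrayDeque<>();
    private final List<String> expired = new ArrayList<>();
    private final int maxAttempts;
    private final Consumer<String> listener;

    RetryingEventQueue(int maxAttempts, Consumer<String> listener) {
        this.maxAttempts = maxAttempts;
        this.listener = listener;
    }

    void send(String event) {
        pending.add(event);
    }

    // Delivers every pending event; events that keep failing are
    // moved to the 'expired' list, where an alert would be raised.
    void deliverAll() {
        String event;
        while ((event = pending.poll()) != null) {
            if (!deliverWithRetries(event)) {
                expired.add(event);
            }
        }
    }

    private boolean deliverWithRetries(String event) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                listener.accept(event);
                return true;
            } catch (RuntimeException e) {
                // A real broker would wait for the configured
                // redelivery delay before trying again.
            }
        }
        return false;
    }

    List<String> expired() {
        return expired;
    }
}
```

In a real broker the attempt count and the delay are queue settings rather than code, but the outcome for the listener is the same: it is either eventually successful or the event ends up somewhere a human will look.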

We can use this queue/retry system to wrap the non-transactional units of work in a transactional event. If the overall transaction is successful the work will be performed later.

So now our order system looks like this:

  1. Charge the user’s credit-card
  2. Start a database transaction
  3. Start a JMS transaction
    1. Update the database
    2. Send JMS event for invoicing
    3. Send JMS event for provisioning hosted account
    4. Send JMS event to update the accounting system
  4. If all is OK commit the JMS and database transactions, otherwise roll them back

This is what we have in HAMS 2.x, and it generally works, but there are a couple of gotchas:

  • As the JMS event listeners often operate on the contents of the database you have to ensure the database commit is complete before the JMS events are sent.
  • In the case of a serious failure during the commit phase you can easily end up with a situation where the database has been committed but the JMS events not sent, or vice versa.

To counteract this HAMS 2.x had a number of checks to detect these situations, but fixing them up afterwards requires a lot of investigation and manual clean-up, which doesn’t scale.

What we really need is a method of operating on the database and JMS messages inside a single transaction. However the two systems are orthogonal; they share no common resources, and may even exist on different physical systems. So how can we bind the two together?

2-Phase commit and XA

Of course, this is hardly a unique problem as many systems, not least banking systems, need the ability to coordinate the updating of data. There are a number of algorithms designed to coordinate disparate cooperating systems, varying from simple check and commit up to complex distributed algorithms such as Paxos. However the most widely used is the 2-phase commit protocol and its corresponding standard, XA.

2PC/XA is fairly complex, mostly due to the myriad of failure modes that must be taken into account, but at the simplest level it works in the following way:

  1. There is a common transaction manager which all cooperating resources are aware of.
  2. At the start of a transaction the manager informs the resources that a transaction has started
  3. At the end of the transaction the manager informs the resources that the transaction has ended; the resources should then perform any work required to commit but not actually finalise it, then report their state back to the manager
  4. The manager then checks that all the resources are OK to proceed. If so, it issues the final commit message; if not, it issues a rollback.
  5. The resources commit, and return an acknowledgement, or a failure message.

At each step along the way the manager writes the status of the transaction to a log-file that it can use in the case of a serious failure. Obviously if the final commit fails for any resource things get a lot more complex, but for a well designed system such failures are rare.
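The steps above can be sketched as a toy coordinator. This is a deliberately simplified illustration and not how Atomikos or any real transaction manager is written: `Resource`, `RecordingResource` and the log strings are hypothetical, the log is a list rather than a durable file, and recovery from a failed final commit is ignored.

```java
import java.util.List;

class TwoPhaseCommitSketch {
    // Hypothetical stand-in for a cooperating resource.
    interface Resource {
        boolean prepare(); // phase 1: get ready to commit, vote yes/no
        void commit();     // phase 2: finalise the prepared work
        void rollback();   // phase 2: undo; must be safe if never prepared
    }

    // Runs one transaction over all resources and returns true if it
    // committed. The log list stands in for the manager's log file.
    static boolean run(List<? extends Resource> resources, List<String> log) {
        log.add("BEGIN");
        boolean allPrepared = true;
        for (Resource r : resources) {
            if (!r.prepare()) {
                allPrepared = false; // one "no" vote aborts everything
                break;
            }
        }
        // The decision is recorded before anyone is told to act on it.
        log.add(allPrepared ? "DECISION COMMIT" : "DECISION ROLLBACK");
        for (Resource r : resources) {
            if (allPrepared) {
                r.commit();
            } else {
                r.rollback();
            }
        }
        log.add("END");
        return allPrepared;
    }

    // Trivial resource used to try the coordinator out.
    static class RecordingResource implements Resource {
        final boolean vote;
        String state = "ACTIVE";

        RecordingResource(boolean vote) {
            this.vote = vote;
        }

        public boolean prepare() {
            if (vote) {
                state = "PREPARED";
            }
            return vote;
        }

        public void commit() {
            state = "COMMITTED";
        }

        public void rollback() {
            state = "ROLLED_BACK";
        }
    }
}
```

Running it with one resource that votes yes and one that votes no leaves both in `ROLLED_BACK`, which is exactly the behaviour we want from the database and JMS together.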

So in practice, what XA gives us is the ability to make the transactions of two separate resources act as if they belonged to a single resource.

XA in Java

The XA specification actually predates Java by several years and was targeted more at C systems, however its semantics have been translated into Java in the form of the Java Transaction API, or JTA. Like JMS, JTA is a pure-API specification with the implementation left up to vendors. There are a number of open-source implementations available and we evaluated the main players to ensure we had covered our options:

  • JOTM: This was the implementation for some time, however it is now moribund as a project.
  • Atomikos: This has an open-source and an ‘Extreme’ version, with a focus on web-transactions. Support is available. It works well but has a tendency to output spurious messages that can safely be ignored in practice.
  • Bitronix: A fully-open and simpler implementation; the easiest to setup.
  • JBossTS: Formerly Arjuna, soon to be Narayana, this is part of the JBoss application server. Support is available via RedHat. While it is technically possible to run this without the rest of the JBoss suite there is little documentation on doing so. However the rebranding à la HornetQ suggests they want to promote it as a standalone product.

In the end we went with Atomikos, however swapping it out with another implementation is fairly straightforward once you know the necessary incantations. I won’t go into the steps to set it up with Spring here as it is well documented elsewhere, however I will point out a few useful guidelines we picked up along the way:

  • Pay attention to the dependencies of the transaction-manager beans; not all the beans explicitly depend on each other, and if you have beans which access resources during initialisation you can end up with multiple transaction-managers being created. Spring’s ‘depends-on’ attribute can help you here.
  • Don’t use ActiveMQ with XA; the previous version of HAMS used ActiveMQ for queueing and it worked well. However, although it claims to support XA, it responds badly to suspending and resuming transactions; specifically, when using REQUIRES_NEW propagation ActiveMQ gets confused about which transaction is being acknowledged and you can end up with messages stuck on the queue.

So now our order system looks like this:

  1. Charge the customer’s credit-card
  2. Start an XA/JTA transaction
    1. Update the database
    2. Send JMS event for invoicing
    3. Send JMS event for provisioning hosted account
    4. Send JMS event to update the accounting system
  3. If all is OK commit the transaction, otherwise roll-back

In the case of roll-back, both the database and JMS operations will be reverted.

But what about the credit-card?

We still have one non-transactional part left, the credit-card payment. Converting it into a JMS message isn’t sufficient, as it’s possible the charge may be declined; 12% of credit-card transactions in our systems fail for various reasons.

Ultimately this is a pre-commitment issue; HAMS can’t truly guarantee to deliver product until the database commit is complete, but we can’t guarantee that the customer can pay until the card is charged. One option would be to charge the card, attempt the commit and then refund the charge if there is a problem; however in the event of e.g. a serious server crash during processing the card would remain charged and we may not even have a record of it.

One possibility is a common extension to the 2-phase commit algorithm called the “last-resource gambit”. This allows one (and only one) non-XA resource to participate in an XA transaction. This works by committing the non-XA resource at the last possible second; after the XA prepare phase but before the final commit. If the non-XA commit fails then the XA transaction is aborted. This requires the transaction-manager to be aware of this special resource, but most have some non-standard extensions to support this; unfortunately these tend to be specific to JDBC.
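As a rough illustration of the ordering involved, here is the gambit as a toy coordinator. The interfaces `TwoPhaseResource` and `OnePhaseResource` are hypothetical, and real transaction managers expose this through their own vendor-specific extensions rather than anything shaped like this.

```java
import java.util.List;

class LastResourceGambitSketch {
    // Hypothetical stand-in for an XA-capable resource.
    interface TwoPhaseResource {
        boolean prepare();
        void commit();
        void rollback();
    }

    // Hypothetical stand-in for the single non-XA resource: it can
    // only attempt its commit and report whether that worked.
    interface OnePhaseResource {
        boolean commit();
    }

    static boolean run(List<? extends TwoPhaseResource> xaResources,
                       OnePhaseResource last) {
        // Phase 1: every XA resource must be prepared first.
        boolean allPrepared = true;
        for (TwoPhaseResource r : xaResources) {
            if (!r.prepare()) {
                allPrepared = false;
                break;
            }
        }
        // The gambit: the non-XA commit is attempted only once the
        // others are prepared, and its outcome is the final vote.
        boolean commit = allPrepared && last.commit();
        // Phase 2: the XA resources follow whatever the last one did.
        for (TwoPhaseResource r : xaResources) {
            if (commit) {
                r.commit();
            } else {
                r.rollback();
            }
        }
        return commit;
    }

    // Trivial XA-style resource used to try the coordinator out.
    static class RecordingResource implements TwoPhaseResource {
        String state = "ACTIVE";

        public boolean prepare() {
            state = "PREPARED";
            return true;
        }

        public void commit() {
            state = "COMMITTED";
        }

        public void rollback() {
            state = "ROLLED_BACK";
        }
    }
}
```

Only one such resource can take part because its commit cannot be undone; a second non-XA resource would have nobody left to wait for.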

Luckily there is another method; some credit-card gateways support the concept of pre-authorisation. This effectively checks that the card is valid and has sufficient funds to cover the transaction; this amount is then temporarily reserved with an authorisation code, and this code is returned to the vendor (i.e. us). The vendor can then use this code to complete the charge (commit) or cancel the transaction (rollback). However the authorisation code has an expiry associated with it and will become invalid if not used. This last part is what helps us, as in the case of a catastrophic failure the customer will eventually have the funds freed up again.

So we can now update our order system:

  1. Pre-authorise the credit-card charge with the payment gateway and save the authorisation code
  2. Start an XA/JTA transaction
    1. Update the database
    2. Send JMS event for invoicing
    3. Send JMS event for provisioning hosted account
    4. Send JMS event to update the accounting system
    5. Send JMS event to commit the credit card charge
  3. If all is OK commit the transaction, otherwise roll-back
  4. If we rollback then de-authorise the credit card charge (which also happens automatically if the authorisation expires)
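Put together, the flow above looks roughly like the following. This is a sketch under several assumptions: `PaymentGateway`, `FakeGateway` and the event strings are made up for illustration, the XA transaction is reduced to a try/catch around a staging list, and the actual capture of the charge would be done later by the listener for the final event.

```java
import java.util.ArrayList;
import java.util.List;

class PreAuthOrderSketch {
    // Hypothetical gateway API; real gateways differ in the details.
    interface PaymentGateway {
        // Reserves the funds and returns an authorisation code.
        String preAuthorise(String cardToken, long amountCents);

        // Cancels a reservation that will not be captured.
        void release(String authCode);
    }

    // Returns the events that were "committed" for this order, or an
    // empty list if the order was rolled back.
    static List<String> placeOrder(PaymentGateway gateway, String cardToken,
                                   long amountCents, Runnable databaseUpdate) {
        // Step 1: pre-authorise and keep the code.
        String authCode = gateway.preAuthorise(cardToken, amountCents);
        // Step 2: everything in here stands in for the XA transaction.
        List<String> staged = new ArrayList<>();
        try {
            databaseUpdate.run();
            staged.add("INVOICE");
            staged.add("PROVISION");
            staged.add("ACCOUNTING");
            // The charge is completed later by this event's listener.
            staged.add("CAPTURE " + authCode);
        } catch (RuntimeException e) {
            // Step 4: rolled back, so free the customer's funds now
            // rather than waiting for the authorisation to expire.
            gateway.release(authCode);
            return new ArrayList<>();
        }
        // Step 3: commit; only now do the events become visible.
        return staged;
    }

    // Trivial gateway used to try the flow out.
    static class FakeGateway implements PaymentGateway {
        final List<String> released = new ArrayList<>();

        public String preAuthorise(String cardToken, long amountCents) {
            return "AUTH-" + amountCents;
        }

        public void release(String authCode) {
            released.add(authCode);
        }
    }
}
```

If the process dies between the pre-authorisation and the commit, nothing in this code gets to run `release`, which is where the expiry on the authorisation code does the job for us.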

With this mechanism in place we now have an atomic order process; either all the updates take place, or all of them are rolled-back.

Pre-authorisation as an XAResource?

You may notice that the semantics of the credit-card pre-authorisation/commit are very similar to the XA 2-phase commit guarantees. It may be possible to provide an implementation of the credit-card gateway connector that can be controlled by the JTA transaction manager. This would give us a true 2-phase implementation, however I have not tried this yet.
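For the curious, the mapping might look something like the sketch below, which implements the standard `javax.transaction.xa.XAResource` interface on top of a made-up `Gateway` interface. It is untested against any real transaction manager or payment gateway: prepare maps to pre-authorise, commit to capture, and rollback to release. The big omission is recovery; a usable version would have to persist authorisation codes so that `recover` could report in-doubt charges after a crash.

```java
import java.util.HashMap;
import java.util.Map;
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

class PreAuthXAResource implements XAResource {
    // Hypothetical gateway API, not a real library.
    interface Gateway {
        String preAuthorise(String cardToken, long amountCents) throws Exception;
        void capture(String authCode);
        void release(String authCode);
    }

    private final Gateway gateway;
    private final String cardToken;
    private final long amountCents;
    // Held in memory only, so codes are lost on a crash. Lookup also
    // relies on the Xid's equals/hashCode, which varies by vendor.
    private final Map<Xid, String> authCodes = new HashMap<>();

    PreAuthXAResource(Gateway gateway, String cardToken, long amountCents) {
        this.gateway = gateway;
        this.cardToken = cardToken;
        this.amountCents = amountCents;
    }

    public void start(Xid xid, int flags) { }

    public void end(Xid xid, int flags) { }

    // Phase 1: reserve the funds; a decline becomes a rollback vote.
    public int prepare(Xid xid) throws XAException {
        try {
            authCodes.put(xid, gateway.preAuthorise(cardToken, amountCents));
            return XA_OK;
        } catch (Exception declined) {
            throw new XAException(XAException.XA_RBROLLBACK);
        }
    }

    // Phase 2: complete the charge using the saved code.
    public void commit(Xid xid, boolean onePhase) throws XAException {
        if (onePhase) {
            prepare(xid); // no separate prepare happened in this mode
        }
        String code = authCodes.remove(xid);
        if (code == null) {
            throw new XAException(XAException.XAER_NOTA);
        }
        gateway.capture(code);
    }

    // Phase 2: cancel the reservation, if there was one.
    public void rollback(Xid xid) {
        String code = authCodes.remove(xid);
        if (code != null) {
            gateway.release(code);
        }
    }

    public void forget(Xid xid) {
        authCodes.remove(xid);
    }

    // Nothing is persisted in this sketch, so nothing can be recovered.
    public Xid[] recover(int flag) {
        return new Xid[0];
    }

    public boolean isSameRM(XAResource other) {
        return other == this;
    }

    public int getTransactionTimeout() {
        return 0;
    }

    public boolean setTransactionTimeout(int seconds) {
        return false;
    }
}
```

Whether a given transaction manager would drive a resource like this correctly, particularly around timeouts and recovery, is something that would need to be verified before trusting it with real money.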
