Multi-region KMS encryption, at scale

Atlassian runs at scale.

Our cloud products (and the infrastructure behind them) run across many thousands of compute nodes, in many data centres, all around the world.

Our cloud products handle private and sensitive data, entrusted to us by our customers. It's critical that our customers can trust Atlassian to protect their data.

This blog post tells a little of that story: how Atlassian uses encryption, at scale, to protect your data, the challenges we faced along the way, and the tech we had to build to get us there.

Background

Our cloud products run on AWS, and our cloud architecture is composed of hundreds of different microservices. Our sensitive data is spread across many AWS data stores – DynamoDB, RDS, S3, SQS, Kinesis, SWF, etc. Data within these stores will generally be encrypted at rest by AWS KMS keys. This means that, on disk (or within whatever at-rest persistence exists), that data will always be written encrypted, and it ensures that exfiltration of the raw data store (eg. gaining access to the physical disks or memory that the data resides on) won't be of any use to an attacker.

This is, of course, industry practice. However, it provides no defence against many other types of data exfiltration, such as:

  • a failure to restrict access to the at-rest data store. For example, leaving an S3 bucket configured as publicly accessible. It seems obvious, but many incidents like this have made it into the media in the last few years. Not from us, though! Thankfully, Atlassian has well-established guardrails to defend against this.
  • an authorised application doing something unsafe with restricted data at runtime. Many runtime operations can unintentionally expose sensitive data. For example, a generic logger could dump all query results. Or application logic could send unnecessary additional data to another internal service. Or just the act of taking a JVM heap dump on a live node. All of these operations risk inadvertently exfiltrating sensitive data, regardless of encryption at rest.
  • legitimate access to data stores by staff for debugging purposes, or to resolve incidents.

Encryption at rest is a good start, but it does not prevent accidental exposure of data.

Application level encryption

The best way to protect sensitive data is to apply application-level encryption (ALE) wherever possible. With ALE, sensitive data is encrypted before storage and only decrypted when required (ie. at the point of use, in the application code). An attacker who gains access to the data store (or, more commonly, to a historic replica of it, such as a backup stored in a less secure location) does not automatically gain access to your sensitive data.

Atlassian is committed to applying the highest possible standards for securing our customer data, and naturally, we look to enable and encourage the use of ALE wherever possible. However, when deciding to adopt ALE and encrypt sensitive data within application code, there are a number of challenges to address. For example, there will be application design changes, such as:

  • parts of your data will no longer be accessible as plain text in your storage system (eg. RDS), and that data becomes inaccessible via traditional access methods such as SQL queries. This may be exactly what you want (eg. for social security numbers or credentials), but significant application re-design is likely to be required if you intend to apply ALE to data where text-based searchability (including sorting and filtering) has to be maintained.

Engaging ALE also creates significant operational concerns, for example:

  • ensuring that you never lose the ability to decrypt your data.
  • protecting the integrity of your encryption keys. When encryption and decryption are done within your application (instead of within the data store), your encryption keys are in memory and at risk of compromise and leakage (eg. inadvertent logging, or heap dumps). Obviously, a compromise of your encryption keys will compromise all data encrypted with those keys.
  • managing the performance impact. Performing encryption on individual pieces of data adds significant computational effort to your application, which will need to take this into account and adopt caching techniques to manage it.

Envelope encryption

An industry-standard encryption mechanism that provides a good basis to address the operational concerns of ALE is envelope encryption, which is described in detail in the AWS KMS Developer Guide:

“Envelope encryption is the practice of encrypting plaintext data with a data key, and then encrypting the data key under another key”

When you use envelope encryption:

  • a data key is used to perform the encryption operations on any given piece of plaintext data
  • a root key is used to encrypt the data key
  • the encrypted ciphertext and the encrypted data key are bundled together in what is called an “envelope encrypted payload”, and that is what is persisted

Under this system, you can never lose access to your encrypted data as long as your root key remains accessible. Envelope encryption is better than using the root key directly to encrypt the plaintext, because:

  • each data key is only used for a small subset of your data. If any data key is inadvertently exposed, only a tiny subset of your encrypted data would be at risk of exposure.
  • the encryption materials (ie. the data key and its encrypted copy) can be re-used across multiple encryption requests, but also rotated based on maxUsage or maxAge policies.
  • with caching plus regular rotation of your encryption materials, your root keys can be kept out of application memory (where they are at risk of exposure), and you don’t need to incur a network call for most of your cryptographic operations – making a network call per operation would not scale to meet even the most basic performance needs.
  • envelope encryption allows for root key redundancy. You can encrypt the data key against multiple root keys, storing an encrypted copy for each root key. An envelope encrypted payload will remain decryptable as long as any root key remains accessible.
  • it allows for fast symmetric encryption algorithms to be used, but you still retain fine-grained control of encryption and decryption rights by controlling access to the root key. Symmetric algorithms are super fast compared to asymmetric cryptography (and therefore, symmetric algorithms are preferable at scale). However, with symmetric algorithms, once you have access to the encryption key you can perform encryptions and decryptions. You may want tighter control than this, for example, using encryption as a verification of data authenticity. By restricting the identities that can perform encryptions or decryptions against the root key, envelope encryption allows you to retain this control, yet without the performance penalties of engaging asymmetric cryptography on your data.
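
To make these mechanics concrete, here's a minimal sketch of envelope encryption against a single KMS root key, using the AWS SDK for Java v2. The key ARN and the payload layout are illustrative assumptions, not a production format:

```java
import software.amazon.awssdk.services.kms.KmsClient;
import software.amazon.awssdk.services.kms.model.DataKeySpec;
import software.amazon.awssdk.services.kms.model.GenerateDataKeyResponse;

import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

public class EnvelopeEncryptionSketch {
    public static void main(String[] args) throws Exception {
        String rootKeyArn = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE"; // illustrative
        byte[] plaintext = "sensitive customer data".getBytes(StandardCharsets.UTF_8);

        try (KmsClient kms = KmsClient.create()) {
            // 1. Ask KMS for a fresh data key: we receive the plaintext key
            //    plus a copy encrypted under the root key.
            GenerateDataKeyResponse dk = kms.generateDataKey(b -> b
                    .keyId(rootKeyArn)
                    .keySpec(DataKeySpec.AES_256));
            byte[] dataKey = dk.plaintext().asByteArray();
            byte[] encryptedDataKey = dk.ciphertextBlob().asByteArray();

            // 2. Encrypt the plaintext locally with the data key (fast, symmetric).
            byte[] iv = new byte[12];
            new SecureRandom().nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(dataKey, "AES"),
                    new GCMParameterSpec(128, iv));
            byte[] ciphertext = cipher.doFinal(plaintext);

            // 3. Persist {encryptedDataKey, iv, ciphertext} together; that bundle is
            //    the "envelope encrypted payload". The plaintext data key is then
            //    discarded; decrypting later requires a KMS Decrypt call to unwrap it.
        }
    }
}
```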

The AWS Encryption SDK

Envelope encryption is well-supported by the AWS Encryption SDK. This SDK provides implementations to:

  • create encryption materials (the data keys, plus their encrypted copies)
  • encrypt plaintext bytes into an envelope encrypted payload
  • decrypt an envelope encrypted payload back into plaintext bytes
  • support various types of root keys, including, critically, AWS KMS keys
  • provide caching of cryptographic materials (ie. the data key and its encrypted copies), with the ability to enforce usage limits (eg. age, usage count, or byte count based) before enforcing revalidation/regeneration of materials
  • enforce an encryption context … a concept that we haven't covered so far …

Encryption Context

An encryption context is a set of additional metadata about the encrypted materials which can be used as additional authenticated data (AAD) during cryptographic operations. When an encryption context is in use, having access to the encryption keys is not enough to perform a decryption. A decryptor must also specify the encryption context that was used at encryption time for the decryption to succeed.

This is used to prevent authorised decryptors from being tricked into performing illegitimate decryptions (eg. an attacker with access to the underlying data store might copy encrypted data associated with one customer to the records associated with a separate customer under their own control, and then use the system as normal to access that data). This is known as a confused deputy attack and is generally mitigated by the use of an encryption context which is built, at decryption time, from the request context in some form.

Encryption context is highly desirable to assist in securing ALE implementations.
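
To illustrate the mechanics with direct KMS calls (AWS SDK for Java v2, with an illustrative key ARN): when a symmetric KMS key encrypts under an encryption context, decryption only succeeds if the identical context is supplied.

```java
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.kms.KmsClient;
import software.amazon.awssdk.services.kms.model.DecryptResponse;
import software.amazon.awssdk.services.kms.model.EncryptResponse;

import java.util.Map;

public class EncryptionContextSketch {
    public static void main(String[] args) {
        String keyArn = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE"; // illustrative
        Map<String, String> context = Map.of("customerId", "customer-1234");

        try (KmsClient kms = KmsClient.create()) {
            EncryptResponse enc = kms.encrypt(b -> b
                    .keyId(keyArn)
                    .plaintext(SdkBytes.fromUtf8String("secret"))
                    .encryptionContext(context));

            // Succeeds: the identical context is presented at decryption time.
            DecryptResponse dec = kms.decrypt(b -> b
                    .ciphertextBlob(enc.ciphertextBlob())
                    .encryptionContext(context));
            System.out.println(dec.plaintext().asUtf8String());

            // Presenting a different context (eg. another customerId) would make
            // the decrypt call fail with InvalidCiphertextException, which is what
            // defeats the confused deputy attack described above.
        }
    }
}
```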

So, using the AWS Encryption SDK, implementing ALE should be as simple as:

  • creating some KMS keys
  • assigning the appropriate encryption and decryption rights to those KMS keys (via AWS IAM roles, in our case)
  • using the AWS Encryption SDK to perform envelope encryptions against those KMS keys, using, say, our internal customerIds as our encryption context
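
As a rough sketch (using the AWS Encryption SDK for Java, with an illustrative key ARN and context), that flow might look like:

```java
import com.amazonaws.encryptionsdk.AwsCrypto;
import com.amazonaws.encryptionsdk.CryptoResult;
import com.amazonaws.encryptionsdk.kms.KmsMasterKey;
import com.amazonaws.encryptionsdk.kms.KmsMasterKeyProvider;

import java.nio.charset.StandardCharsets;
import java.util.Map;

public class AleSketch {
    public static void main(String[] args) {
        AwsCrypto crypto = AwsCrypto.standard();
        KmsMasterKeyProvider keys = KmsMasterKeyProvider.builder()
                .buildStrict("arn:aws:kms:us-east-1:111122223333:key/EXAMPLE"); // illustrative

        // Our internal customerId becomes the encryption context.
        Map<String, String> context = Map.of("customerId", "customer-1234");

        // Encrypt: produces a complete envelope encrypted payload
        // (ciphertext plus the encrypted data key).
        byte[] envelope = crypto.encryptData(keys,
                "sensitive data".getBytes(StandardCharsets.UTF_8), context).getResult();

        // Decrypt: the SDK unwraps the data key via KMS, then we verify that the
        // context stored in the payload matches what the request context expects.
        CryptoResult<byte[], KmsMasterKey> result = crypto.decryptData(keys, envelope);
        if (!"customer-1234".equals(result.getEncryptionContext().get("customerId"))) {
            throw new IllegalStateException("unexpected encryption context");
        }
        byte[] plaintext = result.getResult();
    }
}
```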

But (obviously, because we've written this super-long blog post!) for Atlassian, it's not actually that simple. It turns out that we cannot deploy ALE at our kind of scale using the default implementations of envelope encryption in the AWS Encryption SDK.

The limitations of the AWS Encryption SDK

Multi-region limitations

The AWS Encryption SDK has been well-designed, and is pretty good overall! However, there are problems with it for systems that run at Atlassian's scale and that have Atlassian's multi-region requirements. The AWS Encryption SDK has not really been designed around use cases that are heavily multi-region, or primarily cross-region.

To achieve regional redundancy for our envelope encrypted payloads, Atlassian uses root keys spread across several AWS regions. We also don’t have root keys in every AWS region we run in (so, in some regions, our applications must go cross-region for all root key operations). However:

  • in the event of KMS or network outages, encryption must remain possible in unaffected regions. Problems in one (or more) AWS regions cannot be allowed to impact our applications in other AWS regions. The AWS Encryption SDK does not support this – if you attempt to create encryption materials backed by 3 KMS regions, and if any one of those regions is not available, then your encryption will fail!
  • similarly, the decryption code paths in the AWS Encryption SDK don’t apply any useful precedence rules to the order in which decryption attempts will be made against the available set of root keys. We found the root keys to be simply iterated in encounter order from the envelope encrypted payload. There was not even a simplistic “local region first” policy! (Both behaviours are sketched below.)
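
For example, here is a sketch (ARNs illustrative) of a provider backed by root keys in three regions, using the AWS Encryption SDK for Java. The encrypt call must reach every region to wrap the data key under each root key, so a single regional outage fails the whole operation:

```java
import com.amazonaws.encryptionsdk.AwsCrypto;
import com.amazonaws.encryptionsdk.kms.KmsMasterKeyProvider;

import java.nio.charset.StandardCharsets;
import java.util.Map;

public class MultiRegionSketch {
    public static void main(String[] args) {
        AwsCrypto crypto = AwsCrypto.standard();

        // One root key per region (all ARNs illustrative).
        KmsMasterKeyProvider keys = KmsMasterKeyProvider.builder().buildStrict(
                "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-1",
                "arn:aws:kms:us-west-2:111122223333:key/EXAMPLE-2",
                "arn:aws:kms:eu-west-1:111122223333:key/EXAMPLE-3");

        // Encryption wraps the data key under all three root keys. If any one
        // region is unreachable, this call throws; there is no "best effort"
        // mode that continues with whichever regions remain healthy.
        byte[] envelope = crypto.encryptData(keys,
                "data".getBytes(StandardCharsets.UTF_8),
                Map.of("customerId", "customer-1234")).getResult();

        // Decryption tries the encrypted data keys in encounter order from the
        // payload, with no "local region first" preference.
        byte[] plaintext = crypto.decryptData(keys, envelope).getResult();
    }
}
```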

These points are major impediments to using the AWS Encryption SDK in a truly multi-region application. We can’t assume that network connectivity will continually exist between all AWS regions. And nor can we assume that the KMS service will remain 100% operational in all AWS regions at all times.

But … AWS supports multi-region keys (MRKs)? Don't they solve multi-region problems?

No, for ALE and envelope encryption, unfortunately, they don't.

MRKs give you the ability to have the same root key materials used in another KMS key in another AWS region. That is, data encrypted by an MRK in one region can be replicated (raw) into another AWS region and successfully decrypted by a sibling MRK in that target region. This is very useful for encryption-at-rest use cases (where the KMS keys used by the data store must remain region-local) and where data replication into alternate AWS regions is needed.

For example, MRKs allow for bulk S3 data replication of encrypted data within S3 buckets into alternate regions without re-encryption. The contents of the S3 bucket can be replicated (raw) to any alternate region, as long as a sibling copy of the MRK exists in the target region. With MRKs, all data remains decryptable in the target region against a completely different KMS key, because the key materials behind that MRK remain the same.

For envelope encryption, MRKs are not actually of much use. The only real benefit would be that, if you could infer which local MRK to use, you could avoid storing N copies of the encrypted data key. But this isn't an important optimisation, assuming that the storage space for the encrypted data keys is small compared to the encrypted ciphertext itself. However, MRKs could potentially be useful to support future region expansion: if you wish to support additional local KMS regions in the future, those regions could be added at a later date, provided the AWS Encryption SDK had a way to infer that it should use an MRK from the local region rather than the region found in the envelope encrypted payload.

Performance limitations

The AWS Encryption SDK is heavily focused (and rightly so!) on the correctness of its cryptographic operations. However, strict correctness and the default cache implementations aren't necessarily suitable for large-scale multi-region use cases, such as Atlassian's. For example:

  • if you declare that you want encryption materials (ie. data keys) to be used no more than 50,000 times (or for no more than 24 hours), then the caches within the AWS Encryption SDK will strictly enforce those limits. Encryption materials will not be used for the 50,001st time, even if replacement encryption materials can’t currently be generated. In that case, encryption will fail, even when your application would prefer to tolerate over-used materials (see the sketch after this list).
  • encryption materials generation occurs on-demand, and is synchronous. It operates iteratively through the root keys, which will lead to very poor performance when multiple cross-region requests are required – synchronous encryption latencies of 5-15s (or more) are possible.
  • the cryptographic materials caches provided by the AWS Encryption SDK will shard the cache entries by encryption context. In systems like ours that need millions of separate encryption contexts (one per customer site), but which must achieve high cache hit ratios for performance reasons, a deeply sharded approach simply doesn't work. High cache hit ratios in our cryptographic materials caches are far more important to us than separating cryptographic materials by encryption context.
  • the default AWS Encryption SDK behaviour is to pass the encryption context through to the requests made to AWS KMS root keys (on cache misses), where it can be incorporated into CloudTrail logs. However, this is not strictly necessary, and it triggers slow asymmetric trailing-signature algorithms to run at both encryption and decryption time.
  • the decryption logic in the AWS Encryption SDK doesn't apply any intelligence about which root keys to query. When there are multiple root keys to choose from, and most (or all) are going to be cross-region requests, then more sophisticated algorithms are needed. Request latency and failure rates to each of the KMS regions involved should be taken into account to prioritise which root keys to select first.
  • the cryptographic materials caches are node local, and can’t be distributed in any way. Yet, distributed caches are critical to maintaining high cache hit ratios across our services containing many hundreds of compute nodes.
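
For reference, configuring the SDK's built-in caching looks roughly like this (values illustrative). The limits below are enforced strictly: once materials exceed their max age or use count, the cache refuses to serve them, even if no replacement can currently be generated:

```java
import com.amazonaws.encryptionsdk.AwsCrypto;
import com.amazonaws.encryptionsdk.CryptoMaterialsManager;
import com.amazonaws.encryptionsdk.caching.CachingCryptoMaterialsManager;
import com.amazonaws.encryptionsdk.caching.LocalCryptoMaterialsCache;
import com.amazonaws.encryptionsdk.kms.KmsMasterKeyProvider;

import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.TimeUnit;

public class CachingSketch {
    public static void main(String[] args) {
        KmsMasterKeyProvider keys = KmsMasterKeyProvider.builder()
                .buildStrict("arn:aws:kms:us-east-1:111122223333:key/EXAMPLE"); // illustrative

        // A node-local cache; entries are sharded by encryption context, so a
        // per-customer context yields a separate cache entry per customer.
        CryptoMaterialsManager cmm = CachingCryptoMaterialsManager.newBuilder()
                .withMasterKeyProvider(keys)
                .withCache(new LocalCryptoMaterialsCache(500))
                .withMaxAge(24, TimeUnit.HOURS)  // hard limit, not a soft hint
                .withMessageUseLimit(50_000)     // the 50,001st use forces a refresh
                .build();

        byte[] envelope = AwsCrypto.standard().encryptData(cmm,
                "data".getBytes(StandardCharsets.UTF_8),
                Map.of("customerId", "customer-1234")).getResult();
    }
}
```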

The solution – Atlassian's Cryptor library

Atlassian needed a solution to these problems in order to be able to use ALE at scale. After dabbling with the possibility of implementing our own root key provider, we decided to continue with AWS KMS keys (to keep permissions controlled by AWS IAM roles) and build an encryption library we call "Cryptor". This library would be, essentially, a thin wrapper over the AWS Encryption SDK, but adapted to better suit our needs.

We started with a couple of core design goals:

  1. Key Configuration. Our library would avoid our engineers having to care about AWS accounts, key ARNs, or IAM roles. Instead, they would interact with the library by requesting encryption against key aliases. We would separate all the information behind those key aliases (ie. the relevant key ARNs, etc.) into separate key configurations, and provide mechanisms whereby this key configuration can be loaded (and periodically re-loaded) by the library from external sources. The library is configured with all authorised encryption key aliases, and a key benefit of this approach is that it allows us to pre-seed encryption materials caches for those aliases. Our goal is to never block encryption requests. (A sketch of the resulting calling surface follows this list.)
  2. Automated Key Management. At scale, we need to be able to automatically manage all KMS keys (and their encryption/decryption permissions). To achieve this, we model "cryptor key resources" just like any other resource in our systems (eg. a service's DynamoDB or SQS resources). Like other resources we manage, our "cryptor key resources" are declared via service descriptors held in source control. Atlassian's internal PaaS is responsible for translating all resource configurations into Open Service Broker (OSB) API calls to create, update or delete those resources at service deployment time.

    So, we built a key management service to receive these OSB API calls for "cryptor key resources" and translate them into the KMS operations required to create/modify/delete KMS keys and their IAM role authorisations. This service is also responsible for generating our Cryptor key configuration documents, which represent the superset of authorised keys and permissions applicable to each authorised IAM role. These key configuration documents are re-generated whenever permission changes occur and are immediately made available to the Cryptor library through a dynamic control plane. With this system:
    1. our "Cryptor keys" (and their permissions!) are fully managed from auditable source control systems.
    2. the KMS root keys involved have fully managed lifecycles, just like any other resource in our system.
    3. changes in key configuration become visible to the services running Cryptor within seconds across our entire fleet of services.
  3. High Availability. We would take some of the smarts we built for our high-availability TCS system (see "Here's how one of Atlassian's critical services consistently gets above 99.9999% of availability" on the Atlassian Engineering blog) and apply them to the problem of choosing which KMS(s) to query at decryption time. That is, from any compute node, Cryptor will choose to send decryption requests to a "primary" KMS, and sometimes (concurrently) to a random secondary choice. This wastes some KMS requests, but it pre-emptively guards against failures and allows for continuous monitoring of the observed latency and failure rates to all KMS regions. The Cryptor library uses this data to periodically re-evaluate which KMS region should be considered "primary".
  4. Distributed caching. To achieve high cache hit ratios, we want all running instances of Cryptor with identical permissions (ie. IAM roles) to share distributed caches of cryptographic materials (ie. both encryption and decryption materials). But we must protect the integrity of those caches. We do this, via:
    1. periodically importing and exporting local cache contents to an encrypted shared store
    2. encrypting the contents of the shared store with dedicated KMS keys. These cache exports are highly sensitive content and are protected by KMS permissions scoped to each individual IAM role, ie. to decrypt an export, you must be using the exact IAM role that created it.
  5. Soft Limits. We would treat configured usage limits on encryption materials as soft limits (up to a point). We would prefer to continue using "sub-optimal" encryption materials (ie. those that have exceeded their maxUsageCount / maxAge, or which are missing a KMS root key due to the temporary unavailability of one AWS region), instead of failing the encryption request. A corollary of this is that we can allow high-performance caches (such as Caffeine) to return "sub-optimal" cryptographic materials while they perform an asynchronous background refresh, and thus never block request threads on encryption. Obviously, however, a policy of soft enforcement still needs some hard limits, and so we will still enforce sane upper bounds after which cached encryption materials are discarded anyway, plus implement metrics to observe how often "sub-optimal" encryption materials are being used.
  6. Soft enforcement of encryption context, instead of using the default AWS Encryption SDK implementation. This avoids an asymmetric signing penalty and allows cryptographic material re-use across encryption contexts. It also enables our library to support dynamic disablement of encryption context enforcement, to temporarily handle the inevitable cases where data is deliberately imported into alternate contexts and subsequent cross-context decryption attempts are determined to be legitimate. For example, this can happen during Jira site import and/or site sandbox creation, in cases where we miss implementing proper re-encryption steps for ALE-encrypted data. Soft enforcement of encryption context allows us to triage the fault and, if appropriate, temporarily unblock the application while data is re-encrypted with the correct context, after which full enforcement can be re-enabled.
  7. Library or sidecar integration. There's a lot of complex functionality we've built into Cryptor, however not every service within Atlassian uses the same technology stack. So, rather than re-build the exact same functionality across different technologies, we instead wanted to package our library into a sidecar (which is a co-process, run alongside the main application), and expose all library functionality over HTTP and/or gRPC APIs. As long as the sidecar is not exposed to the external network, this is more secure than integrating the library directly into an application (as code and memory remain separate), although it does incur a very slight latency penalty from transiting the local docker network.
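
To give a feel for the first design goal, the calling surface ends up looking something like the hypothetical sketch below. The interface and names are illustrative, not Cryptor's actual API:

```java
import java.util.Map;

// Hypothetical alias-based encryption interface (names are illustrative).
interface Cryptor {
    // Callers reference a key alias only; the key ARNs, regions, and IAM details
    // live in externally loaded (and periodically re-loaded) key configuration.
    byte[] encrypt(String keyAlias, byte[] plaintext, Map<String, String> encryptionContext);

    byte[] decrypt(byte[] envelopeEncryptedPayload, Map<String, String> encryptionContext);
}

class CryptorUsageSketch {
    void storeUserSecret(Cryptor cryptor, String customerId, byte[] secret) {
        // No AWS accounts, ARNs, or IAM roles in application code; just an alias
        // plus the customer-specific encryption context.
        byte[] payload = cryptor.encrypt("my-service-user-data", secret,
                Map.of("customerId", customerId));
        // ... persist the envelope encrypted payload in the service's data store ...
    }
}
```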

So, we went on to build this. And, in the last several years of running it, we are finding that it works pretty darn well…

The results in figures

Currently, across all of Atlassian:

  • we are running more than 12,500 instances of Cryptor
  • our automated Cryptor key management systems are managing more than 1,540 KMS keys, with permissions spread across thousands of IAM roles

Across all deployments of Cryptor:

  • we perform more than 11 billion decryptions per day
    • with around 21M cache misses per day
    • achieving an overall decryption cache hit ratio of 99.8%
  • we perform more than 811 million encryptions per day
    • with less than 150 cache misses per day
    • achieving an overall encryption cache hit ratio of 99.99999%
  • through our caching, we service this request load by sending around 25 million decryption and 349 thousand encryption requests to KMS per day (ie. Cryptor cache misses plus background refresh operations)
    • across all of Atlassian, the total cost of these KMS operations comes to around $2,500 per month
    • without Cryptor caching, the total cost of sending all of our cryptographic operations direct to KMS would be pretty close to $1,000,000 per month.

As you can see, our Cryptor library allows Atlassian to deploy ALE at scale without incurring significant cost, performance, or reliability penalties.

The future of Cryptor

Bring Your Own Key encryption

The principles of ALE, plus automated KMS key management and rapid key configuration distribution, are fundamental building blocks for a key part of our enterprise product roadmap – Bring Your Own Key (BYOK) encryption. With BYOK encryption, enterprise customers retain direct control over the KMS root keys that are used to encrypt their data in our cloud.

Our Cryptor library is a key component of BYOK encryption within Atlassian. When a BYOK-enabled customer has their keys onboarded into our automated key management systems, Cryptor's key configuration documents are updated immediately with the additional customer keys. Cryptor is at its core a key selection tool, and through it, requests for ALE encryption will use encryption materials backed by the customers' BYOK keys rather than our own keys.

With BYOK selection information integrated into Cryptor's key configuration documents, most of our ALE-enabled services remain entirely unaware that certain customers are BYOK enabled. The choice of customer-specific encryption materials can be made entirely within Cryptor, through analysis of our existing customer-specific encryption contexts.