Why the @#$% is Crowd so slow?

Let’s back up a bit. So you have this awesome application that you just wrote. It does everything it should and does it relatively well. Now you want it to join the party and get some SSO-goodness with centralised user management with other cool apps. Enter Crowd.
Rather than use your own local user database, you write connectors to hook your application up to Crowd. Crowd is essentially middleware with a UI. It delegates calls to back-end user repositories and provides your application with a unified view of multiple directories. This lets you hand off some user data storage and authentication, and gives you integration with a host of LDAP-based directories. Sweet.
After a while you realise that your once slick user-management stack is laggy. You start drawing diagrams in your mind. You may have gone from:
To this:
What do you see? An extra network call. SOAP serialisation and deserialisation. A user repository that behaves slightly differently, and probably suboptimally, for the queries you are used to writing. A single point of failure. A bottleneck.
Great. Crowd Sucks.
Not entirely. We definitely have shortcomings. We know about them and have some idea of how to address many of them, but with 2.5 developers on board we need to pick our battles. For Crowd 1.6, one of those battles was investigating caching.

Crowd and Caching

We have had caching in the Crowd client libraries for some time now. Although it’s improved over time, the caches are essentially time-out based hashmaps. So if your application calls findPrincipalByName(“bob”), the first call will contact Crowd to get “bob” and subsequent calls will hit the cache. As these caches sit on the client side, they can save the network call, serialisation/deserialisation, and calls to the remote directories.
The obvious problem with this setup is that another application could update the “bob” object, and the cached “bob” would then represent stale data. For example, Jira finds “bob” and puts him in its client cache. Now if Bob updates his email address in Confluence, Jira won’t know until the “bob” it cached times out. The impact isn’t limited to trivialities like email addresses – think about deleting users or assigning/revoking membership of groups. The only way the client cache picks up changes is when the element times out.
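To make the staleness problem concrete, here is a minimal sketch of a time-out based cache in Java. It is not the Crowd client code: `TimeoutCache`, the injected clock and the `HashMap` standing in for the Crowd server are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical time-out based cache; NOT the actual Crowd client implementation. */
class TimeoutCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long loadedAtMillis;

        Entry(V value, long loadedAtMillis) {
            this.value = value;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new HashMap<>();
    private final long timeoutMillis;
    private final LongSupplier clock;   // injected so the example is deterministic
    private final Function<K, V> loader; // stands in for the remote call to the server

    TimeoutCache(long timeoutMillis, LongSupplier clock, Function<K, V> loader) {
        this.timeoutMillis = timeoutMillis;
        this.clock = clock;
        this.loader = loader;
    }

    synchronized V get(K key) {
        long now = clock.getAsLong();
        Entry<V> entry = entries.get(key);
        // Reload only on a miss or once the entry has timed out.
        if (entry == null || now - entry.loadedAtMillis >= timeoutMillis) {
            entry = new Entry<>(loader.apply(key), now);
            entries.put(key, entry);
        }
        return entry.value;
    }
}

class TimeoutCacheDemo {
    public static void main(String[] args) {
        Map<String, String> server = new HashMap<>(); // assumed stand-in for Crowd
        server.put("bob", "bob@old.example.com");
        long[] now = {0};
        TimeoutCache<String, String> jiraCache =
                new TimeoutCache<>(1000, () -> now[0], server::get);

        System.out.println(jiraCache.get("bob")); // first call goes to the server
        server.put("bob", "bob@new.example.com"); // another app updates Bob
        System.out.println(jiraCache.get("bob")); // cache hit: still the old address
        now[0] = 1000;                            // the entry times out
        System.out.println(jiraCache.get("bob")); // reloaded: the new address
    }
}
```

The second lookup returns the old address even though the server already holds the new one, which is exactly the window a longer timeout makes wider.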

Some Numbers: Crowd and Jira

Jira is a beast when it comes to exercising Crowd’s performance. I’m sure it sounded like a good idea at the time, but requiring all users at any one time inherently limits the scalability of Jira. That said, it’s not something that will change overnight (or over weeks). In order to integrate with Jira, Crowd needs a way to give Jira fast access to the collection of all users (and all groups). Client-side caching does this pretty well.
Let’s get some actual data so we can get a feel for performance. For this experiment, Jira has been hooked up to Crowd, backing off an Active Directory instance with the following standard configuration:
The AD instance contains the following amount of data:
Now let’s turn on Jira’s profiling and examine the time it takes to load the dashboard of a clean instance hooked up to Crowd:
The initial request populates the client caches with all the principals and groups from the Active Directory and takes over 40 seconds. All subsequent requests experience a speedy 400x improvement. Although the speedup seems awesome, and makes our configuration usable, the joy expires when the cache does. So what was our solution? Use big cache timeouts – we’re talking over 2 hours. Some customers have even asked us whether it’s safe to configure the cache to be eternal … talk about stale data.
We all know that increasing the cache timeout just buries the problem. One request, every so often, is going to be pretty unhappy.

The ideal solution

The ideal solution would be for Crowd to notify clients when mutations occur so that the clients could update their caches. So when Confluence calls Crowd to update Bob’s email address, Crowd pings the other affected applications to notify them of the mutation. It would require a bit of effort to set up Crowd so that it can support 2-way communication, but it could work quite well.
The problem is that not all mutations are executed by Crowd. It is possible for customers to mutate the back-end servers directly without using Crowd. Say a sysadmin creates “bill” using an LDAP thick-client: Crowd won’t know “bill” was created unless you tell Crowd to execute a search. Crowd doesn’t monitor mutations from remote directories.
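The notification idea, and the hole in it, can be sketched in a few lines of Java. Everything here is hypothetical (`MutationListener`, `NotifyingServer`, `NotifiedClientCache`); Crowd 1.6 does not ship a client notification API, and a plain `HashMap` plays the part of the back-end directory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical callback a client would expose to the server. */
interface MutationListener {
    void userChanged(String username);
}

/** Hypothetical server that notifies clients about mutations made THROUGH it. */
class NotifyingServer {
    private final Map<String, String> directory; // username -> email, stand-in for LDAP
    private final List<MutationListener> listeners = new ArrayList<>();

    NotifyingServer(Map<String, String> directory) {
        this.directory = directory;
    }

    void register(MutationListener listener) {
        listeners.add(listener);
    }

    String findEmail(String username) {
        return directory.get(username);
    }

    void updateEmail(String username, String email) {
        directory.put(username, email);
        for (MutationListener listener : listeners) {
            listener.userChanged(username); // ping every registered client
        }
    }
}

/** Client cache with no timeout: entries are evicted only when the server says so. */
class NotifiedClientCache implements MutationListener {
    private final NotifyingServer server;
    private final Map<String, String> cache = new HashMap<>();

    NotifiedClientCache(NotifyingServer server) {
        this.server = server;
    }

    String findEmail(String username) {
        return cache.computeIfAbsent(username, server::findEmail);
    }

    @Override
    public void userChanged(String username) {
        cache.remove(username);
    }
}

class NotificationDemo {
    public static void main(String[] args) {
        Map<String, String> directory = new HashMap<>();
        directory.put("bob", "bob@old.example.com");
        NotifyingServer crowd = new NotifyingServer(directory);
        NotifiedClientCache jira = new NotifiedClientCache(crowd);
        crowd.register(jira);

        System.out.println(jira.findEmail("bob"));       // loaded and cached
        crowd.updateEmail("bob", "bob@new.example.com"); // e.g. Confluence, via the server
        System.out.println(jira.findEmail("bob"));       // evicted, so the new address
        directory.put("bob", "bob@direct.example.com");  // sysadmin bypasses the server
        System.out.println(jira.findEmail("bob"));       // no notification: stale again
    }
}
```

The last lookup is the catch: a change made straight against the directory never reaches the listeners, so the client cache is stale with no timeout to rescue it.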

Caching in Crowd 1.6

So the first step is to implement such monitoring. If you’ve done any LDAP-related programming, you will understand my pain. Each LDAP server either implements its own version of the LDAP spec in some obscure, undocumented manner, or totally disregards existing specs and invents its own. The end result is: we’ve been able to accurately monitor Microsoft Active Directory and ApacheDS for remote directory mutations. We have also been able to monitor Novell eDirectory and Sun’s OpenDS, but as we haven’t done thorough testing on these two directories, 1.6 will not support the monitoring feature for them.
Some directories natively support event notification (eg. ApacheDS) whereas others require polling for changes (eg. Active Directory). In order to consistently identify changes, we have implemented a cache on the Crowd server side to store a representation of the entities in the remote directory. This cache serves two purposes: to help us detect ‘change’ and, more importantly, to reduce the number of calls to the back-end server, so that calls such as “find me all the users and groups in the universe” take less time because the answer can be determined locally.
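For directories that have to be polled, one way a cached representation helps detect ‘change’ is by comparing it with a fresh snapshot. The sketch below is a naive full comparison under assumed names (`DirectoryDiff`, username-to-email maps); it is not how Crowd's monitors actually work, and a real poller would want something cheaper than fetching everything on each pass.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical result of comparing the cached view with a freshly polled snapshot. */
class DirectoryDiff {
    final Set<String> added = new TreeSet<>();
    final Set<String> removed = new TreeSet<>();
    final Set<String> changed = new TreeSet<>();

    private DirectoryDiff() {
    }

    /** Both maps are username -> email; a simplification of real directory entries. */
    static DirectoryDiff between(Map<String, String> cached, Map<String, String> current) {
        DirectoryDiff diff = new DirectoryDiff();
        for (Map.Entry<String, String> entry : current.entrySet()) {
            String name = entry.getKey();
            if (!cached.containsKey(name)) {
                diff.added.add(name);
            } else if (!Objects.equals(cached.get(name), entry.getValue())) {
                diff.changed.add(name);
            }
        }
        for (String name : cached.keySet()) {
            if (!current.containsKey(name)) {
                diff.removed.add(name);
            }
        }
        return diff;
    }
}

class DirectoryDiffDemo {
    public static void main(String[] args) {
        Map<String, String> cached = new HashMap<>();
        cached.put("bob", "bob@old.example.com");
        cached.put("carol", "carol@example.com");

        Map<String, String> polled = new HashMap<>();
        polled.put("bob", "bob@new.example.com");  // changed remotely
        polled.put("bill", "bill@example.com");    // created with a thick client

        DirectoryDiff diff = DirectoryDiff.between(cached, polled);
        System.out.println("added=" + diff.added
                + " removed=" + diff.removed
                + " changed=" + diff.changed);
    }
}
```

Each of the three sets maps naturally onto an event (created, deleted, updated), which is the same shape of information a directory with native notifications would push.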
The server side cache has been implemented with these two key properties:

  1. Be lazy: never load any data it doesn’t immediately need. This means apps that don’t need the entire userbase (ie. not Jira) can still be zippy.
  2. Be up-to-date: hook the cache up to the remote directory monitors so that the cache appears synchronised to the underlying directories (ie. eternal cache).
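A rough Java sketch of how those two properties can coexist follows. `LazyDirectoryCache` and its callbacks are made-up names, and the decision to ignore change events for entries that were never loaded is my assumption about what "lazy" implies, not a description of Crowd's internals.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical lazy, monitor-driven cache; entries never time out. */
class LazyDirectoryCache {
    private final Map<String, String> loaded = new HashMap<>(); // username -> email
    private final Function<String, String> directoryLookup;     // stand-in for an LDAP call
    private int directoryCalls = 0;

    LazyDirectoryCache(Function<String, String> directoryLookup) {
        this.directoryLookup = directoryLookup;
    }

    /** Be lazy: only go to the directory for entries somebody actually asks for. */
    synchronized String find(String username) {
        if (!loaded.containsKey(username)) {
            directoryCalls++;
            String email = directoryLookup.apply(username);
            if (email == null) {
                return null; // unknown user: nothing to cache
            }
            loaded.put(username, email);
        }
        return loaded.get(username);
    }

    /** Be up-to-date: called by a (hypothetical) directory monitor. */
    synchronized void onRemoteChange(String username, String newEmail) {
        // Refresh only what is already cached; unloaded entries stay unloaded.
        loaded.computeIfPresent(username, (name, oldEmail) -> newEmail);
    }

    synchronized void onRemoteDelete(String username) {
        loaded.remove(username);
    }

    synchronized int size() {
        return loaded.size();
    }

    synchronized int directoryCalls() {
        return directoryCalls;
    }
}
```

With this shape, an application that only ever looks up a handful of users pays for a handful of directory calls, while the monitor keeps those few entries current without any timeout.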

Although we didn’t aim to achieve the lofty goal of client-side event-driven caching for Crowd 1.6, we’ve established much of the required structure on the server side to allow for event detection and directory caching.

Numbers revisited: Crowd and Jira

So how does server-side caching affect the dashboard loading situation? These are the results we obtain when we repeat the experiment with Crowd server-side directory caching (DC) enabled for Active Directory:
Let’s examine the results one by one:

  • Initial request: requests are made by Jira to find all the principals and groups from Crowd. As Crowd’s directory cache is unpopulated, the request goes through to AD. Surprisingly, using directory caching significantly improves this process as intermediate results such as group members can be retrieved from the cache after all principals are retrieved, resulting in a 30 second speedup.
  • Subsequent request: the next request will result in a cache hit inside Jira’s local cache with or without directory caching. The client libraries don’t even need to call Crowd.
  • Post-Timeout request: once the Jira cache expires, Jira must make a call to Crowd to retrieve the current state of the user repository. Using directory caching allows Crowd to serve a cached copy of all the principals and groups back to Jira, resulting in a 40x speedup when a client cache miss occurs.

What do these results say?

  1. Client-side caching cannot be disregarded: saving the network call, serialisation, cache lookups and clones saves an order of magnitude of time (100ms vs 1000ms).
  2. Server-side caching lets us use a smaller timeout on the client-side, time-out based cache, and so maintain a less stale view of the user repository: saving the network call to the Active Directory server and the associated directory-side processing also saves an order of magnitude of time (1000ms vs 10,000ms).
  3. Server-side caching allows group lookup calls to be faster: this is because the group membership mapper can use the cache where the data is available and load the elements it requires on demand. This improvement is about half an order of magnitude.

Next time this instance of Jira experiences a cache time out, we’ll be waiting for 1 second and not 40.
So we’re heading in the right direction 😉

Future Work

Short-term improvements:

  • Expand directory monitoring support: investigate how to monitor remote mutations in the plethora of directories we support.
  • Allow directory caching for directories that don’t support monitoring: if a customer knows that Crowd is the only application modifying the directory, then there is no need to monitor for remote mutations. In this case we can benefit by using a directory cache even if the directory doesn’t support monitoring.
  • Variant cache implementation: use a transactional database or do some fancy lock striping if we experience contention lag.

Long-term improvement: obviously, investigate event-driven caching for client caches 🙂
