An early symptom of software sprawl is a run of post-incident reviews (PIRs) that name an upstream change as the root cause of an incident. A growing number of microservices and a higher volume of change can strain existing norms around developer collaboration and the coordination of change. Even a small increase in change frequency compounds once an application is decomposed: as an illustration, if one modernized application moves from a single monthly release to, say, 25 microservices that each ship weekly, that is roughly a 100-fold increase in releases per month. It’s no surprise that developers need to adapt the way they collaborate. When collaboration norms fail to scale with the pace of change, incidents in production become more likely.
Creating a non-intrusive way for developers to be aware of upstream and downstream changes is a great way to tame the impact of software sprawl. Within Atlassian, we use Compass – a developer portal that helps teams navigate distributed architectures – to send an in-app notification to development teams about breaking changes to upstream and downstream services. Acknowledging this notification signals to the change initiator that teams responsible for dependent services are aware of the change. This provides an opportunity to collaborate on the change if any issues are expected, reducing the likelihood of unintended impacts in production. Since incidents are bound to happen in a dynamic environment, developer collaboration during an incident is critical to restoring services quickly.
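The notify-and-acknowledge flow described above can be sketched in a few lines. This is a minimal illustration and not how Compass implements it: `ChangeNotice`, `announce`, the `links` map of directly connected services, and the `owners` map are all hypothetical names introduced for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ChangeNotice:
    """A breaking-change announcement sent to teams that own connected services."""

    service: str
    summary: str
    recipients: set[str]
    acknowledged: set[str] = field(default_factory=set)

    def acknowledge(self, team: str) -> None:
        """Record that a notified team has seen the change."""
        if team not in self.recipients:
            raise ValueError(f"{team} was not notified about this change")
        self.acknowledged.add(team)

    def pending(self) -> set[str]:
        """Teams that have not yet confirmed they are aware of the change."""
        return self.recipients - self.acknowledged


def announce(
    service: str,
    summary: str,
    links: dict[str, set[str]],   # hypothetical: service -> directly connected services
    owners: dict[str, str],       # hypothetical: service -> owning team
) -> ChangeNotice:
    """Address a notice to the owners of every service connected to `service`."""
    teams = {owners[other] for other in links.get(service, set())}
    return ChangeNotice(service, summary, teams)


owners = {"payments": "team-pay", "checkout": "team-shop", "invoices": "team-fin"}
links = {"payments": {"checkout", "invoices"}}
notice = announce("payments", "v2 removes the amount_cents field", links, owners)
notice.acknowledge("team-shop")
```

After the call to `acknowledge`, `notice.pending()` tells the change initiator which teams still need a nudge before the change ships.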
In post-incident reviews where an upstream change is the root cause, the time to restore services is commonly extended by the time taken to identify the problematic change, along with the team or person who owns the service. It follows that finding the offending upstream change faster reduces the mean time to restore (MTTR). This is harder in a loosely coupled architecture, where services have a rich dependency hierarchy and the root cause of an incident could be anywhere along the stack. Traditionally, incident responders trawl through logs or change records to identify a change that may have caused an incident. In a dynamic environment, this is like dismantling an ant hill to find the ant that bit you.
Within Atlassian, we use the activity feed in Compass to reduce MTTR. It shows all events for upstream systems, along with details of the team that owns each one. This significantly reduces triage time by supporting developer collaboration during an incident. Incidents will happen, but promptly identifying an upstream change as the root cause is critical to minimizing impact and restoring services quickly.
The activity feed in Compass shows all events for upstream systems, reducing triage time during an incident.
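To make the idea concrete, here is a rough sketch of what an upstream activity feed has to compute: walk the dependency graph from the affected service, then list recent events on anything upstream, newest first, with the owning team attached. It is an assumption-laden illustration rather than the Compass implementation; `Event`, `upstream_services`, `recent_upstream_events`, and the 24-hour `lookback` default are hypothetical.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    service: str
    owner: str          # team that owns the service
    description: str
    at: datetime


def upstream_services(service: str, depends_on: dict[str, set[str]]) -> set[str]:
    """Return every service reachable via `depends_on` edges (breadth-first)."""
    seen = {service}            # start node is marked so cycles terminate
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for dep in depends_on.get(current, set()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen - {service}


def recent_upstream_events(
    service: str,
    depends_on: dict[str, set[str]],
    events: list[Event],
    incident_start: datetime,
    lookback: timedelta = timedelta(hours=24),  # hypothetical default window
) -> list[Event]:
    """Events on upstream services inside the lookback window, newest first."""
    upstream = upstream_services(service, depends_on)
    window_start = incident_start - lookback
    hits = [
        e for e in events
        if e.service in upstream and window_start <= e.at <= incident_start
    ]
    return sorted(hits, key=lambda e: e.at, reverse=True)


depends_on = {"web": {"api"}, "api": {"auth", "db"}, "auth": {"db"}}
t0 = datetime(2024, 1, 10, 12, 0)
events = [
    Event("db", "team-data", "schema migration", t0 - timedelta(hours=2)),
    Event("auth", "team-id", "token format change", t0 - timedelta(hours=30)),
    Event("billing", "team-fin", "deploy", t0 - timedelta(hours=1)),
    Event("api", "team-core", "deploy", t0 - timedelta(minutes=20)),
]
feed = recent_upstream_events("web", depends_on, events, t0)
```

Each entry in `feed` carries both the change and its owner, so a responder can go straight from a suspicious change to the team that can explain or revert it.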
Moving towards a loosely coupled architecture is one of the key ingredients for team productivity and happiness, because it lets teams move independently with a high level of autonomy. Left unchecked, software sprawl can reverse some of these benefits, resulting in a busy but unproductive and unhappy team. A common complaint when speaking with development teams is “everything works fine until we need to engage with another team.” This is amplified when software sprawl becomes an issue. A rapidly expanding and changing environment makes it harder for developers to keep track of whom to engage about upstream or downstream dependencies, which eventually slows delivery and builds frustration for teams trying to deliver at pace.
Hypothetically speaking, say Alpha squad and Beta squad move an identical number of issues and story points to ‘done’ in Jira each week. Alpha squad spends 90 percent of its effort shipping new features to production, while Beta squad spends 30 percent on new features and 70 percent working out how to engage with the many upstream services it depends on. On paper, both squads have the same level of output, but only Alpha is likely to be considered productive. Software sprawl magnifies the need for collaboration between teams. Identifying smart ways for autonomous teams to engage on demand is key to unlocking the power of a loosely coupled environment.
In a rapidly growing and dynamic environment, the ability to self-serve information is important to team productivity and happiness. One way to achieve this is to implement a centralized software component catalog with decentralized management: a single catalog in which each team is responsible for creating and updating the entries for the services it owns. Traditional environments commonly have a centralized catalog that is managed by a specific team or function. However, that model can't keep pace with the rate of change in a distributed environment, and teams end up creating shadow wikis that record whom to engage and how. Within Atlassian, we found that a decentralized approach reduces invisible and wasted effort across teams, improves self-service capabilities, and creates an engagement-on-demand environment. Taming software sprawl by enabling self-serve information on upstream and downstream dependencies is a great way to improve team productivity, with complementary effects on team happiness and engagement.
Compass provides a central location for developer-specific information on the software components they own and depend on.
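As a rough sketch of "centralized catalog, decentralized management", the example below keeps one shared registry but only lets the owning team create or update its own entries, while anyone can look up whom to engage for a dependency. The `Component` and `Catalog` classes, the `contact` field, and the ownership check are hypothetical and are not the Compass data model.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    owner: str                      # team responsible for this entry
    contact: str                    # hypothetical: where to reach the owning team
    depends_on: set[str] = field(default_factory=set)


class Catalog:
    """One shared catalog; each entry can only be changed by its owning team."""

    def __init__(self) -> None:
        self._components: dict[str, Component] = {}

    def register(self, team: str, component: Component) -> None:
        """Create or update an entry on behalf of `team`."""
        if component.owner != team:
            raise PermissionError(f"{team} cannot register a component owned by {component.owner}")
        existing = self._components.get(component.name)
        if existing is not None and existing.owner != team:
            raise PermissionError(f"{component.name} is owned by {existing.owner}")
        self._components[component.name] = component

    def who_to_engage(self, name: str) -> dict[str, str]:
        """Map each catalogued direct dependency of `name` to its owner's contact."""
        component = self._components[name]
        return {
            dep: self._components[dep].contact
            for dep in component.depends_on
            if dep in self._components
        }


catalog = Catalog()
catalog.register("team-id", Component("auth", "team-id", "#team-id-help"))
catalog.register("team-core", Component("api", "team-core", "#team-core-help", {"auth"}))
contacts = catalog.who_to_engage("api")
```

The design choice worth noting is that write access follows ownership while read access is open to everyone, which is what removes the need for each team to keep its own shadow list of contacts.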