Mike Tria is responsible for creating, maintaining, and updating a shared set of hardware and software tools and practices (like CI/CD and DevOps) for Atlassian’s product developers. The platform his team develops serves as a foundation (and an indispensable toolkit) for all the products and services Atlassian offers. Meet Mike and find out how he does it all – including what unique approaches, skills, and experiences he brings to the job.
“Head of Cloud Platform for Atlassian” sounds like a pretty important job. Can you describe what responsibilities that entails?
Mike: Sure. I run all of Atlassian’s cloud infrastructure, including the servers for all of our cloud products.
I also oversee our operations functions, like site reliability engineering – basically keeping things running. We embed site reliability engineers into all of our product teams. My team also manages CI/CD and DevOps for R&D. We build the PaaS that the company builds its products and services on, then we safely deploy it to production – also ensuring its performance (speed), security, reliability, and scalability.
How do you ensure that all products – old and new – “play nicely” together?
Mike: There are two things we keep in mind as we integrate newly acquired products. The first is moving the new product onto enough of our common infrastructure. We’ve invested a lot into building a nice reliable core that we can layer almost any product onto, and you’re going to get a lot of benefits – like scale, built-in logging and monitoring, on-call capabilities and paging, and all the operational benefits. So one of the first things we tend to do is move those products onto our stack.
The second thing is that, frankly, we sometimes learn a fair bit from acquired businesses and products. Sometimes they do certain things better, so we don’t always move a new product onto exactly what we do. We learn from them, and frequently adapt our own practices to incorporate the best of what’s out there. Opsgenie is a pretty good recent example. They’re really good at alerting and paging, so we’ve adapted our own platform based on that, to make our platform better.
What’s your vision of the roadmap for the platform over the months and years ahead?
Mike: In a nutshell, we’re really working to make our platform and infrastructure world-class.
We like to say that we “build security in” to our products, from the ground up. What does that look like on a day-to-day basis for you and your teams?
Mike: It all starts as soon as a developer begins to write code. Security is programmed in from the get-go. We’ve also implemented things like container checks and source clear – where we do internal source code audits – and station and endpoint detection. So we’re doing all these things, checking for vulnerabilities, before the service even gets to our production environment.
We also have a team of “security champions” from across all of our developer teams, who function as a very powerful force-multiplier to our security team. They attend security meetings, they learn about vulnerabilities, they get the roadmap from security. This program essentially turns hundreds of developers within Atlassian into security engineers. You couldn’t get that with a purely centralized security team. So security is really embedded in our DNA. It’s in the new products we build from the ground up, and all of the platform is essentially wired for security. Wherever possible, we take care of security in the platform, so developers don’t have to do that work.
Another important platform-wide tool we use to secure all our products is Atlassian Access. It’s relatively new, but it’s already proven to be hugely popular.
Can you tell us a bit more about that? How does Atlassian Access integrate security into the platform?
Mike: Sure – it’s a very exciting product for us. There’s a whole bunch of things that IT admins and system admins have been asking for, specifically for cloud, that we’ve been a bit slow to deliver. It took us a while to realize that the way to deliver those things efficiently is not to build things like SAML and stuff in each product. We eventually thought, “What if we could build one product that worked for all our customers?” So, if they wanted to use it just for Jira, they could do that. And if they wanted to use it across several of our products, they could do that, too.
We started to think about how we could build a product that operates purely at user level. You assign your users to it, and we give you all the IT and security features you want. Then we support, essentially, an infinite number of products on it, and that became Atlassian Access.
We deliver a lot of highly requested security functionality through Access, and that’s been great for our customers, because it allows us to build those features faster. And it’s given our customers a lot of control over their specific environments to set things up in a particular way for their unique compliance or security needs.
What are the most significant or unique things your team does to ensure the reliability of our products?
Mike: One, we have a dedicated site reliability engineering team whose sole focus is the reliability of cloud products. And, just like with our “Security Champions” program, there’s both a core team and a group of engineers embedded in all the product teams.
We also have a unique incident management process, which we developed using our own products – Jira to log incidents, Opsgenie to alert the right people, Confluence to keep track of what we’re doing to address the incidents, and Bitbucket to zero in on the changes that may have caused the incidents in the first place. Then we generate a post-incident review to make sure the same incident never happens again. We’ve been able to automate much of that process to ensure we can do it at scale, because we have a lot of products, and a lot of users.
We’ve also baked a lot of things into our platform that make our products more reliable by default. For example, we have a progressive deployment system that rolls changes out in stages and stops them if we see faults. So it’s all these processes – many of which are automated – that really set us apart.
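The progressive deployment idea Mike mentions can be pictured with a small sketch. This is an illustration only, not Atlassian's actual system: the function name `progressive_rollout`, the stage percentages, the `deploy` and `error_rate` callbacks, and the 1% fault threshold are all assumptions made for the example.

```python
from typing import Callable, Optional, Sequence, Tuple


def progressive_rollout(
    stages: Sequence[int],
    deploy: Callable[[int], None],
    error_rate: Callable[[int], float],
    max_error_rate: float = 0.01,  # hypothetical fault threshold
) -> Tuple[bool, Optional[int]]:
    """Deploy to each stage in order and halt as soon as faults appear.

    `stages` is a list of traffic percentages (hypothetical, e.g. 1, 10, 50, 100).
    `deploy` pushes the change to that share of traffic.
    `error_rate` reports the observed error rate at that stage.
    Returns (completed, last_stage_reached).
    """
    reached: Optional[int] = None
    for pct in stages:
        deploy(pct)
        reached = pct
        if error_rate(pct) > max_error_rate:
            # Faults observed: stop here instead of widening the rollout.
            return False, reached
    return True, reached


if __name__ == "__main__":
    log = []
    # Simulated fault that only shows up once 10% of traffic is on the new version.
    result = progressive_rollout(
        [1, 10, 50, 100],
        deploy=log.append,
        error_rate=lambda pct: 0.2 if pct >= 10 else 0.0,
    )
    print(result, log)
```

In this sketch the rollout stops at the first stage where the observed error rate exceeds the threshold, so later stages never receive the faulty change. A real system would presumably also roll back and alert the on-call engineer, which is left out here.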
Tell us about some of the most important things you learned from your pre-Atlassian experiences, and some of the skills you’ve brought to your work here that help you succeed in your current role.
Mike: First, I had to learn how cloud works. I initially had a completely incorrect idea of what cloud was. I was actually a bit of a cloud naysayer.
What changed things for me was scale. All’s fine when you’re managing just five servers for a few thousand customers. But once you really have to scale, you start having to hire two engineers a quarter to swap out hard disks for RAID failures and stuff. I had to learn that the failures I’d grown accustomed to were not, in fact, laws of nature. They could be eliminated.
Another thing I frequently experienced, frankly, was failure. A LOT of failure. I’ve brought down databases and production systems. I’ve screwed up releases… I could go on. I’ve basically failed in every way you can imagine, from an operations and reliability perspective. But I’ve grown to appreciate all those experiences, because they teach you where the boundaries are, and how to respect the system and the customer. The people who’ve had to recover a system – I trust them to manage that system better than anybody else, because they’ve felt the sting of failure. And I’ve gone through this so many times that I now have a pretty good sense of what’s safe, what you need to automate, what you need people to do – and I’ve applied all of that to my work here at Atlassian.
At startups, specifically, there aren’t a whole lot of established processes, and that can be both good and bad. It’s extremely liberating when you’re innovating, where it’s just like, “What do you want to build? Go! You don’t need to talk to me. Just go build it and test it in front of customers.” But, within a month, you’d better move those numbers! I’ve tried to bring some of that boldness with me to Atlassian, especially in the areas where we innovate around platform.
You’ve talked a bit about not being afraid of innovation or trying new things. How do you strike the right balance between innovation and stability?
Mike: I like to say I place an uncomfortable amount of trust in people when it comes to innovation. Meaning, I don’t get to tell them what the solution is. I can only hint at the problem, and I really hope they come up with something great. But it’s about choosing the right people and giving them that freedom. If people feel like they’re just building my vision or someone else’s vision, it takes away the joy, it takes away their spark. I like to make sure that they have that, and I give them a good amount of trust to do it. Do I apply it to everything in platform? Oh, no – not at all. But in the areas where there’s a lot of room to innovate in platform, I would say yeah – I do.