Healthy monitoring practices are complemented by advanced tools that align with the DevOps/YBIYRI culture. Teams need to identify and implement monitoring tools alongside the well-understood developer tools: code repositories, IDEs, debuggers, defect trackers, continuous integration tools, and deployment tools.
A single pane of glass provides a comprehensive view of the various applications, services, and infrastructure dependencies, not only in production but also in staging. It lets teams provision, ingest, tag, view, and analyze the health of complex distributed environments. For example, Atlassian’s internal PaaS, Micros, includes a tool called microscope that presents all the information about services in a concise, discoverable manner.
Application performance monitoring is essential to ensure that application-specific performance indicators, such as time to load a page, latencies of downstream services, or transitions, are monitored in addition to basic system metrics such as CPU and memory utilization. Tools such as SignalFx and New Relic are great for observing metrics data in real time.
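To make the idea of an application-specific indicator concrete, here is a minimal sketch of recording the latency of a named operation (a page load, a downstream call) and reading back a percentile. The `LatencyRecorder` class and the `page_load` metric name are hypothetical illustrations, not part of SignalFx, New Relic, or any Atlassian tool; a real setup would typically forward these samples to a monitoring backend rather than keep them in memory.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager


class LatencyRecorder:
    """Hypothetical in-memory recorder for application-level latencies."""

    def __init__(self, clock=time.perf_counter):
        # The clock is injectable so the recorder can be tested deterministically.
        self._clock = clock
        self._samples = defaultdict(list)

    @contextmanager
    def timed(self, name):
        """Time the enclosed block and store its duration under `name`."""
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the timed block raises.
            self._samples[name].append(self._clock() - start)

    def percentile(self, name, pct):
        """Return the nearest-rank percentile (in seconds) for `name`."""
        samples = sorted(self._samples[name])
        if not samples:
            raise ValueError(f"no samples recorded for {name!r}")
        rank = max(1, math.ceil(pct / 100 * len(samples)))
        return samples[rank - 1]
```

A caller would wrap the operation of interest in `with recorder.timed("page_load"):` and periodically report, say, the 95th percentile alongside CPU and memory metrics.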
Implement different types of monitors during development, including monitors for errors, transactions, synthetic checks, heartbeats, alarms, infrastructure, capacity, and security. Be sure that every team member is trained in these areas. These monitors are often application-specific and need to be implemented based on the requirements of each application. For example, our Opsgenie development team implements synthetic monitors that create an alert or incident and check whether the alert flow executes as expected (i.e., whether integrations, routing, and policies work correctly). We also implement synthetic monitors for infrastructure dependencies that periodically verify the functionality of various AWS services.
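The synthetic alert-flow check described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the Opsgenie team's actual implementation: `create_alert` and `was_delivered` are hypothetical callables that you would wire to your own alerting system and notification channel, and the timeout and polling interval are arbitrary defaults.

```python
import time
import uuid
from dataclasses import dataclass


@dataclass
class SyntheticResult:
    ok: bool
    alert_id: str
    elapsed_s: float
    detail: str


def run_synthetic_check(create_alert, was_delivered, timeout_s=30.0, poll_s=1.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Create a uniquely tagged test alert and wait for it to be delivered.

    create_alert(alert_id)  -- hypothetical hook that raises a test alert
    was_delivered(alert_id) -- hypothetical hook that returns True once the
                               alert has passed through routing and policies
    """
    # A unique ID lets the check tell its own alert apart from real ones.
    alert_id = f"synthetic-{uuid.uuid4()}"
    start = clock()
    create_alert(alert_id)
    while True:
        elapsed = clock() - start
        if was_delivered(alert_id):
            return SyntheticResult(True, alert_id, elapsed, "alert flow completed")
        if elapsed >= timeout_s:
            return SyntheticResult(False, alert_id, elapsed,
                                   "timed out waiting for delivery")
        sleep(poll_s)
```

Running such a check on a schedule, and alerting through a separate channel when `ok` is false, gives early warning that integrations, routing, or policies have stopped working.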
Use an alert and incident management system that integrates seamlessly with your team’s tools (log management, crash reporting, etc.) so that it fits naturally into your team’s development and operational rhythm. The tool should deliver important alerts to your preferred notification channel(s) with the lowest possible latency. It should also be able to group alerts to reduce noise, especially when a single error or failure generates several alerts. At Atlassian, we not only offer Opsgenie as a product that provides these capabilities to our customers, but also use it internally to ensure that we have a robust, flexible, and reliable alert and incident management system integrated with our development practices.
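As a rough illustration of alert grouping, the sketch below collapses alerts that share a source and an error fingerprint and arrive within a time window of each other. This is one simple grouping strategy chosen for illustration, not a description of how Opsgenie groups alerts; the `Alert` fields, the `group_alerts` function, and the 300-second window are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    source: str        # e.g. the service that raised the alert
    fingerprint: str   # e.g. a normalized error type
    timestamp: float   # seconds since some epoch
    message: str = ""


@dataclass
class AlertGroup:
    key: tuple
    first_seen: float
    last_seen: float
    alerts: list = field(default_factory=list)


def group_alerts(alerts, window_s=300.0):
    """Group alerts with the same (source, fingerprint) that arrive
    within `window_s` seconds of the previous alert in that group."""
    groups = []
    latest = {}  # key -> most recent group for that key
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.source, alert.fingerprint)
        group = latest.get(key)
        # Start a new group if none exists or the last one has gone quiet.
        if group is None or alert.timestamp - group.last_seen > window_s:
            group = AlertGroup(key, alert.timestamp, alert.timestamp)
            groups.append(group)
            latest[key] = group
        group.alerts.append(alert)
        group.last_seen = alert.timestamp
    return groups
```

With this approach, a burst of identical connection errors from one service produces a single group to notify on, while the same error recurring much later opens a new group.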