Boosting IT resilience efforts through application performance monitoring

By Saurabh Aggarwal, Han Gu, Arun Gundurao, and Jorge Machado

This is the second in a series of posts on IT resilience. In our previous post, we looked at the case for IT resilience and shared our seven-point manifesto that can help organizations build it. In this post, we examine application performance monitoring (APM) and how organizations can harness its full potential.

IT resiliency requires organizations to rigorously monitor alerts and act on them effectively. However, a growing digital footprint, multiple customer channels, a higher volume of data, an increasingly complex technology environment, and customer demands for 24/7 availability are leading to operational complexities, such as APM tool fragmentation, disparate logs, and a high volume of alerts. As a result, traditional monitoring techniques and APM solutions are becoming inadequate to detect anomalies and fix them before they become outages.

In conversations on IT resiliency, one usually hears the remark, “If only all our teams used this same best-in-class APM tool, we would be all set.” While sophisticated tools are important, organizations must also establish integrated processes to improve data access and the uptime and availability of applications and to gain critical insights into customer experience in order to set priorities. For example, teams and organizations increasingly rely on monitoring and logging data, but the underlying data sets often remain disconnected, making it very hard to translate data into valuable insights.

Improving APM capabilities

Organizations looking to improve their APM capabilities should consider taking five actions.

1. Strengthen APM capabilities in the context of critical customer journeys

Traditionally, IT resiliency efforts have taken an application-based approach, focusing solely on business-critical applications. However, as components become more interconnected, user journeys often involve several applications that need to be monitored at once. To properly prioritize issues and allocate resources, organizations will have to shift to a journey-centered approach and quantify the business value of each journey for both internal and external customers. They must then map out vulnerabilities and costs associated with system outages for each journey, identifying all critical assets and applications and the single points of failure by journey. These can then further be tied to business metrics. Core measurements for customer-facing products could include the number and types of users affected (by region and device, for example) and the impact on user experience (such as a button not working versus an app failing to open).
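The journey-centered mapping described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed model: the `Application` and `Journey` classes, the `redundant` flag, and the `hourly_value` estimate are hypothetical names introduced here, and ranking journeys by value at risk is just one simple way to set priorities.

```python
from dataclasses import dataclass, field


@dataclass
class Application:
    name: str
    redundant: bool  # hypothetical flag: True if a failover instance exists


@dataclass
class Journey:
    name: str
    hourly_value: float  # hypothetical estimate of business value at risk per outage hour
    applications: list = field(default_factory=list)

    def single_points_of_failure(self):
        """Return the names of applications in this journey that have no redundancy."""
        return [app.name for app in self.applications if not app.redundant]


def rank_journeys(journeys):
    """Order journeys from highest to lowest business value at risk."""
    return sorted(journeys, key=lambda journey: journey.hourly_value, reverse=True)


# Illustrative data only; the journeys, applications, and values are invented.
example_journeys = [
    Journey("onboarding", 5_000.0, [Application("identity-service", redundant=False)]),
    Journey(
        "checkout",
        50_000.0,
        [
            Application("cart-service", redundant=True),
            Application("payments-api", redundant=False),
        ],
    ),
]

for journey in rank_journeys(example_journeys):
    print(journey.name, journey.hourly_value, journey.single_points_of_failure())
```

In practice, the value estimate would come from the business metrics mentioned above, such as the number and types of users affected.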

2. Build an integrated view of the health of your technology environment

IT monitoring is an $11.8 billion industry with a healthy mix of companies that offer both vertical-specific and industry-agnostic tools. These tools can be classified into four main categories: infrastructure monitoring, classic APM, digital-experience monitoring (DEM), and internal-experience monitoring (Exhibit 1). Most teams usually adopt a tool in one of these categories depending on the characteristics they value most—for example, infrastructure teams may want to track storage, whereas product teams may care about user response times.

Exhibit 1. The IT monitoring landscape can be divided into four broad categories based on monitoring depth and vantage point.

As a result, organizations struggle to integrate alerts from the tools in these different categories that different teams have adopted, and they often lack a single shared view across those tools. Consolidating alerts into a “single-pane-of-glass view” contextualized by journeys can speed up response times and decision making, leading to improved resiliency. Organizations should strive toward this integrated view of alerts across the ecosystem while exploring consolidation of tools to make the integration easier.
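One way to picture a journey-contextualized view of alerts is a small normalization step that maps alerts from different tools onto the journeys they affect. The sketch below is illustrative only: the `Alert` fields, the 1-to-5 severity scale, the tool names, and the `APP_TO_JOURNEYS` mapping are all assumptions made for the example, not features of any particular product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source_tool: str  # hypothetical tool name, e.g. "infra-monitor"
    application: str
    severity: int  # assumed scale: 1 (informational) to 5 (critical)
    message: str


# Hypothetical mapping from applications to the customer journeys they support.
APP_TO_JOURNEYS = {
    "payments-api": ["checkout"],
    "identity-service": ["checkout", "onboarding"],
}


def group_alerts_by_journey(alerts, app_to_journeys):
    """Group alerts from any tool by affected journey, most severe first."""
    view = defaultdict(list)
    for alert in alerts:
        # Alerts for applications with no known journey are kept, not dropped.
        for journey in app_to_journeys.get(alert.application, ["unmapped"]):
            view[journey].append(alert)
    for journey_alerts in view.values():
        journey_alerts.sort(key=lambda alert: alert.severity, reverse=True)
    return dict(view)


example_alerts = [
    Alert("infra-monitor", "payments-api", 3, "disk usage above 85%"),
    Alert("apm", "identity-service", 5, "login latency degraded"),
    Alert("dem", "legacy-portal", 2, "slow page load"),
]

for journey, journey_alerts in group_alerts_by_journey(example_alerts, APP_TO_JOURNEYS).items():
    print(journey, [(a.source_tool, a.severity) for a in journey_alerts])
```

Keeping an explicit "unmapped" bucket is a deliberate choice in this sketch: it surfaces applications that have not yet been tied to a journey rather than hiding their alerts.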

3. Build integrated site-reliability-engineering (SRE) and DevSecOps teams

Integration across processes and APM tools should be supported by integrated teams that bring together product managers, software engineers, DevSecOps engineers, IT infrastructure professionals, and business stakeholders who can respond to and address alerts for the customer journey. The single-pane-of-glass view requires diverse skill sets across teams and layers of the tech stack to troubleshoot issues. Business stakeholders often need to be involved in response to customer communications, for example. Integrated SRE teams and the corresponding product teams should be fully responsible for meeting service-level-agreement (SLA) requirements and becoming experts in infrastructure and application resiliency for specific journeys. This approach will merge personnel tasked with monitoring and problem management, reduce friction, and improve incident response times. Over time, these teams should be closely aligned with product teams and continually improve products through infrastructure and operations automation, monitoring, and best practices for managing capacity, code changes, and incidents.

4. Adopt AIOps capabilities

Once organizations have established their requirements and use cases, the next step is to develop and adopt a holistic approach to AI operations (AIOps). General monitoring and notification capabilities have become table stakes as industry leaders use more centralized data architectures to analyze data across the enterprise. Organizations that properly adopt AIOps can unlock the full potential of machine learning and alert-correlation technologies to make sense of the increasing amounts of observability data, predict incidents, reduce false positives, identify root causes, and even perform self-healing. A self-healing workflow could include, for example, anomaly detection, workflow automation for ticket creation, self-healing orchestrator triggers, Ansible playbook execution, and ticket closure.
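The self-healing sequence listed above (detect, open a ticket, trigger remediation, close the ticket) could be sketched as follows. This is a simplified illustration under stated assumptions: a basic standard-deviation check stands in for the machine-learning models an AIOps platform would use, and `open_ticket`, `run_remediation`, and `close_ticket` are hypothetical callables standing in for a ticketing system and an orchestrator (which might, for instance, run an Ansible playbook).

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it is more than `threshold` standard deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    spread = stdev(history)
    if spread == 0:
        return latest != history[0]
    return abs(latest - mean(history)) / spread > threshold


def self_heal(history, latest, open_ticket, run_remediation, close_ticket):
    """Run one pass of a detect -> ticket -> remediate -> close workflow.

    The three callables are placeholders for real integrations (assumed here).
    Returns "healthy", "remediated", or "escalated".
    """
    if not is_anomalous(history, latest):
        return "healthy"
    ticket_id = open_ticket(f"Anomalous metric value: {latest}")
    if run_remediation():
        close_ticket(ticket_id)
        return "remediated"
    # Remediation failed: leave the ticket open for a human responder.
    return "escalated"


# Illustrative usage with in-memory stand-ins for the external systems.
response_times_ms = [100.0, 102.0, 98.0, 101.0, 99.0]
log = []
outcome = self_heal(
    response_times_ms,
    150.0,
    open_ticket=lambda summary: log.append("opened") or "T-1",
    run_remediation=lambda: True,
    close_ticket=lambda ticket_id: log.append(f"closed {ticket_id}"),
)
print(outcome, log)
```

Passing the integrations in as callables keeps the workflow testable without a live ticketing system or orchestrator.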

To capture this value, teams must adopt an enterprise-wide approach to AIOps. This includes not only establishing a standard instance and taxonomy but also upskilling the advanced analytics talent, refining their operating model, and determining whether to buy or build a solution (Exhibit 2). Only then will teams attain full-stack, end-to-end observability.

Exhibit 2. Executives should ask themselves some key questions when adopting AIOps.

5. Implement chaos engineering and war-gaming techniques to pressure-test APM capabilities

Traditional quality analysis covers only the application layer and doesn’t test the unique and constantly changing nature of production environments. Chaos engineering simulates and tests a system’s resiliency across a comprehensive range of scenarios—including infrastructure components, external and internal dependencies, and the people and processes behind them—in an isolated environment. By testing and even breaking the systems with worst-case scenarios, organizations can identify and address weaknesses in their tech stack and be better prepared for actual incidents. Chaos engineering will help an organization to improve its APM capabilities, such as by modifying thresholds proactively and building predictive analytics by linking alerts from different systems or tools. Organizations that practice and perfect how teams identify issues with APM tools can respond more effectively to high-priority incidents.
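A toy version of such an experiment shows how chaos testing can reveal a monitoring threshold that is too loose. Everything in this sketch is simulated and hypothetical: `SimulatedService`, the latency figures, and the alert rule are invented for illustration, and a real experiment would inject faults into an isolated environment rather than mutate an in-memory object.

```python
from dataclasses import dataclass


@dataclass
class SimulatedService:
    name: str
    latency_ms: float
    available: bool = True


def latency_alert(service, threshold_ms):
    """Assumed alert rule: fire if the service is down or slower than the threshold."""
    return (not service.available) or service.latency_ms > threshold_ms


def run_latency_experiment(service, injected_ms, threshold_ms):
    """Inject extra latency, record whether the alert fires, then restore the baseline."""
    baseline = service.latency_ms
    fired_before = latency_alert(service, threshold_ms)
    service.latency_ms = baseline + injected_ms
    try:
        fired_during = latency_alert(service, threshold_ms)
    finally:
        service.latency_ms = baseline  # always undo the injected fault
    return {"fired_before": fired_before, "fired_during": fired_during}


payments = SimulatedService("payments-api", latency_ms=120.0)

# With a 1,000 ms threshold, a 400 ms slowdown goes undetected.
print(run_latency_experiment(payments, injected_ms=400.0, threshold_ms=1000.0))

# Tightening the threshold to 300 ms catches the same fault.
print(run_latency_experiment(payments, injected_ms=400.0, threshold_ms=300.0))
```

The first run is the kind of finding that prompts a team to modify thresholds proactively, before a real incident exposes the gap.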

APM in action

One large technology company was trying to tackle many of these issues. It had already undergone an SRE transformation initiative and successfully created integrated teams of developers, architects, and SREs. However, it was grappling with the growing complexity of its IT architecture and underlying monitoring data.

The organization brought business and IT together to set priorities and select the appropriate APM solution. Simply signing the contract wasn’t enough, however. Product teams had independently implemented their own instances of the solution and used their own tagging systems, which had varying levels of maturity in capabilities and adoption. As a result, many of the monitoring challenges remained, with teams looking at different and often incomplete data. Leaders embarked on a yearlong, organization-wide effort to standardize processes and ensure consistent adoption. A series of 15 to 20 workshops was conducted with SREs and site-reliability owners (SROs) to understand the different tools in use. The solutions team simplified the tool stack and conducted road shows with the product teams to educate them on best practices and proper implementation. SROs made sure that the recommendations were implemented. This improved not only the overall resiliency outcomes but also the developer experience.

The importance, urgency, and benefits of investing in APM tools, prioritizing journeys, redesigning processes, and building out resiliency roles are crystal clear. To capture the expected value from these measures, organizations must also be deliberate in their implementation and ensure they have the proper capabilities and processes in place. Such efforts will translate into greater visibility and enhanced IT resiliency.

Saurabh Aggarwal is a director of engineering in McKinsey’s New York office, where Arun Gundurao is an associate partner and Jorge Machado is a partner; Han Gu is a consultant in the New Jersey office.

The authors wish to thank Ritesh Agarwal, Sven Blumberg, Vish Narayanan, Chandrasekhar Panda, and Adi Pradhan for their contributions to this post.