Production systems typically consume a lot of engineering time, especially when incidents require teams to navigate fragmented tools, infrastructure, and software environments. However, that same complexity has made it difficult to automate much of the process. Spiros Xanthos, CEO and cofounder of Resolve AI, says that AI can fundamentally change how (and how effectively) teams manage this critical aspect of software operations, improving reliability and allowing engineers to focus more on broader systems work. McKinsey Senior Partner Martin Harrysson and McKinsey Partner Prakhar Dixit spoke with Xanthos about the role of AI in production systems, the operational impact of AI-generated code, and how engineering teams and workflows may evolve.
This interview has been edited for length and clarity.
Prakhar Dixit: You and your Resolve AI cofounder, Mayank Agarwal, have spent a long time building development and observability tools. What is Resolve AI, and how did your respective backgrounds lead you here?
Spiros Xanthos: Mayank and I met more than 20 years ago in grad school at the University of Illinois Urbana-Champaign. We have been working together since 2012, including cocreating OpenTelemetry and building companies that were later acquired by VMware and Splunk. Most recently, we led Splunk Observability—me as general manager and Mayank as chief architect.
Over that period, we saw that most engineering time was spent maintaining and running existing software. Until recent advances in large language models, it was difficult to imagine automating much of that work. About two and a half years ago, Mayank and I began exploring whether AI could help teams operate production systems more effectively. That became Resolve.
Martin Harrysson: You are focusing on a part of the software life cycle that consumes enormous engineering time but has received less attention in the AI discussion so far. What makes production environments so difficult to automate?
Spiros Xanthos: Even with coding, models become much less effective once they move into real enterprise environments. Large organizations typically have multiple generations of software, hybrid infrastructure across clouds and on-premises systems, and many different tools holding fragmented operational data across monitoring, observability, infrastructure, source code, and changes. Since no single system or team has a complete picture, maintaining reliable production systems often requires infrastructure engineers, application developers, IT, and security teams to work together across those environments and tools.
The challenges become even greater during production incidents, especially in customer-facing systems. If you are on call during an incident, when many things may be failing simultaneously, you are debugging under significant time pressure.
That is where we started with Resolve. We believed that if AI could help teams navigate those types of situations more effectively, it could improve reliability while also giving developers more confidence to move faster elsewhere in the software development life cycle, knowing that any potential issues would more likely be caught and fixed effectively.
Prakhar Dixit: How does AI change the production challenge?
Spiros Xanthos: Long before AI-generated code, enterprises were already spending a significant amount of engineering time maintaining and running production systems. On-call responsibilities alone can account for 20 to 30 percent of a developer’s time. In many legacy environments, broader work just to keep the lights on can consume well over 50 percent of engineering capacity.
What changes with AI is the scale and pace of software development. A lot more code is being generated, and developers are often less familiar with the systems they are shipping, likely resulting in lower-quality code. So far, the companies that have adopted AI-led coding most aggressively have typically been smaller and more AI-forward. Larger enterprises have not yet seen the full effects of that tsunami of code, and they will increasingly need AI to help manage the operational complexity that comes with it.
Martin Harrysson: That suggests AI could also change how teams respond when something breaks. Are you starting to see that, and what does it look like?
Spiros Xanthos: I think we are already seeing some collapse of roles across engineering teams. Traditionally, if an application developer was on call and something went wrong in infrastructure, they would rely on a platform team to investigate it. Tools like Resolve offer much better self-service capabilities across the entire stack. An application developer can understand whether something is an infrastructure issue and even help debug it directly.
There is also less need for large war rooms with many different specialists coming together to troubleshoot problems. Developers do not have to be experts in every observability tool or infrastructure interface to understand what is happening in production.
More broadly, distributed systems expertise can increasingly be built into AI. That allows engineers who are sophisticated in systems thinking to do much more across the stack.
Prakhar Dixit: Large enterprises are trying to adapt while the landscape is still changing very quickly. What do you see from customers as they try to navigate that uncertainty?
Spiros Xanthos: There is a lot of noise right now around what tools to use and where the landscape ultimately settles. Even in Silicon Valley, people are still trying to figure that out. With customers, I see a lot of confusion. Many organizations do not know what the world will look like in just three to six months, which makes it very difficult to make long-term decisions.
At Resolve AI, we have a specific perspective on how AI will reshape production operations and software reliability workflows. But companies are also looking for broader perspectives on how the product development life cycle will evolve, including the tooling, organizational changes, and ways of working needed to adopt these technologies successfully. While the pace of change can feel overwhelming, the worst thing we can do is become paralyzed by it.
Prakhar Dixit: Large enterprises cannot rebuild their entire environments every time technology shifts. How does that shape the way you think about AI in production?
Spiros Xanthos: Earlier in my career, a big part of the observability thesis was that if you centralized all your operational data into one platform, you could significantly improve reliability. I still think that is true, but in large enterprises it is a massive undertaking. You have to move legacy tools and monitoring systems and change instrumentation across environments.
However, AI in production changes that equation because it can meet enterprises where they are. You do not necessarily have to rebuild or standardize everything first. Instead of replacing existing systems, AI can work across fragmented environments and make better use of the tools enterprises already have.
That gives enterprises more flexibility to improve reliability and operational velocity without having to transform their entire environment up front.
Martin Harrysson: AI in production can create a lot of operational value, but it also introduces new questions around trust and risk. How are enterprises approaching that balance today?
Spiros Xanthos: There has been a dramatic change over the past few years. When we started the company, we did not even know whether enterprises would be open to using AI in production at all. Today, most organizations we talk to believe AI will play a role in running production systems, although there are still concerns around security, data access, and operational risk.
Different organizations are moving at very different speeds, and one of the lower-risk areas where we are already seeing a lot of interest is triaging and investigation rather than automated remediation. A lot of engineering time goes into simply figuring out where a problem originated before anyone acts.
We are also seeing value in customer-reported issues. Such help requests often sit in queues for days before someone picks them up, but Resolve can be automatically alerted and start investigating immediately.
Martin Harrysson: As the AI landscape evolves, what do you think is still underappreciated about applying AI to production systems?
Spiros Xanthos: One thing is how important deep domain understanding will be. The large language model providers are building very powerful general-purpose models, but production systems are extremely specialized environments.
Our view is that you can build domain-specific models and agents that are optimized in different ways for software operations, whether through latency, cost, or quality of outcomes.
In production environments, the challenge is not just generating a single answer. It is understanding systems, investigating problems across many sources of information, and reasoning through operational complexity. That is why we believe the combination of models, agents, and deep domain expertise will matter so much in AI for production.


