Reimagining tech infrastructure for (and with) agentic AI


IT infrastructure is entering a new phase as AI agents increasingly orchestrate, govern, and scale work across the enterprise. Infrastructure no longer acts solely as a support function or control layer; it becomes the backbone of an orchestrated system that determines how effectively organizations capture value from agentic AI (see sidebar “What are AI agents?”).

With AI workloads expanding, however, IT infrastructure costs are projected to increase two to three times by 2030 while budgets remain flat.1 McKinsey’s latest survey on the state of AI finds that while 62 percent of organizations are experimenting with or piloting AI agents, scaling remains low. In any given business function, no more than 10 percent of respondents say their organizations are scaling AI agents.

For chief technology officers (CTOs), this creates an urgent dual challenge: Upgrade infrastructure so it is fast, scalable, and reliable enough to support agentic AI, while using agentic AI itself to contain the rising cost of doing so (see sidebar “The evolution of IT infrastructure”). Companies must move quickly, given that more than one-third of high performers are committing more than 20 percent of their digital budgets to AI.2

Agentic AI can fundamentally reshape how infrastructure is provisioned, managed, and optimized, with intent-driven connectivity, autonomous operations, and minimal human oversight. Our experience indicates that agentic AI can enable automation of 60 to 80 percent of routine infrastructure work over time, translating to a 20 to 40 percent run-rate cost reduction in initial deployments, with further gains as adoption scales. It will, however, require CTOs to deliver on a raft of needs such as automation, environment simplification, operating model redesign, and active cost governance, while simultaneously improving resilience and delivery speed.

To support this shift, leading organizations are beginning to redesign infrastructure by applying principles more commonly associated with architecture, such as modularity, composability, and orchestration.

Three pressure points for infrastructure

As companies look to scale their agentic AI programs, infrastructure leaders face three structural pressures:

  • Infrastructure must run materially faster and at scale. Innovation in agentic AI is flourishing but often in silos, creating fragmentation that slows the ability to reuse agents and scale. As a result, less than 10 percent of agentic programs reach meaningful scale. At the same time, demands are increasing as developers work faster and as the need to coordinate agents, tools, and data across environments increases. Environments designed for ticket-based workflows cannot sustain this throughput.
  • Non-labor costs are rising rapidly as AI workloads expand. Rapid growth in compute and storage demand (both on-premises and cloud), amplified by agentic AI, is driving a projected two- to threefold increase in IT infrastructure costs by 2030.3 At the same time, infrastructure budgets are expected to remain relatively flat.
  • Outage risk carries greater financial consequence than ever before. Resilience has become inseparable from brand reputation, security, and enterprise risk. Systems are growing more complex as they adapt to the needs of agentic AI, creating more points of failure and making observability and control significantly harder to maintain.

What agent-ready infrastructure looks like

Infrastructure has historically been built primarily for human-led operations. In the agentic era, that is no longer sufficient, and closing the gap at scale requires more than incremental upgrades. Infrastructure must evolve toward a more modular, “mesh-like” design, where agents, tools, and enterprise systems are connected through a shared orchestration layer. This enables coordination across domains while maintaining control and reuse.

To achieve this vision, four foundational capabilities are critical:

  • Repeatable and executable actions through secure APIs. Repeatable actions must be accessible as code with embedded policy checks.
  • Reliable operational data. Clear sources of truth for assets, dependencies, ownership, logs, and metrics reduce ambiguity and enable safe automation. Imperfect data should not prevent progress. Many high-value use cases can be piloted even in environments with inconsistent database fidelity or fragmented repositories.
  • Embedded controls and agent governance. Permission models must define what agents are allowed to do and under what conditions, with clear digital identity, ownership, and accountability for every agent. All actions must be logged, traceable, and auditable, with enforcement of policies across environments. High-impact actions require human approval, supported by supervisory mechanisms to pause or override automated behavior.
  • Agent life cycle management, interoperability, and context. Organizations need a clear inventory of deployed agents, defined scope for each one, performance tracking, and life cycle management. As agents scale, teams must also manage cost and resource consumption explicitly, including monitoring inference usage and execution patterns to avoid unexpected cost spikes. Agents increasingly operate across systems and platforms, requiring interoperable control planes and integration patterns. They also depend on a structured understanding of the IT estate, including dependencies, ownership, and known failure modes, to operate safely and make informed decisions.
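The governance capabilities above (scoped permissions, a named owner for every agent, human approval for high-impact actions, and a complete audit trail) can be illustrated with a minimal Python sketch. All names here (`AgentIdentity`, `ActionGateway`, the action strings) are hypothetical and chosen for illustration; this is one possible shape for such a gate, not a description of any specific product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AgentIdentity:
    """Hypothetical digital identity for one agent."""
    agent_id: str
    owner: str                  # named human accountable for the agent
    allowed_actions: frozenset  # actions this agent is permitted to request


@dataclass
class ActionGateway:
    """Hypothetical policy gate that every agent action passes through."""
    high_impact: frozenset      # actions that always require human approval
    audit_log: list = field(default_factory=list)

    def request(self, agent, action, target, approved_by=None):
        """Return 'executed', 'denied', or 'pending_approval'; log every decision."""
        if action not in agent.allowed_actions:
            decision = "denied"
        elif action in self.high_impact and approved_by is None:
            decision = "pending_approval"
        else:
            decision = "executed"
        # Every request is recorded, including denied and pending ones.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "agent": agent.agent_id,
            "owner": agent.owner,
            "action": action,
            "target": target,
            "approved_by": approved_by,
            "decision": decision,
        })
        return decision


gateway = ActionGateway(high_impact=frozenset({"restart_database"}))
desk_agent = AgentIdentity(
    agent_id="svc-desk-01",
    owner="jane.doe",
    allowed_actions=frozenset({"reset_password", "restart_database"}),
)
print(gateway.request(desk_agent, "reset_password", "user-123"))
print(gateway.request(desk_agent, "restart_database", "db-7"))
```

In this sketch the routine password reset runs immediately, while the database restart waits until a human approver is supplied through `approved_by`. A real deployment would also need a supervisory mechanism to pause or override the agent, which is omitted here.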

Building on this foundation, leading organizations are converging on a more mesh-like approach to infrastructure, where agents, platforms, and systems are interconnected through a shared orchestration layer. In practice, this introduces a set of design principles that shape how infrastructure behaves at scale:

  • Composability: Infrastructure components, agents, and tools can be reused across workflows without rework.
  • Decoupling: Execution, orchestration, and data layers are separated to improve scalability and flexibility.
  • Vendor flexibility: Components can evolve independently, reducing lock-in and preserving optionality.
  • Governed autonomy: Agents operate within defined policies, with clear accountability and escalation paths.

Most organizations already run platforms such as ServiceNow, cloud management tools, network controllers, and observability stacks, many of which are adding AI capabilities. The strategic decision for enterprises is not whether to replace these systems, but how to integrate them into a coherent backbone that enables cross-domain coordination and reuse.

Architecturally, enterprises should retain flexibility in how agents are built and deployed, whether through hyperscaler-native services, leading model providers, or enterprise-hosted models optimized for cost and data sensitivity.

Agentic AI can create the greatest value in five areas

Organizations that successfully adapt their infrastructure for agentic AI focus on a set of high-value domains where automation, simplification, and operating model redesign can create near-term impact. These five areas stand out: service desk, observability and IT service management (ITSM), network operations, hosting operations, and active cost and contract management (exhibit).

Agentic AI can unlock significant value across infrastructure, with the largest impact concentrated in five core domains.

Service desk

Service desk is the largest and “quickest-to-value” area, accounting for 20 to 30 percent of total infrastructure labor spend. High ticket volumes, standardized workflows, and predictable resolution paths make this area especially well suited for agentic automation.

AI agents can autonomously resolve routine high-volume requests such as password resets and account unlocks, while guiding structured ticket intake through self-service interactions. More complex issues can be escalated to humans under clearly defined governance. Agents can also fulfill standard service requests, such as access provisioning, license assignment, and group membership changes, without manual intervention. Organizations can experience 25 to 45 percent savings along with improved service-level-agreement adherence, always-on support, and better employee experience.

In one example, the IT service desk of a multinational enterprise embedded agents across its support model, aiming to boost productivity, enhance user experience, and reduce resolution times for approximately 450,000 tickets annually. The organization redesigned customer journeys and workflows to enable agent-led resolution, prioritizing an AI intake bot, an agent using interactive-voice-response technology, and a proactive infrastructure monitoring capability. The transformation resulted in up to 80 percent of requests being automated, 50 percent of service agent capacity redeployed to higher-value activities, and a customer satisfaction score of 4.8 out of 5.

Observability and IT service management (ITSM)

Observability, ITSM, and infrastructure operations (including both network and hosting operations) together account for 45 to 75 percent of total infrastructure labor spend, reflecting both platform engineering and operational-response activities. Engineers spend a disproportionate share of time responding to alerts as well as manually testing and deploying fixes, limiting their capacity to proactively manage risk.

Agentic AI automates both diagnosis and resolution. Agents continuously correlate logs, metrics, configuration data, and change histories to identify weak signals, anomalies, and emerging failure patterns. When incidents occur, agents automatically provide relevant context, identify root causes, and execute predefined remediation activities within guardrails. Common issues can be resolved autonomously, while engineers engage selectively on complex cases. (See sidebar “Agentic AI in action: Responding to an alarm” for an illustrative example of this workflow.)
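As a rough illustration of correlating an alert with change history and remediating within guardrails, the Python sketch below assumes a hypothetical runbook table and simple dictionary records for alerts and changes. The guardrail rule (escalate whenever a recent change touched the alerting service) is an assumption made for the example, not a rule taken from the article.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from alert type to a pre-approved remediation.
RUNBOOKS = {"disk_full": "rotate_logs", "service_down": "restart_service"}


def recent_changes(alert, changes, window=timedelta(hours=1)):
    """Changes to the alerting service that landed within `window` before the alert."""
    return [
        c for c in changes
        if c["service"] == alert["service"]
        and timedelta(0) <= alert["time"] - c["time"] <= window
    ]


def triage(alert, changes):
    """Pick a predefined remediation or escalate, attaching suspect changes as context."""
    suspects = recent_changes(alert, changes)
    runbook = RUNBOOKS.get(alert["type"])
    # Assumed guardrail: act autonomously only when a runbook exists and no
    # recent change is implicated; otherwise hand the case to an engineer.
    if runbook and not suspects:
        return {"action": runbook, "escalate": False, "context": suspects}
    return {"action": None, "escalate": True, "context": suspects}


now = datetime(2025, 1, 1, 12, 0)
change_log = [{"id": "CHG-1", "service": "billing", "time": now - timedelta(minutes=20)}]
print(triage({"service": "billing", "type": "service_down", "time": now}, change_log))
print(triage({"service": "mail", "type": "disk_full", "time": now}, change_log))
```

The first alert is escalated with the suspect change attached as context, while the second maps to a known runbook and is resolved without an engineer.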

Network operations

Network operations account for approximately 10 to 20 percent of total infrastructure labor spend. Traditionally, enterprise networks were designed for relatively stable traffic patterns and human-led troubleshooting. Network teams operate reactively, with limited ability to proactively intervene upstream of issues.

Agentic AI instead enables real-time, intent-driven management. Agents can ingest and correlate traffic patterns, configuration states, logs, and change histories to detect congestion, abnormal traffic, and emerging risks, and translate higher-level intent into governed network actions. They can autonomously triage network events and execute repeated, routine changes (such as firewall/VLAN updates) in response to policy triggers. By reducing manual efforts in investigation and execution, savings of 20 to 40 percent can be realized in initial deployments, with significantly higher automation potential over time.

One example of this transition at scale is Deutsche Telekom’s agentic network implementation, the “RAN Guardian agent.”4 Operating in the context of network events and exceptional situations, agents actively monitor mobile-network performance, assist in troubleshooting, and optimize solutions.

Hosting operations

Hosting operations, which include on-premises, DevOps (software development and IT operations), and cloud compute and storage activities, account for approximately 15 to 25 percent of total infrastructure labor spend. Hosting operations remain dominated by repetitive life cycle activities in capacity management, patching, and environment provisioning that are still coordinated through tickets and manual intervention.

Agentic AI shifts hosting operations to closed-loop environment control. Agents can continuously assess system health, configuration drift, and policy compliance across on-premises, DevOps, and cloud environments. By standardizing run-time environments, exposing life cycle actions through APIs, and embedding policy as code, agents can autonomously handle routine activities such as rightsizing and maintaining capacity across environments. Executed well, this unlocks 20 to 40 percent in savings in initial deployments, with significantly higher automation potential over time.
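Two of the checks mentioned here, configuration drift and rightsizing, can be sketched in a few lines of Python. The function names, the flat key-value configuration format, and the utilization thresholds are all illustrative assumptions; real environments would draw these from their own policy-as-code definitions.

```python
def detect_drift(desired, actual):
    """Return {key: (desired_value, actual_value)} for every setting that differs."""
    keys = desired.keys() | actual.keys()
    return {
        k: (desired.get(k), actual.get(k))
        for k in sorted(keys)
        if desired.get(k) != actual.get(k)
    }


def rightsize(cpu_utilization, current_vcpus, low=0.2, high=0.8):
    """Suggest a vCPU count from average utilization (thresholds are assumed)."""
    avg = sum(cpu_utilization) / len(cpu_utilization)
    if avg < low and current_vcpus > 1:
        return current_vcpus // 2   # underused: halve the allocation
    if avg > high:
        return current_vcpus * 2    # saturated: double the allocation
    return current_vcpus            # within band: leave unchanged


desired_state = {"tls": "1.3", "size": "m"}
actual_state = {"tls": "1.2", "size": "m", "debug": True}
print(detect_drift(desired_state, actual_state))
print(rightsize([0.1, 0.1, 0.1], current_vcpus=8))
```

An agent running in a closed loop would evaluate checks like these continuously and submit any resulting change through the same policy gate that governs its other actions.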

In one example, a leading utilities provider is modernizing its hosting operations through a combination of standard and agentic automation. A bottom-up assessment revealed that several areas, such as capacity management, were labor intensive and repetitive. The organization scaled infrastructure as code (IaC) for provisioning and is introducing AI agents to analyze multivariable trade-offs and orchestrate cross-functional workflows. As a result, approximately 15 percent of potential run-rate savings were identified through IaC and roughly an additional 20 percent of potential run-rate savings were identified through agentic AI.

Active cost and contract management

A significant share of infrastructure spend, which often accounts for 40 to 60 percent of total technology spend, is tied to external services such as cloud, software, and vendor contracts. This spend is typically managed through periodic reviews. As AI scales, costs become granular, multidimensional, and nonlinear, requiring stricter governance.

Agentic AI shifts cost optimization from periodic review to continuous management, unlocking 5 to 15 percent in savings. Agents can ingest real-time usage data, license records, contracts, and pricing benchmarks to automatically rightsize infrastructure, reclaim unused licenses, enforce budget guardrails, and flag uneconomic configurations. In parallel, procurement agents can monitor vendor performance, benchmark rates from historical data, validate invoices, and surface renegotiation triggers by comparing cost models with actual demand.
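To make continuous cost management concrete, the Python sketch below shows two simple checks an agent might run daily: finding licenses idle long enough to reclaim, and forecasting month-end spend against a budget. The 90-day idle threshold, the linear run-rate forecast, and the record layout are assumptions for illustration only.

```python
from datetime import date, timedelta


def reclaimable_licenses(licenses, today, idle_days=90):
    """IDs of licenses whose last recorded use is older than `idle_days`."""
    cutoff = today - timedelta(days=idle_days)
    return [lic["id"] for lic in licenses if lic["last_used"] < cutoff]


def budget_status(month_to_date_spend, monthly_budget, day_of_month, days_in_month):
    """Linear run-rate forecast for the month, compared with the budget."""
    forecast = month_to_date_spend / day_of_month * days_in_month
    return {"forecast": forecast, "breach": forecast > monthly_budget}


licenses = [
    {"id": "LIC-1", "last_used": date(2025, 1, 1)},
    {"id": "LIC-2", "last_used": date(2025, 5, 20)},
]
print(reclaimable_licenses(licenses, today=date(2025, 6, 1)))
print(budget_status(60_000.0, 100_000.0, day_of_month=15, days_in_month=30))
```

A linear forecast is deliberately naive. Because AI workload costs can be nonlinear, a production guardrail would likely need a richer demand model, but the control pattern of forecasting, comparing, and flagging stays the same.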

The CTO agenda: The first 90 days

Building an agent-ready infrastructure is not a quick fix, but the first 90 days are critical to setting direction, proving value, and building credibility. The most important actions include the following:

  • Redesign targeted processes. Select one area with high volume, clear performance pain points, and strong potential for repeatable execution, such as service desk operations and incident management. Deconstruct the workflow into its component tasks and redesign the process so that routine tasks are executed automatically within defined boundaries, while engineers intervene when judgment or creativity is required. This redesign often simplifies the process itself.
  • Strengthen operational data. Agents cannot compensate for inconsistent system records or unclear ownership. A practical starting point is to clarify the source of truth for assets, configurations, and dependencies. Standard naming conventions, consistent schemas, and explicit ownership reduce ambiguity. Determine if the underlying data and knowledge about the infrastructure are structured and consistent enough for machines to interpret and reuse.
  • Establish strong operating and governance practices. Before agents are allowed to execute changes in production environments, CTOs need a clear framework that defines permissible actions, escalation thresholds, and accountability. Each agent should have a named owner, with clarity around which decisions can be made autonomously and which require review. Logging and audit capabilities must be comprehensive (see sidebar “The human and operating model shift behind agentic infrastructure”).
  • Put in place explicit agent management practices. A formal registry that documents each agent’s purpose, scope, and performance prevents fragmentation. Life cycle management ensures that outdated or redundant agents are retired. Visibility into performance and cost helps the organization understand where value is being created.
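A formal agent registry of the kind described in the last item can start out very small. The Python sketch below records purpose, scope, owner, performance, and cost for each agent, and retires agents that underperform. The class names, fields, and retirement thresholds are hypothetical and meant only to show the shape of such a registry.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRecord:
    """Hypothetical registry entry for one deployed agent."""
    name: str
    purpose: str
    scope: str
    owner: str
    runs: int = 0
    successes: int = 0
    cost_usd: float = 0.0
    retired: bool = False

    @property
    def success_rate(self):
        return self.successes / self.runs if self.runs else 0.0


@dataclass
class AgentRegistry:
    agents: dict = field(default_factory=dict)

    def register(self, record):
        # Reject duplicates so the same agent is never documented twice.
        if record.name in self.agents:
            raise ValueError(f"agent {record.name!r} already registered")
        self.agents[record.name] = record

    def record_run(self, name, success, cost_usd):
        agent = self.agents[name]
        agent.runs += 1
        agent.successes += int(success)
        agent.cost_usd += cost_usd

    def retire_underperformers(self, min_runs=10, min_success_rate=0.5):
        """Mark agents with enough history and a low success rate as retired."""
        retired = []
        for agent in self.agents.values():
            if (not agent.retired and agent.runs >= min_runs
                    and agent.success_rate < min_success_rate):
                agent.retired = True
                retired.append(agent.name)
        return retired


registry = AgentRegistry()
registry.register(AgentRecord("pw-reset", "reset passwords", "service desk", "jane.doe"))
registry.record_run("pw-reset", success=True, cost_usd=0.25)
print(registry.agents["pw-reset"].success_rate)
```

Tracking cost next to success rate in the same record gives the visibility into performance and spend that the registry is meant to provide.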

For the first time in decades, CTOs can redefine how infrastructure is built and how work is executed. Those who treat agentic AI as an incremental automation layer will see only localized gains. Those who build agent-ready foundations and reimagine infrastructure and operations will fundamentally change their organization’s speed, resilience, and economics. Over time, infrastructure will shift from a supporting function to the platform that orchestrates and governs how work is executed across the enterprise, coordinating networks of agents at scale.
