The technology shifts reducing AI inference costs

The race to establish infrastructure for AI is driving one of the most significant capital mobilizations in history. In 2026, the four leading hyperscalers (Amazon, Google, Meta, and Microsoft) are committing more than $700 billion in combined capital expenditure, with a substantial majority directed at AI infrastructure. This figure would have been difficult to contemplate just three years ago. The investments span data center construction, accelerator procurement, and networking buildout on a massive scale, underscoring how compute has become a strategic asset. The new infrastructure is creating extraordinary and sustained demand across the entire semiconductor value chain for chips and other components that enable AI processes.

Two of the most important AI processes are training and inference. Training—either one-time or periodic—is the computationally intensive process of building a model by exposing it to large data sets. Inference is the ongoing process of running a trained model to respond to user queries. A single large language model (LLM),¹ once trained, can be queried billions of times per day. The cost of each query is typically small, but it compounds at scale. Historically, training has accounted for most AI compute spending, but the balance is now shifting toward inference.

This dynamic has brought a once-theoretical question to the forefront: How can AI inference be made economically sustainable at the scale demanded by enterprise and consumer applications? Furthermore, how can the energy demand of AI computing be met or reduced? These questions underscore the significant pressures affecting the supply side of the AI economics equation and contributing to spiking AI costs.

Given the magnitude of efficiency gains required, no single breakthrough is likely to deliver the step change needed to achieve positive margins while maintaining frontier-model performance. Instead, meaningful progress will depend on a coordinated wave of innovation across the entire supply chain—from software-level model optimization to advances in silicon architecture, advanced packaging, memory systems, and optical interconnects.

We assessed 13 of the most promising technology levers related to AI computing and estimated their potential to reduce inference costs at scale. We also examined two other factors (architecture evolution and chip development timelines) that may affect how quickly and thoroughly companies can reduce inference costs, as well as new compute paradigms that might become an option in 2030 or later.

AI infrastructure is changing fast. In this article, we highlight the technologies most likely to shape AI inference economics and the implications for technologists, investors, and business leaders along the entire supply chain. The discussion that follows is necessarily technical at times because many of the biggest cost-reduction opportunities will come from trade-offs deep within the compute stack.

Also note that the industry faces the dual challenge of reducing the cost of intelligence while expanding the range of applications that can be deployed economically at scale. This article examines the first part of the puzzle: the technology innovations reshaping inference economics, from model optimization and custom silicon to advanced packaging and networking. A companion article, “From scale to intelligence: The role of smarter compute in the evolution of AI,” to be published shortly, will explore how the efficiency gains can be translated into broader business impact through smarter allocation of compute resources, data-aware architectures, and new approaches to multimodal and agentic AI.

The inference adoption challenge

Over the years, advances in graphics processing unit (GPU) technology have enabled improvements in deep learning and AI models. While exponential model growth makes traditional GPU clustering more costly and energy intensive, the industry has partially offset these increases by relying on architectural innovations, custom application-specific integrated circuits (ASICs), and software efficiencies. As silicon miniaturization approaches its physical limits, however, the industry will not be able to make advances without new compute methodologies (exhibit). Experts anticipate that these advances may focus on non-von Neumann architectures, such as optical or neuromorphic computing. These next-generation methodologies are still years away from broad commercial impact, but they could eventually redefine AI efficiency.

New compute methodologies are required to drive significant improvement in AI performance.

What is a token?

A token is a basic unit of information processed by an AI model, and the standard measure of inference workload and cost. What constitutes a token depends on the modality and model architecture. In text-based LLMs, tokens are typically word fragments, punctuation, or short character sequences. In image models, tokens may represent image patches; in audio models, short sound segments; and in video models, compressed frame representations. AI providers typically bill usage in millions of input and output tokens.
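As a rough illustration of per-million-token billing, the sketch below computes the cost of one request and scales it to a daily volume. The prices and the request volume are hypothetical placeholders, not any provider's actual rates, and the function name is illustrative only.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     usd_per_million_input: float,
                     usd_per_million_output: float) -> float:
    """Cost of one request when usage is billed per million tokens."""
    return (input_tokens * usd_per_million_input
            + output_tokens * usd_per_million_output) / 1_000_000


# Hypothetical prices: $1.00 per million input tokens, $4.00 per million output tokens.
per_request = request_cost_usd(2_000, 500, 1.00, 4.00)

# Hypothetical volume of one million requests per day, to show how small costs compound.
per_day = per_request * 1_000_000
print(f"${per_request:.4f} per request, ${per_day:,.0f} per day")
```

Input and output tokens are priced separately here because the sidebar notes that providers bill both; whether the two rates differ is an assumption of this sketch.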

Historically, hardware progress was evaluated primarily through performance metrics such as floating-point operations per second (FLOPS).² Increasingly, however, AI infrastructure competitiveness may depend on how efficiently systems convert power, memory bandwidth, networking, and silicon into inference, making cost per token and energy per token the defining measures of AI economics (see sidebar “What is a token?”).

A parallel dimension of inference economics

While cost per token determines whether inference can be commercially viable, energy per token determines whether it can be physically scaled. Energy intensity has already declined significantly as hardware, inference engines, and model optimization have improved. Even so, it varies materially by model size, workload type, and inference stage. Compute-bound prefill can require much more energy than decode, while larger models add memory-traffic and cache-bandwidth penalties.

In our analysis, most innovations that reduced cost per token also reduced joules per token, because higher throughput and better utilization allow the same infrastructure and power draw to generate more output tokens. The technology levers that most significantly reduce cost per token, such as model optimization, advanced packaging, custom silicon, and co-packaged optics (CPO), are also critical for easing data center power constraints. As inference volumes grow, reducing energy per token will also become an essential measure of AI infrastructure performance.
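The link between throughput and energy per token can be sketched with simple arithmetic: a watt is one joule per second, so energy per token is power draw divided by tokens per second. The power and throughput figures below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def joules_per_token(power_watts: float, tokens_per_second: float) -> float:
    """Energy per output token; a watt is one joule per second."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return power_watts / tokens_per_second


# Hypothetical system: same 10 kW power draw, throughput raised fourfold.
baseline = joules_per_token(power_watts=10_000, tokens_per_second=5_000)
improved = joules_per_token(power_watts=10_000, tokens_per_second=20_000)
print(f"{baseline:.2f} J/token -> {improved:.2f} J/token")
```

Under this simplification, any lever that raises throughput at constant power lowers joules per token by the same factor.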

While cost will affect adoption rates, it is also important to consider energy per token, which affects scaling (see sidebar “A parallel dimension of inference economics”).

Until novel compute methodologies mature, inference at scale will remain economically challenging for many AI applications. Across most enterprise software use cases, including those for chatbots, voice, and image generation, achieving sustainable margins would require inference costs to decline by several multiples. This unfavorable cost structure reflects the combined impact of expensive hardware, rising energy demands, memory bandwidth constraints, software inefficiencies, and growing operational complexity.

While most AI applications are unprofitable, a few are beginning to approach economic viability. Coding assistants and agentic AI systems, for example, generate higher value per token and measurable productivity gains that compound over time. Still, margins in these categories remain significantly below investors’ and operators’ expectations. Meanwhile, enterprise customers are already concerned about rising token budgets.

Methodology: The cost-per-token framework

We quantified the impact of each innovation by examining two sets of multipliers (exhibit):

cost multipliers, which capture changes to the cost structure, including those related to compute hardware, memory, network and infrastructure, power, and software
throughput multipliers, which capture changes to tokens processed per second, including those related to hardware utilization and the efficiency of compute, memory, and networking

The impact of each innovation is based on the reduction in cost per token.

We applied these multipliers to the baseline cost per token to produce the innovation-adjusted cost.¹ When estimating baseline costs, we considered hardware asset life, GPU utilization, power costs, and innovation timelines. For each variable, we made several assumptions (for instance, an asset life of four to six years for GPU-class accelerators). We estimated the impact of each variable in isolation, although codeployment may compound the benefits.
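The framework can be sketched in a few lines. The exact formula is not given here, so the form below is an assumption: adjusted cost per token equals the baseline times the cost multiplier, divided by the throughput multiplier. The `Lever` class, function names, and example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lever:
    name: str
    cost_multiplier: float        # change to the cost base; below 1.0 means cheaper
    throughput_multiplier: float  # change to tokens per second; above 1.0 means faster


def adjusted_cost_per_token(baseline: float, lever: Lever) -> float:
    """Assumed form: cost per token scales with cost and inversely with throughput."""
    if lever.throughput_multiplier <= 0:
        raise ValueError("throughput_multiplier must be positive")
    return baseline * lever.cost_multiplier / lever.throughput_multiplier


def reduction_pct(baseline: float, adjusted: float) -> float:
    """Percentage reduction in cost per token relative to the baseline."""
    return 100.0 * (1.0 - adjusted / baseline)


# Hypothetical lever: hardware costs 20% more but delivers six times the throughput.
example = Lever("hypothetical lever", cost_multiplier=1.2, throughput_multiplier=6.0)
baseline = 1.0  # normalized baseline cost per token
adjusted = adjusted_cost_per_token(baseline, example)
print(f"{reduction_pct(baseline, adjusted):.0f}% lower cost per token")
```

Separating the two multipliers makes it visible that a lever can raise the cost base and still lower cost per token, provided throughput rises faster.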

We analyzed the 13 technology levers listed below (table), covering multiple parts of the AI supply chain, to determine the extent and timing of their impact (see sidebar “Methodology: The cost-per-token framework”). Some innovations in these levers are already available, while others are still in pilots.

Technology levers examined in our analysis
Lever	Impact drivers
*Compute and memory*
Custom silicon (inference ASICs)	Uses hardware and tightly integrated software stacks custom built for specific AI tasks
Advanced nodes (< 2 nm, entering Angstrom level)	Improves transistor efficiency, allowing the same work to be done with less power
Advanced packaging (beyond 2.5 dimension)	Puts more memory closer to compute within the device and widens the on-package connections
Chiplet-based AI accelerators	Scales memory and compute without using one giant, monolithic die; instead, splits them into smaller dies
High bandwidth memory evolution (HBM4 and beyond)	Increases memory bandwidth and capacity so models are less constrained
Memory pooling/disaggregation (Compute Express Link)	Allows memory to scale separately from compute so capacity is properly allocated
Processing in memory	Runs selected operations in/near memory so less data is moved back and forth
*Networking*
Scale-up interconnect (integrated racks, open fabric)	Extends how many accelerators can work together before hitting slower network boundaries
Scale-out fabrics (800 Gb/second or more)	Reduces network overhead of training clusters through faster links and better congestion control
Co-packaged optics	Lowers power and complexity of moving data at very high speeds over longer distances
*Software*
Inference engine (workload orchestrator)	Packs requests efficiently and manages context memory for less capacity waste per request
Model optimization (eg, low precision, pruning)	Reduces the work and data moved per operation through software techniques
Compiler stack for AI accelerators	Automates mapping of model code to hardware, reducing manual tuning

Applying the cost-per-token framework to the 13 technology levers reveals four with the highest potential to reduce inference costs: model optimization, advanced packaging, custom silicon, and co-packaged optics (CPO). Together, they represent the highest-priority areas for technology and investment focus, and highlight where value may accrue across the semiconductor supply chain—from electronic design automation (EDA) software and IP providers to foundries, advanced packaging players, memory suppliers, optical component manufacturers, and equipment vendors.

By decreasing the cost of inference, these levers will not only improve enterprise AI value but expand potential AI use cases. Companies may maximize this value if they adjust their operating models to support AI.

The technology levers with the greatest impact

Our analysis reveals that the technology levers with the most impact can each drive a sustained downward cost trajectory, with the top ones approaching an order-of-magnitude reduction—a tenfold decrease—in cost per token. When these solutions are combined over time, inference costs could drop by up to two orders of magnitude.
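To relate percentage reductions to orders of magnitude, the sketch below converts a reduction into a fold improvement and combines several reductions. It assumes each lever applies independently to whatever cost remains, which is a simplification; real deployments may compound less cleanly.

```python
import math


def fold_improvement(reduction_pct: float) -> float:
    """Convert a percentage cost reduction into an 'N times cheaper' factor."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


def combined_reduction_pct(reductions: list[float]) -> float:
    """Combine reductions, assuming each applies independently to the remaining cost."""
    remaining = math.prod(1.0 - r / 100.0 for r in reductions)
    return 100.0 * (1.0 - remaining)


# A 90% reduction is a tenfold decrease; two independent 90% reductions
# leave 1% of the original cost, which is two orders of magnitude.
print(round(fold_improvement(90), 2))
print(round(combined_reduction_pct([90, 90]), 2))
```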

This trajectory could fundamentally reshape the market. A tenfold reduction expands enterprise AI from high-value, niche use cases to everyday workflows; even greater reductions could enable high-volume applications that are now unfeasible because they have negative unit margins. The breadth of future AI adoption will depend as much on these infrastructure economics as on advances in raw-model capability. For operators and investors, tracking the specific innovations that reduce cost per token will be just as critical as measuring frontier-model intelligence.

Model optimization: The most powerful near-term lever

Model optimization encompasses a wide range of techniques that are applied directly to trained neural networks to reduce computational and memory requirements while preserving output quality. Two common techniques are quantization and pruning. In our analysis, model optimization produced the greatest near-term impact.

Quantization

Quantization reduces the numerical precision of model weights and activations. High-precision formats, such as 32-bit floating point (FP32) or 16-bit floating point (FP16), are converted into lower-precision formats, such as 8-bit integers (INT8), 4-bit integers (INT4), or even binary representations. The lower-precision weights occupy less memory, enabling hardware to process larger batch sizes more efficiently and increasing utilization.
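A minimal sketch of one common scheme, symmetric per-tensor INT8 quantization, shows the basic mechanics. This is an illustrative choice, not necessarily the scheme used in the analysis; production systems typically use per-channel or per-group scales and calibration data. The function names are hypothetical.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step, then clamp to the INT8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.50, -1.27, 0.003, 1.00]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)         # integer codes, one byte each instead of two (FP16) or four (FP32)
print(restored)  # close to the originals, within half a quantization step
```

The very small weight (0.003) rounds to zero, which illustrates the quality trade-off: precision is lost first on values that are small relative to the tensor's largest magnitude.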

The impact of quantization depends on the stage of inference to which it is applied. During the prefill phase, when the model reads the prompt, performance is mainly limited by compute. During the decode phase, when models generate tokens one by one, performance is mostly limited by how fast data can move in and out of memory. In real-world workloads, where decoding typically dominates, quantization can directly address memory bandwidth limitations. For example, a shift from FP16 to INT4 can improve throughput on inference workloads two- to four-fold, while lowering cost per token by enabling the hardware to handle more queries simultaneously.³
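A back-of-the-envelope model shows why lower precision helps in the memory-bound decode phase. It assumes that, for a single sequence, every weight is read from memory once per generated token, and it ignores the KV cache, activations, and dequantization overhead. The model size and bandwidth are hypothetical.

```python
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def weight_bytes(num_params: float, fmt: str) -> float:
    """Memory footprint of the weights alone, in bytes."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8


def decode_tokens_per_second_bound(num_params: float, fmt: str,
                                   bandwidth_bytes_per_s: float) -> float:
    """Upper bound for one sequence: every weight is read once per generated token."""
    return bandwidth_bytes_per_s / weight_bytes(num_params, fmt)


PARAMS = 70e9     # hypothetical model size (parameters)
BANDWIDTH = 3e12  # hypothetical memory bandwidth (bytes per second)

fp16 = decode_tokens_per_second_bound(PARAMS, "FP16", BANDWIDTH)
int4 = decode_tokens_per_second_bound(PARAMS, "INT4", BANDWIDTH)
print(f"FP16: {fp16:.1f} tokens/s, INT4: {int4:.1f} tokens/s, ratio {int4 / fp16:.1f}x")
```

The fourfold ratio is an idealized ceiling. The factors this sketch ignores are among the reasons measured gains tend to fall at or below that bound.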

Pruning

Pruning reduces model size by eliminating weights that contribute only marginally to model output. Structured pruning removes entire components, such as attention heads or feedforward layers (one-way processing components inside a transformer’s neural network). It creates simpler, smaller models and avoids the irregular sparsity patterns (zero weights scattered randomly) that are harder for hardware to process efficiently. Pruned models require less memory and fewer operations per inference pass, reducing both memory bandwidth and compute requirements.

The efficiency gains from pruning are highly workload-dependent, with disproportional benefits going to tasks that do not require a large model’s full representational capacity to capture complex patterns, nuances, and relationships in data. Combined with quantization, pruning can amplify efficiency gains, enabling significantly higher hardware utilization than baseline deployments.
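The difference between unstructured and structured pruning can be sketched on a tiny weight matrix. Magnitude and row-norm criteria are used here as simple, commonly used stand-ins for "contributes marginally"; they are assumptions of this sketch, and the function names are hypothetical.

```python
def magnitude_prune(weights: list[list[float]], sparsity: float) -> list[list[float]]:
    """Unstructured pruning: zero the smallest-magnitude fraction of weights.

    Ties at the threshold are all zeroed, so slightly more than the
    requested fraction may be pruned.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def structured_prune_rows(weights: list[list[float]], keep: int) -> list[list[float]]:
    """Structured pruning: keep only the `keep` rows with the largest squared L2 norm."""
    norms = [sum(w * w for w in row) for row in weights]
    ranked = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)[:keep]
    return [weights[i][:] for i in sorted(ranked)]  # preserve original row order


W = [[0.9, -0.1, 0.4],
     [0.05, 0.02, -0.01],
     [-0.7, 0.3, 0.6]]

sparse = magnitude_prune(W, sparsity=0.5)   # same shape, zeros scattered inside
smaller = structured_prune_rows(W, keep=2)  # a genuinely smaller dense matrix
print(sparse)
print(smaller)
```

The unstructured result keeps its original shape, so hardware needs sparse-aware kernels to benefit from it, while the structured result is a smaller dense matrix that standard kernels can process directly.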

Combined impact on inference economics

In our analysis of various quantization and pruning configurations, model optimization reduced cost per token by 85 to 95 percent, with results varying based on the precision level achieved and the workload’s tolerance for minor quality trade-offs. At the upper end of this range, quantized and pruned models running on current-generation GPUs can reduce token costs enough to make high-volume enterprise applications commercially viable without requiring hardware upgrades.

These gains are possible because, as research suggests, only a small fraction of a dense transformer’s parameters contribute meaningfully to each token, meaning much of the compute in a typical inference pass is effectively wasted. While algorithmic improvements are the most powerful near-term lever for improving efficiency, additional hardware innovations are required to reduce costs further and address the physical bottlenecks related to memory and data movement along the supply chain.

3D advanced packaging: Structural efficiency gains

Our analysis projects an 80 to 90 percent reduction in cost per token from 3D advanced packaging, provided that large-scale chip production involves hybrid bonding, also known as direct bond interconnect (DBI), rather than the traditional microbump-and-underfill 3D process.

In hybrid bonding, chips are directly joined through dense, high-speed copper-to-copper and oxide-to-oxide connections. Flat copper pads can be embedded in the insulating dielectric layer with no underfill required. Bond-pad density is far higher with hybrid bonding than with bumps, significantly reducing the energy required to move data between chips. Hybrid bonding also enables thinner stacks, making more layers possible, and denser interconnects.

The DBI competitive landscape is concentrated around the same core technology, developed by Adeia, formerly Xperi. Licensees include Intel, Micron, Samsung, SK hynix, Sony, and TSMC. Consequently, differentiation depends on how each company integrates the technology into its manufacturing processes.

The drivers of 3D advanced packaging have evolved. Initially, most demand stemmed from memory-bandwidth requirements for training models, but inference is now the dominant volume driver. Memory bandwidth continues to constrain inference performance and is a major cost driver. Hybrid bonding helps address both challenges by bringing memory closer to compute. The technique vertically stacks dies and connects them through dense direct interconnects alongside through-silicon vias (TSVs). This arrangement shortens communication paths, increases bandwidth density, and enables more compact, cost-efficient system designs.

Within 3D memory-logic integration, two architectures are possible: memory-on-logic and logic-on-memory. Memory-on-logic, which places memory on top of the logic die, is currently the preferred approach and is already used in commercial products such as AMD’s V-Cache on TSMC’s SoIC platform. Positioning the larger logic die at the bottom simplifies manufacturing and improves yields. Logic-on-memory offers superior thermal performance because the heat-generating logic sits closer to the heat sink, but it requires advanced backside power delivery technologies such as Intel’s PowerVia. As a result, adoption is expected later and likely limited initially to the most demanding AI and high-performance computing applications.

Hybrid bonding may also be introduced within high-bandwidth memory stacks, although the timing remains uncertain. Some industry observers expect adoption as early as HBM4 or HBM4E, particularly in 16-high configurations, which stack 16 DRAM dies vertically in a single memory package. At those stack heights, scaling conventional microbump interconnects becomes increasingly difficult, potentially making hybrid bonding a more attractive solution.

The broader industry consensus, however, is that large-scale adoption of hybrid bonding is more likely in HBM5 or later. The transition will depend on whether hybrid bonding can deliver superior yield and cost economics relative to current approaches based on thermo-compression bonding (TCB) and non-conductive film (NCF), an insulating material used to prevent electrical shorting.

Even with the current momentum, hybrid-bonding adoption remains constrained by manufacturing realities. Yields vary significantly by product: Simpler two-layer stacks such as 3D V-Cache already have solid production yields when this technology is applied; more complex 12- and 16-high HBM stacks remain challenging and have not yet been produced at scale with high yields. Capacity for hybrid-bonding equipment is also limited. Consequently, scaling hybrid bonding will require not only continued technical progress but also significant expansion of advanced packaging capacity across the semiconductor ecosystem.
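One way to see why taller stacks are harder to yield is a simple compounding model. It assumes that every die and every bonding step must succeed independently, which is a deliberate simplification, and the 99 percent per-step yield is a hypothetical figure rather than industry data.

```python
def stack_yield(layers: int, bond_step_yield: float, die_yield: float = 1.0) -> float:
    """Yield of a stack if every die and every bonding step must succeed independently."""
    if layers < 1:
        raise ValueError("layers must be at least 1")
    # A stack of n dies needs n good dies and n - 1 successful bonding steps.
    return (die_yield ** layers) * (bond_step_yield ** (layers - 1))


# Hypothetical 99% yield per bonding step, with known-good dies assumed.
for layers in (2, 12, 16):
    print(f"{layers:>2}-high stack: {stack_yield(layers, 0.99):.3f}")
```

Even a small loss per bonding step compounds across 12 or 16 layers, so per-step yield has to be far higher for tall stacks than for two-layer products to reach comparable overall yield.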

Custom silicon: Matching hardware to AI workloads

As model architectures grow more diverse and inference workload characteristics become more defined, hardware is being codesigned with the models it will run. Rather than general-purpose accelerators optimized for broad matrix multiplication operations, codesigned hardware is optimized for specific memory access patterns and sparsity profiles (how much of a model is run during computations and what parts are used). Codesigned hardware also considers parallelism structures, or how the layers of a model are divided among different devices.

In practice, the shift toward codesign reflects a deeper economic imperative. As AI inference becomes more limited by how fast data can be moved (memory bandwidth), sparsity, and the specific way each workload runs, general-purpose hardware becomes less cost efficient. Aligning silicon design to model architecture enables step-change improvements in utilization, throughput, and cost per token, making codesign a primary lever for closing the inference economics gap.

ASICs designed for AI inference can achieve higher efficiency than general-purpose GPU architectures on targeted workloads. These custom ASICs, which include AWS’s Trainium and Inferentia, Google’s Tensor Processing Unit (TPU), and a growing number of offerings from accelerator start-ups, are optimized for the core AI workloads that dominate inference, including matrix multiplications and attention operations. Our analysis indicates that chips customized for inference can potentially reduce cost per token by 70 to 80 percent for the workloads they are designed to handle.

In many cases, ASICs are developed in conjunction with AI models. For instance, when creating a new generation of TPUs, Google tailors the memory hierarchy, systolic array architecture, on-chip interconnect bandwidth, and other features to suit the training and inference requirements of its Gemini model. This tight hardware–software codesign enables higher inference efficiency than running Gemini on a general-purpose GPU.

As hardware becomes specialized for different phases of the workload, the codesign dynamic is increasingly visible within inference itself. Some vendors have already optimized their hardware for different stages of inference. For example, NVIDIA GPUs are well suited to the compute-intensive prefill phase, while Groq’s LPU is designed for the memory-bandwidth-intensive decode phase. A more advanced version of this approach was announced by SambaNova and Intel in April 2026. Their architecture uses GPUs from any vendor for prefill, SambaNova’s specialized processors for decoding, and Intel Xeon 6 CPUs to coordinate execution and manage the software layer that supports agentic AI workloads.

One caveat is that technical advantages arising from codesigning do not automatically translate into widespread market adoption. NVIDIA’s success, for example, stems not only from the performance of its GPUs but also from the strength of its CUDA ecosystem, including its mature libraries and large developer community.

Over the long term, hyperscalers that codesign hardware and software may continue to achieve meaningful performance and efficiency gains, but broader adoption of their architectures will depend as much on the maturity of the surrounding ecosystem as on the capabilities of the underlying hardware. This dynamic could shift value creation away from stand-alone chip performance and toward integrated hardware-software platforms, creating advantages for companies that control larger portions of the stack.

Alongside ASIC adoption, programmable logic is also expected to play an important role. Field-programmable gate arrays provide flexibility for rapidly evolving AI workloads and may be used as companion devices for networking, memory expansion, and system-level acceleration, particularly in heterogeneous inference architectures where adaptability remains important.

Co-packaged optics: The bandwidth wildcard

Co-packaged optics (CPO) could ultimately deliver the largest long-term cost reductions among emerging AI infrastructure innovations. This technology integrates optical engines directly into a switch or accelerator package, replacing traditional pluggable optical modules located outside the chip package.

While many infrastructure innovations focus on improving compute or memory efficiency, CPO targets a broader and increasingly critical bottleneck: data movement across chips, servers, racks, and data center campuses. As scale-up domains expand beyond a single rack and scale-out fabrics connect distributed clusters, interconnect bandwidth and energy efficiency are becoming major constraints on inference economics.

The need for CPO is being driven by both architectural and physical realities. Mixture-of-experts (MoE) models, disaggregated inference architectures, and larger machine clusters are all increasing the volume of data that must move across AI systems. At the same time, copper-based electrical interconnects are approaching their practical limits. As lane rates increase from 224G (gigabits per second) to 448G and beyond, the distance over which copper can reliably transmit signals declines sharply. At future speeds of 1.6T (terabits per second) and above—levels likely required by next-generation inference systems—copper is no longer a viable medium for meaningful interconnect distances.

By moving optical connectivity directly onto the package, CPO reduces signal degradation and improves power efficiency. Industry participants estimate that CPO can reduce energy consumption per transmitted bit by approximately 50 to 65 percent relative to conventional pluggable optics while significantly increasing bandwidth density at the package edge. These improvements are increasingly necessary to support scale-up domains spanning multiple racks.
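The effect of lower energy per bit on interconnect power follows from a simple identity: power equals energy per bit times bits per second. In the sketch below, the switch bandwidth and the pluggable-optics baseline are hypothetical placeholders; only the 50 to 65 percent reduction range comes from the estimate above.

```python
def optics_power_watts(bandwidth_bits_per_s: float, picojoules_per_bit: float) -> float:
    """Power = energy per bit x bits per second (1 pJ = 1e-12 J)."""
    return bandwidth_bits_per_s * picojoules_per_bit * 1e-12


SWITCH_BANDWIDTH = 51.2e12   # hypothetical switch bandwidth: 51.2 Tb/s
PLUGGABLE_PJ_PER_BIT = 15.0  # hypothetical baseline for pluggable optics

pluggable = optics_power_watts(SWITCH_BANDWIDTH, PLUGGABLE_PJ_PER_BIT)
for reduction in (0.50, 0.65):  # reduction range cited in the text
    cpo = optics_power_watts(SWITCH_BANDWIDTH, PLUGGABLE_PJ_PER_BIT * (1 - reduction))
    print(f"{reduction:.0%} less energy per bit: {pluggable:.0f} W -> {cpo:.0f} W per switch")
```

Because the relationship is linear, the same percentage saving applies at any bandwidth, and the absolute watts saved grow with every increase in link speed.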

CPO also enables denser and more efficient network architectures. Because optical engines occupy less space than pluggable transceiver cages, switch designers can increase radix and create flatter network topologies with fewer switching layers and more direct connections between accelerators. The result is lower hop counts, improved bandwidth utilization, and reduced network complexity. Beyond scale-up domains, CPO-based switches can also improve scale-out fabrics connecting racks, rows, buildings, and campus-scale AI factories, enabling higher throughput over distances where copper becomes increasingly power-hungry and unreliable.
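The relationship between switch radix and network depth can be sketched with the standard capacity formula for a non-blocking folded-Clos (fat-tree) network, in which a network of t tiers built from radix-k switches connects up to 2 × (k/2)^t endpoints. The radix values and cluster size below are hypothetical, and real fabrics often use oversubscription or other topologies.

```python
def max_endpoints(radix: int, tiers: int) -> int:
    """Endpoints supported by a non-blocking folded-Clos network of radix-k switches."""
    if radix < 4 or radix % 2:
        raise ValueError("radix must be an even number of at least 4")
    if tiers < 1:
        raise ValueError("tiers must be at least 1")
    return 2 * (radix // 2) ** tiers


def tiers_needed(radix: int, endpoints: int) -> int:
    """Smallest number of switching tiers that can connect the given endpoints."""
    tiers = 1
    while max_endpoints(radix, tiers) < endpoints:
        tiers += 1
    return tiers


# Hypothetical cluster of 8,192 accelerators.
for radix in (64, 128):
    print(f"radix {radix}: {tiers_needed(radix, 8192)} tiers")
```

In this example, doubling the radix removes a full switching layer, which is the mechanism behind the lower hop counts and flatter topologies described above.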

Industry activity increasingly reflects the strategic importance of this transition:

Broadcom has developed and tested prototype systems that use CPO for scale-up connections.
NVIDIA has incorporated CPO-enabled scale-up and scale-out architectures into its road map and recently invested approximately $2 billion each in optical technology providers Lumentum and Coherent.
Ecosystem coordination is also accelerating through the Open Co-Packaged Optics Multi-Source Agreement (MSA), whose members include companies such as Ciena, Coherent, Marvell, Samtec, and Terahop.
Investor confidence remains strong, with Ayar Labs raising approximately $500 million and Lightmatter approximately $400 million in recent funding.

Despite the momentum, most hyperscalers are currently deploying linear-drive pluggable optics (LPO) as a bridge technology. LPO captures some of the efficiency benefits of optical connectivity while avoiding many of the manufacturing, serviceability, and interoperability challenges that still accompany CPO. Adoption is therefore likely to be phased, beginning in the highest-bandwidth and most latency-sensitive clusters before expanding more broadly as the ecosystem matures.

Other considerations: Architecture evolution and chip development timelines

Beyond technology innovations, two additional considerations will shape how quickly—and how fully—companies can reduce inference costs. Both carry significant disruptive potential and can shift cost structures, hardware requirements, and competitive dynamics.

First, model design is evolving in ways that reshape inference costs. Advances such as MoE models, state space models (SSMs), and emerging non-autoregressive approaches reduce the underlying compute and memory required at inference time. These gains add to those from post-training techniques, such as quantization and pruning. Optimizing hardware for the prefill phase (which is compute-intensive) and the decode phase (which is limited by memory bandwidth) can also reduce cost per token.
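A short calculation illustrates why MoE models reduce compute at inference time: only the experts selected for a token are run, so the active parameter count is much smaller than the total. The configuration below is hypothetical and does not describe any specific model.

```python
def moe_param_counts(shared_params: float, params_per_expert: float,
                     num_experts: int, top_k: int) -> tuple[float, float]:
    """Return (total, active-per-token) parameter counts for a simple MoE layout."""
    if not 1 <= top_k <= num_experts:
        raise ValueError("top_k must be between 1 and num_experts")
    total = shared_params + num_experts * params_per_expert
    # Each token uses the shared layers plus only its top_k routed experts.
    active = shared_params + top_k * params_per_expert
    return total, active


# Hypothetical configuration: 10B shared parameters, 64 experts of 2B each, top-4 routing.
total, active = moe_param_counts(10e9, 2e9, num_experts=64, top_k=4)
print(f"total {total / 1e9:.0f}B, active per token {active / 1e9:.0f}B "
      f"({active / total:.0%} of the model)")
```

All experts must still be stored in memory, so this design lowers compute per token more than it lowers memory capacity requirements.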

These compounding algorithmic improvements are reducing inference costs, and the decline has geopolitical implications because it lessens reliance on process node advances. This shift could benefit countries that have mature node capacity. For instance, China’s ability to produce 7-nanometer (nm) and larger-node chips at much lower cost means that highly optimized models running on mature hardware could make inference workloads economically feasible, creating a competitive advantage.

The second shift relates to hardware evolution. New model architectures can emerge in months, but designing and manufacturing chips takes years. This mismatch creates a risk that available hardware is not well suited to current workloads, leading to underused capacity and wasted investment. Closing this gap will require faster design cycles through approaches such as chiplet architectures, better tooling, and AI-assisted design (including LLM-generated kernels that shorten the period required to get software running on new architectures). Developers must also create more flexible hardware that can quickly adapt to changing model needs.

Beyond 2030: Emerging compute paradigms

The innovations assessed in this article are built on complementary metal oxide semiconductor (CMOS)-based silicon architectures. If companies adhere to this standard, they are likely to achieve improvements of only one or two orders of magnitude in cost and performance, with the gains diminishing over time. To go beyond these limits, researchers are exploring more fundamental shifts in compute aimed at relieving different bottlenecks in the AI compute stack. There are multiple vectors in physics that could provide solutions, many of which are worthy of much deeper evaluation. Three examples follow:

Photonic computing. This technique, which is drawing research attention in the semiconductor industry, uses light instead of electrons to perform computations. It can move data at extremely high speeds with very low energy loss compared to conventional architectures. Different wavelengths of light can carry data simultaneously, allowing more work to be done in parallel than traditional electrical connections. Early prototypes already show clear efficiency benefits for certain tasks, although challenges remain around precision, flexibility, and integration with existing electronic memory and control components.
Neuromorphic architectures. These systems mimic brain activity, computing only when there is a signal, rather than continuously cycling through operations. This event-driven approach reduces energy costs for inference. Researchers are still exploring whether neuromorphic architectures are well suited to the MoE-dominated, large-batch inference models that hyperscalers use. It is possible that they may first be used in edge inference or specialized sensing applications such as surveillance, gesture recognition, obstacle avoidance, and automotive sensing.
Quantum computing. This approach represents a more radical shift in how computation is done. Qubits, the basic units of quantum computing, can exist in a superposition of 0 and 1, allowing a quantum computer to explore many possibilities in parallel. The promise of quantum computing lies in tackling certain tasks, such as optimizing solutions, generating representative examples from a complicated set of possibilities, or simulating chemical reactions, that are extremely difficult for today’s computers. Whether quantum computing can meaningfully accelerate the matrix-heavy calculations used in neural network inference is still an open question. Building practical, large-scale quantum hardware is also challenging because of issues related to qubit stability (how long qubits can retain their quantum state), error correction, and system integration.

Across these emerging architectures, scalability challenges persist. To test and validate performance, each approach needs a new system-level environment that includes libraries, specialized compilers, and tailored frameworks to run workloads effectively. These tools are still in early development for most of the new technologies. How quickly these software stacks mature will be just as important for adoption as advances in the hardware itself.

The next phase of AI competition will be shaped as much by economics as by model capability. While advances in intelligence will continue, adoption will increasingly depend on how efficiently that intelligence can be delivered. For investors and industry leaders along the supply chain, the key question is not only which technologies lower the cost of inference, but also which companies control the capabilities required to deliver those reductions at scale. As AI systems grow more complex, value creation may increasingly shift beyond compute toward the software, packaging, memory, networking, optics, and manufacturing technologies that make efficient inference possible, reshaping both industry economics and competitive advantage.
