Systems on a chip comes to the data center

By Stephen Simpson

For decades, hardware architectures have used a combination of central processing units (CPUs), memory, external storage, and network in a uniform way. Since innovation required substantial investment, the result has been restrictive commoditization, with chip manufacturers lacking any incentives to provide bespoke solutions to industries or lay out specific use cases.

Similarly, the chip industry has become significantly more homogenized as companies have moved toward standardized architecture and chip fabrication. This trend has meant that cloud suppliers are increasingly encountering challenges in central processing power and bottlenecks from network communications latency.

Recently, however, data-center computer manufacturers and even several retail computer manufacturers have announced plans to produce their own chips that integrate formerly separate systems on a single chip. Bespoke chips and circuitry, often built using the latest technology (such as an advanced five-nanometer process), have demonstrated significant advantages in performance and power consumption.

By taking a systems-on-a-chip (SoC) approach, these manufacturers can now optimize performance, cost, and power consumption simultaneously by tailoring the electronics to the needs of the business and optimizing the design for specific calculations. Several industries have improved performance by 15 to 20 percent while significantly reducing cost and production time for the fabrication of bespoke chips. This is especially true for focused use cases, such as bitcoin mining or large-scale optimization, where numerous problems of a similar nature need to be solved in parallel.

To capture this value, companies should understand the implications and trade-offs before they incorporate SoC offerings into their operations.

Navigating design and production

Currently, companies can engage specialty vendors to design bespoke chips and then send them out to a foundry in East Asia to be manufactured at scale. The use of modern advanced analytics techniques, such as neural networks with reinforcement learning, can optimize chip floor-plan characteristics, such as the weighted average of the total wire length and the density and congestion of SoC circuitry.
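To make the floor-plan objective concrete, the following is a minimal sketch of a cost function built as a weighted average of wire length, density, and congestion. The class name, the normalization of each metric to a 0-to-1 scale, and the weights are all illustrative assumptions rather than details of any vendor's tooling. A reinforcement-learning agent could use the negative of this cost as its reward.

```python
from dataclasses import dataclass


@dataclass
class FloorPlanMetrics:
    """Hypothetical summary metrics for one candidate floor plan."""
    wire_length: float  # total wire length, assumed normalized to [0, 1]
    density: float      # placement density, assumed normalized to [0, 1]
    congestion: float   # routing congestion, assumed normalized to [0, 1]


def floor_plan_cost(m, w_wire=1.0, w_density=0.5, w_congestion=0.5):
    """Weighted average of the three metrics; lower is better.

    The default weights are illustrative assumptions only.
    """
    total_weight = w_wire + w_density + w_congestion
    weighted_sum = (w_wire * m.wire_length
                    + w_density * m.density
                    + w_congestion * m.congestion)
    return weighted_sum / total_weight


def reward(m):
    """Reward for a learning agent: the negative of the cost."""
    return -floor_plan_cost(m)


def best_floor_plan(candidates):
    """Return the candidate floor plan with the lowest cost."""
    return min(candidates, key=floor_plan_cost)
```

Changing the weights shifts the trade-off: a larger weight on congestion, for example, favors layouts that are easier to route even if their wires are longer.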

An important challenge for the company commissioning the chip is often a legal one: it requires the identification, assembly, and licensing of all the patented technologies needed for a composite design. Because such a design combines different vendors’ intellectual property, care must also be taken to ensure that the integration of this hardware does not significantly hinder performance. Today, these lower-cost design alternatives are licensed predominantly on the ARM architecture, with the RISC-V open-standard instruction set emerging as a viable alternative.

Data-center servers

Both cloud vendors and specialized CPU and computer-hardware providers are taking advantage of this SoC approach for their traditional servers. By innovating quickly, they are starting to enjoy significant success. In certain areas, ARM-based designs are nibbling away at the dominant position of x86/x64 processors. Several hyperscalers have announced their move to proprietary chip designs in their data centers. New server designs typically run at slightly lower clock speeds than traditional systems to achieve significant energy savings and allow more cores to be packed together. An SoC approach can accommodate at least 160 cores per server, together with state-of-the-art network, memory, and connectivity options. As an example, servers with up to four terabytes of memory and 128 lanes with a bandwidth of two gigabytes per second each are running in cloud data centers today.
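As a rough illustration of what these figures imply, the sketch below works out the aggregate lane bandwidth and how long it would take to stream the full memory contents at that rate. It assumes decimal units (1 terabyte = 1,000 gigabytes) and ignores protocol overhead, so the result is a theoretical upper bound rather than a measured value. The function names are hypothetical.

```python
LANES = 128                 # lanes per server, from the example above
GB_PER_SECOND_PER_LANE = 2  # bandwidth of each lane
MEMORY_TB = 4               # memory per server
GB_PER_TB = 1000            # assumption: decimal units


def aggregate_bandwidth_gb_per_s(lanes, gb_per_s_per_lane):
    """Peak combined bandwidth if every lane is fully used."""
    return lanes * gb_per_s_per_lane


def seconds_to_stream_memory(memory_tb, bandwidth_gb_per_s):
    """Time to move the whole memory once at the given bandwidth."""
    return memory_tb * GB_PER_TB / bandwidth_gb_per_s


bandwidth = aggregate_bandwidth_gb_per_s(LANES, GB_PER_SECOND_PER_LANE)
stream_time = seconds_to_stream_memory(MEMORY_TB, bandwidth)
print(f"Aggregate bandwidth: {bandwidth} GB/s")
print(f"Time to stream {MEMORY_TB} TB: {stream_time} s")
```

Under these assumptions the lanes provide 256 gigabytes per second in total, enough to move four terabytes in roughly 16 seconds.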

This shows that solutions to the aforementioned problems in optimization and mining have come a long way, from theoretical applications of focused chip design to key industry problems solved by dedicated machinery. Similarly, this represents a major new development for hyperscalers, as they will be required to build dedicated machinery themselves going forward instead of relying purely on traditional vendors to produce the right chips for them.

Accelerating data pipelines

Although exciting innovation is taking place in the core data-center servers, the latency of data-intensive network communication must also be addressed. The emerging trend is to offload network-related responsibilities, including encryption and data loading, to a dedicated processing unit that also provides advanced security and accelerated data-movement capabilities. A range of companies offer technologies that differ considerably in their sophistication and price points. These technologies are usually positioned as SmartNICs (cards that combine wired networking and computational resources to offload tasks from server CPUs) and may be based on field-programmable gate array (FPGA), application-specific integrated circuit (ASIC), or SoC technology. A data-processing unit (DPU), a specific type of SoC, has been employed in several new chip designs.

Some products go further and offer important data-pipeline management capabilities: they significantly reduce network latency, support inline crypto acceleration, enable the highly secure “enclave” isolation of different organizations’ data sets, and feed network data directly into graphics processing units (GPUs) for machine learning predictions. Of course, the capabilities of these devices need to match those of the next-generation servers.

Benefiting from the changes

The market is changing quickly, with cloud vendors increasingly providing capabilities to clients in ways that abstract away the underlying hardware. This is relevant for two key reasons.

First, some of the significant cost savings will be passed on to customers. We have seen vendors claim they can deliver seven times more performance, four times more compute cores, five times faster memory, and two times larger caches, while adding even more choice to help customers optimize performance and cost for their workloads. These promises are indicative of the broader changes in the marketplace. But they also mean that buying three-year up-front reserved instances, rarely a good idea from a financial standpoint, becomes even less appealing. This dynamic adds an extra dimension to the complexity of financial planning for the cloud. Gaining a thorough understanding of the implications of your cloud vendor’s road map is increasingly important. And the same cloud vendor will likely offer more differentiated pricing (much greater than is the case today) because the age of the hardware differs across regional data centers.
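The reserved-instance trade-off can be sketched with simple arithmetic. The example below compares a three-year up-front commitment at a fixed discount with paying on demand while moving to newer, cheaper hardware each year. The hourly rate, the discount, and the annual price drop are purely hypothetical inputs, not figures from any vendor, and the model ignores factors such as the time value of money.

```python
HOURS_PER_YEAR = 8760


def reserved_cost(hourly_rate, discount, years=3):
    """Total cost of committing up front at a fixed discounted rate."""
    return hourly_rate * (1 - discount) * HOURS_PER_YEAR * years


def on_demand_cost(hourly_rate, annual_price_drop, years=3):
    """Total cost when paying as you go and switching hardware yearly.

    Assumes the effective price for the same workload falls by
    `annual_price_drop` at the start of each year after the first.
    """
    total = 0.0
    rate = hourly_rate
    for _ in range(years):
        total += rate * HOURS_PER_YEAR
        rate *= 1 - annual_price_drop
    return total


# Hypothetical inputs: 1.00 per hour on demand, 40 percent reserved discount.
locked_in = reserved_cost(1.00, discount=0.40)
for drop in (0.0, 0.25, 0.50):
    flexible = on_demand_cost(1.00, annual_price_drop=drop)
    cheaper = "on demand" if flexible < locked_in else "reserved"
    print(f"annual price drop {drop:.0%}: cheaper option is {cheaper}")
```

With these assumed inputs, the commitment only loses out once the effective price falls by about half each year. The point is not the specific threshold but that the faster a vendor's hardware improves, the more a long lock-in costs in forgone savings.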

Second, the acceleration of data-pipeline changes means data architecture will also start to evolve, so it makes sense for companies to consider a different approach. Priority tasks include pulling data out of current systems, reducing costs and improving workflow, determining the organization’s data-processing capabilities, handling large amounts of data quickly, and taking advantage of important information-security improvements to ensure adherence to security protocols and regulatory requirements.

Organizations must carefully evaluate their current cloud-deployment architecture and determine how best to harness the new setups that cloud vendors propose based on proprietary hardware. In addition, they should assess the cost and timeline of these contracts to get the most from the new technologies. Since the hardware needed to solve the aforementioned use cases efficiently is now becoming available as a service, leveraging these new services when encountering specific optimization problems will be key. This exercise includes both the design of the actual software and the management of interfaces and data exchange.

Stephen Simpson, based in London, is a senior principal at QuantumBlack, a McKinsey company.