Tens of thousands of H100 GPUs: Google builds top-two-class supercomputers

Google Cloud is growing by a number of new data centers. Google is not doing things by halves and speaks of up to 26 exaflops of AI performance per new system, which corresponds to 26 quintillion (26,000,000,000,000,000,000) operations per second. Customers will be able to rent the computing power via the upcoming A3 instances, for example to train large language models.

In a blog post, Google describes A3 GPU supercomputers that the company is building around the world. Each system uses the same hardware components, scaled to different sizes.

The systems use Nvidia’s H100 GPUs and Intel’s fourth-generation Xeon Scalable processors, also known as Sapphire Rapids. They are apparently based on Nvidia’s DGX H100, so a node should consist of eight H100 accelerators and two Xeon SP CPUs. Nvidia’s NVLink interconnect and the associated NVSwitch chips handle communication between the GPUs, with Google using its own software stack. Custom network processors (Infrastructure Processing Units, IPUs) developed together with Intel take load off the Xeon CPUs.

A Google spokeswoman confirmed to HPCwire that tens of thousands of H100 GPUs will be deployed in the largest A3 data centers: “For our largest customers, we can build A3 supercomputers with up to 26,000 GPUs in a single cluster and are working to build multiple clusters in our largest regions.” However, not every system will get that many GPUs.

At this scale, Google can compete with the fastest supercomputers currently in operation. Frontier, the leader of the current Top500 list, delivers more than one exaflop at FP64 precision with thousands of AMD Epyc processors and Instinct MI250X GPUs.

At FP64 precision, 26,000 H100 GPUs would achieve about 780 petaflops (0.78 exaflops) at best, and the real performance is likely to be lower across such a large network. Added to this would be the computing power of 6,500 Intel CPUs (two processors per eight-GPU node). The 26 exaflops mentioned above apply to lower-precision AI formats such as TensorFloat-32 (TF32) or FP16.
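The arithmetic behind these figures can be checked with a short Python sketch. It uses only the numbers quoted in the article (26,000 GPUs, eight GPUs and two CPUs per node, 780 petaflops FP64, 26 exaflops AI); the function name and the dictionary keys are made up for illustration, and the per-GPU values are simply the quoted totals divided by the GPU count, not vendor specifications.

```python
# Back-of-the-envelope check using only figures quoted in the article.
# Function and key names are illustrative, not part of any Google or Nvidia API.
GPUS_PER_NODE = 8   # DGX H100-style node, as assumed in the article
CPUS_PER_NODE = 2   # two Xeon SP CPUs per node


def cluster_estimate(gpus: int, fp64_peak_flops: int, ai_peak_flops: int) -> dict:
    """Derive node/CPU counts and per-GPU throughput from quoted cluster totals."""
    nodes = gpus // GPUS_PER_NODE
    return {
        "nodes": nodes,
        "cpus": nodes * CPUS_PER_NODE,
        # Quoted totals divided by GPU count (theoretical peak, not measured).
        "fp64_per_gpu_tflops": fp64_peak_flops / gpus / 10**12,
        "ai_per_gpu_pflops": ai_peak_flops / gpus / 10**15,
        # How much larger the AI figure is than the FP64 figure.
        "ai_to_fp64_ratio": ai_peak_flops / fp64_peak_flops,
    }


estimate = cluster_estimate(
    gpus=26_000,
    fp64_peak_flops=780 * 10**15,  # 780 petaflops
    ai_peak_flops=26 * 10**18,     # 26 exaflops
)
for key, value in estimate.items():
    print(f"{key}: {value}")
```

For 26,000 GPUs this yields 3,250 nodes and 6,500 CPUs, roughly 30 teraflops of FP64 and 1 petaflop of AI throughput per GPU. The AI figure is therefore about 33 times the FP64 figure, which is why the two numbers cannot be compared directly with Top500 results.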

As things stand, a fully equipped A3 supercomputer would comfortably take second place in the Top500 list. As a private company, however, Google will probably not carry out the Linpack benchmark run required to appear on the list.


