Nvidia Ethernet Networking to Accelerate xAI Supercomputer

Colossus has been touted as the world’s largest AI supercomputer
NVIDIA Spectrum-X networking platform enables Colossus, xAI’s massive AI supercomputer, to accelerate the training of complex LLMs across 100,000 GPUs

Leading global chipmaker Nvidia has announced that xAI’s Colossus supercomputer is set to double in capacity. 

Both companies have announced that they are in the process of adding 100,000 NVIDIA Hopper GPUs to the cluster in Memphis, Tennessee. They achieved this massive scale by using the NVIDIA Spectrum-X Ethernet networking platform.

The platform from Nvidia is designed to deliver superior performance to multi-tenant hyperscale AI factories using standards-based Ethernet for its Remote Direct Memory Access (RDMA) network.

The supporting facility and state-of-the-art supercomputer were built by xAI and NVIDIA in just 122 days, rather than the years typically required for systems of this size. Training began just 19 days after the first rack rolled onto the floor.

Mission-critical supercomputing

Colossus has been touted as the world’s largest AI supercomputer and is currently being used to train xAI’s Grok family of large language models. Grok chatbots are available as a feature for X Premium subscribers.

Moving forward, xAI is in the process of doubling the size of Colossus to a combined total of 200,000 NVIDIA Hopper GPUs.

While training the large Grok model, Colossus achieved unprecedented network performance. Across all three tiers of the network fabric, the system experienced zero application latency degradation or packet loss due to flow collisions, according to Nvidia and xAI. It also maintained 95% data throughput, enabled by Spectrum-X congestion control.

This level of performance cannot be achieved at scale with standard Ethernet, which creates thousands of flow collisions while delivering only 60% data throughput.
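To put those throughput figures in perspective, a quick back-of-envelope calculation shows the effective bandwidth gap the article describes. The 800Gb/s port speed is taken from the Spectrum SN5600 switch mentioned later in the piece; the comparison itself is illustrative, not a benchmark.

```python
# Illustrative comparison of effective per-port bandwidth, using the
# 800 Gb/s port speed of the Spectrum SN5600 switch cited in the article.
# 95% and 60% are the throughput figures quoted for Spectrum-X versus
# standard Ethernet respectively.

PORT_SPEED_GBPS = 800  # SN5600 maximum port speed

spectrum_x_eff = PORT_SPEED_GBPS * 0.95  # effective Gb/s with Spectrum-X
standard_eff = PORT_SPEED_GBPS * 0.60    # effective Gb/s with standard Ethernet

print(f"Spectrum-X effective:        {spectrum_x_eff:.0f} Gb/s")
print(f"Standard Ethernet effective: {standard_eff:.0f} Gb/s")
print(f"Relative advantage:          {spectrum_x_eff / standard_eff:.2f}x")
```

At cluster scale, that roughly 1.6x difference in usable bandwidth compounds across every link in the fabric, which is why the throughput percentage matters so much for training time.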

“AI is becoming mission-critical and requires increased performance, security, scalability and cost-efficiency,” says Gilad Shainer, Senior Vice President of Networking at NVIDIA. 


“The NVIDIA Spectrum-X Ethernet networking platform is designed to provide innovators such as xAI with faster processing, analysis and execution of AI workloads, and in turn accelerates the development, deployment and time to market of AI solutions.”

Pushing the boundaries of AI development

At the heart of the Spectrum-X platform is the Spectrum SN5600 Ethernet switch, which supports port speeds of up to 800Gb/s and is based on the Spectrum-4 switch ASIC.

Nvidia highlights in its statement that xAI chose to pair the Spectrum SN5600 switch with NVIDIA BlueField-3® SuperNICs for unprecedented performance.


Spectrum-X Ethernet networking for AI offers a range of advanced features that are designed to deliver highly effective and scalable bandwidth with low latency and short tail latency, previously exclusive to InfiniBand. 

These features include adaptive routing with NVIDIA Direct Data Placement technology, congestion control, and enhanced AI fabric visibility and performance isolation, all of which are key requirements for multi-tenant generative AI clouds and large enterprise environments.

“NVIDIA’s Hopper GPUs and Spectrum-X allow us to push the boundaries of training AI models at a massive scale, creating a super-accelerated and optimised AI factory based on the Ethernet standard.”

xAI spokesperson

This announcement comes at a time when AI chip capabilities are in high demand across the data centre industry, prompting giants like Nvidia to partner with other organisations to develop their offerings.

The expansion of the Colossus supercomputer, powered by Nvidia technology, demonstrates the urgent demand for massive computational power in AI training, which is inevitably pushing the boundaries of data centre capabilities. 

A new era of advanced networking is needed to meet the demands of modern AI supercomputers.




Data Centre Magazine is a BizClik brand
