AWS: How 500,000 Trainium2 Chips Power Project Rainier

AWS's Trainium2 Chip. Credit: AWS
AWS has deployed Project Rainier, an AI compute cluster featuring nearly 500,000 Trainium2 chips, with Anthropic already running Claude workloads across US

AWS has activated Project Rainier, an AI compute cluster featuring nearly 500,000 Trainium2 chips spread across data centres in the United States. The cluster was completed less than 12 months after AWS first announced the project at its re:Invent conference in December 2024.

Anthropic, the AI safety and research company, is using Project Rainier to train and deploy Claude, its foundation model. AWS says it expects Anthropic to scale to more than one million Trainium2 chips by the end of 2025, for workloads that include both training and inference.

Project Rainier represents a 70% increase in AWS’ AI computing infrastructure compared to previous deployments, providing Anthropic with more than five times the compute power the company used to train earlier versions of its models.

AWS Trainium2 delivers custom silicon for model training

The Trainium2 chip was designed by Annapurna Labs, AWS’ custom silicon division, for training foundation models and large language models. A single Trainium2 chip can complete trillions of calculations per second. The EC2 Trn2 instances feature 16 Trainium2 chips and deliver 20.8 peak petaflops of compute performance. AWS claims the instances offer 30 to 40% better price performance than current GPU-based EC2 instances.

The architecture uses Trn2 UltraServers, which combine four physical servers into one unit. Each UltraServer contains 64 Trainium2 chips interconnected through NeuronLink, a high-speed connection technology developed by AWS. The NeuronLink connections use blue cables to distinguish them from other networking infrastructure in the data centre.
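The figures above fit together arithmetically. As a rough sketch, the Python below checks them against each other using only the numbers quoted in this article. It assumes each of the four servers in an UltraServer carries the same 16 chips as a Trn2 instance, and it rounds "nearly 500,000" up to 500,000; the function names are illustrative and are not part of any AWS API.

```python
import math

# Figures quoted in the article
CHIPS_PER_TRN2_INSTANCE = 16       # chips in one EC2 Trn2 instance
PEAK_PFLOPS_PER_INSTANCE = 20.8    # peak petaflops per Trn2 instance
SERVERS_PER_ULTRASERVER = 4        # physical servers combined into one UltraServer

# Assumption: "nearly 500,000" is rounded up to a flat 500,000
CLUSTER_CHIPS = 500_000


def per_chip_peak_pflops() -> float:
    """Peak petaflops attributable to a single chip, by simple division."""
    return PEAK_PFLOPS_PER_INSTANCE / CHIPS_PER_TRN2_INSTANCE


def chips_per_ultraserver() -> int:
    """Assumes each physical server holds one instance's worth of chips."""
    return CHIPS_PER_TRN2_INSTANCE * SERVERS_PER_ULTRASERVER


def ultraservers_needed(chips: int) -> int:
    """Number of UltraServers required to house the given chip count."""
    return math.ceil(chips / chips_per_ultraserver())


if __name__ == "__main__":
    print(f"Peak per chip: {per_chip_peak_pflops():.1f} petaflops")
    print(f"Chips per UltraServer: {chips_per_ultraserver()}")
    print(f"UltraServers for {CLUSTER_CHIPS:,} chips: {ultraservers_needed(CLUSTER_CHIPS):,}")
```

On these assumptions, four 16-chip servers give the 64 chips per UltraServer that AWS describes, and a 500,000-chip deployment would need roughly 7,800 UltraServers. The per-chip number is a theoretical peak derived by division, not a measured result.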

AWS announced the Trainium2 chip at its re:Invent conference in 2024. Credit: AWS

“Project Rainier is one of AWS’ most ambitious undertakings to date,” says Ron Diamant, Distinguished Engineer at AWS and Head Architect of Trainium. “It’s a massive, one-of-its-kind infrastructure project that will usher in the next generation of AI models.”

When multiple UltraServers are connected through Elastic Fabric Adapter networking technology, identified by yellow cables, they form what AWS calls an UltraCluster. This networking operates both within individual data centre buildings and across separate facilities.

Project Rainier spans multiple AWS data centre locations

Project Rainier is spread across multiple data centres rather than concentrated in a single location, and is named after the 4,392-metre stratovolcano that can be seen from Seattle on a clear day.

In an interview with CNBC, Matt Garman, CEO of AWS, said Anthropic is already running about 500,000 chips in Indiana. “And in fact, it’s going so well that they’ve actually doubled down on that order,” he said.

Mike Krieger, Chief Product Officer at Anthropic, told CNBC that AWS’ ability to deliver infrastructure at scale distinguishes the partnership. “These deals all sound great on paper,” he said. “But they only materialise when they’re actually racked and loaded and usable by the customer. And Amazon is incredible at that.”

Amazon has invested US$8bn in Anthropic since the start of 2024. The partnership also includes technical collaboration, with Anthropic providing input on infrastructure design.

AWS CEO, Matt Garman

Water efficiency measures at Project Rainier data centres

The data centres in St Joseph County, Indiana maximise the use of outside air for cooling. AWS reports a water usage efficiency of 0.15 litres of water per kilowatt-hour, which it says represents a 40% improvement since 2021.
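To put the water figure in context, the short Python sketch below converts it into litres for a given amount of energy and works back to the 2021 baseline it implies. It assumes the 40% improvement means a 40% reduction in litres per kilowatt-hour relative to 2021, which the article does not spell out; the function names and the one-million-kWh example are illustrative only.

```python
# Figures quoted in the article
WUE_LITRES_PER_KWH = 0.15   # reported water usage efficiency
IMPROVEMENT_SINCE_2021 = 0.40

def water_litres(energy_kwh: float, wue: float = WUE_LITRES_PER_KWH) -> float:
    """Litres of water used for a given amount of energy at the stated WUE."""
    return energy_kwh * wue


def implied_baseline_wue(current_wue: float, improvement: float) -> float:
    """Baseline WUE, assuming 'improvement' is a fractional reduction from it."""
    return current_wue / (1.0 - improvement)


if __name__ == "__main__":
    # Hypothetical load: one million kWh (1 GWh) of energy
    print(f"Water for 1,000,000 kWh: {water_litres(1_000_000):,.0f} litres")
    baseline = implied_baseline_wue(WUE_LITRES_PER_KWH, IMPROVEMENT_SINCE_2021)
    print(f"Implied 2021 WUE: {baseline:.2f} litres per kWh")
```

Under that reading, the 2021 figure would have been about 0.25 litres per kilowatt-hour, and one gigawatt-hour of energy at today's rate corresponds to roughly 150,000 litres of water.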

The electricity consumed by Amazon’s operations, including its data centres, was matched 100% by renewable energy resources in 2023. Amazon has been the largest corporate purchaser of renewable energy in the world for the past five years.

AWS announced it would roll out new data centre components that reduce mechanical energy consumption by up to 46% and reduce embodied carbon in concrete by 35%. The sites constructed to support Project Rainier include these upgrades. The company remains on a path to be net-zero carbon by 2040.

AWS Project Rainier. Credit: Amazon

Trainium3 chips to deploy at Project Rainier sites

In the interview with CNBC, Garman said AWS is preparing to deploy Trainium3, the third generation of its AI training chip. “It gives better performance, it gives better latency characteristics, it gets better power consumption per flop,” he told CNBC. “That will be deployed inside of Indiana. It’ll be deployed in many of our other data centres all around the world.”

Trainium3 is set to launch in the next few months. The chip was developed in collaboration with Anthropic, according to CNBC, with the AI company providing direct input to improve training speed, reduce latency and enhance energy efficiency.

“When we build our own devices, we get to optimise across the entire stack to really compress engineering time and the time to get to massive scale,” Diamant says.