AWS Trainium3 Cuts AI Data Centre Power Consumption by 40%

AWS' Trainium3 chip. Credit: Amazon Web Services (AWS)
AWS says its third-generation custom chip Trainium3 provides a 50% cost reduction for enterprise training and inference workloads

Amazon Web Services has announced the general availability of Amazon EC2 Trn3 UltraServers, powered by the company's third-generation Trainium chip, which is built on three-nanometre process technology.

The Trn3 UltraServers pack up to 144 Trainium3 chips into a single integrated system, delivering up to 4.4 times more compute performance than Trainium2 UltraServers. AWS says customers running OpenAI's open-weight model GPT-OSS can achieve three times higher throughput per chip and four times faster response times than on Trn2 UltraServers.

Matt Garman, CEO of AWS, says: "We ran through several open source models – all the workloads that we've been optimising to run on Trainium2 – to see how they run on Trainium3." In his keynote at AWS re:Invent, the CEO reported that Trainium3 delivers five times more output tokens per megawatt than Trainium2 while maintaining the same latency.

Performance and energy efficiency improvements

David Brown, VP of AWS Compute and Machine Learning Services, told Technology Magazine: "It's got 4x more performance over Trainium2, which is fantastic. It's also going to be 40% more performance per watt, which is obviously very, very important as we think about how much compute we can get out of every watt of power that we put into the device. And that makes it about 40% better energy efficient as well when we go from Trainium2. We've also increased the memory bandwidth by 50%."

These improvements could prove significant for data centre operators that must balance power consumption against growing computational demand. Better energy efficiency means more processing capability within the limits of existing power infrastructure, a constraint that continues to shape data centre planning decisions.
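To make the arithmetic behind the 40% figure concrete, the sketch below treats performance per watt as a simple ratio and works through two idealised cases: a fixed power budget and a fixed workload. The function names and the 10 MW budget are illustrative assumptions, not AWS figures, and real deployments will also be shaped by cooling, networking and utilisation.

```python
def compute_at_fixed_power(perf_per_watt_gain: float) -> float:
    """Relative compute available when the power budget is unchanged."""
    return 1.0 + perf_per_watt_gain


def power_at_fixed_compute(perf_per_watt_gain: float) -> float:
    """Relative power needed to deliver the same compute as before."""
    return 1.0 / (1.0 + perf_per_watt_gain)


gain = 0.40       # "40% more performance per watt", as quoted above
budget_mw = 10.0  # hypothetical facility power budget, for illustration only

more_compute = compute_at_fixed_power(gain)
less_power = power_at_fixed_compute(gain)

print(f"Same {budget_mw:.0f} MW budget: {more_compute:.2f}x the compute")
print(f"Same workload: {less_power * budget_mw:.2f} MW, "
      f"a {100 * (1 - less_power):.1f}% reduction")
```

Under these simplifying assumptions, a 40% gain in performance per watt yields 40% more compute from the same power, or roughly 29% less power for the same workload.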

Trainium architecture powers Amazon Bedrock

AWS says it has now deployed more than one million Trainium chips, and the majority of Amazon Bedrock inference workloads run on Trainium architecture. Matt explains: "If you look at all the inference running on Amazon Bedrock today, the majority is actually powered by Trainium, and the performance advances of Trainium are really noticeable. For Anthropic's latest generation models in Bedrock, all of that traffic is running on Trainium, which is delivering the best response times compared to any other major provider."

The Trainium family has evolved rapidly since its introduction in 2020. AWS engineered the Trn3 UltraServer as a vertically integrated system, from chip architecture to software stack. The new NeuronSwitch-v1 delivers twice the bandwidth within each UltraServer, while enhanced Neuron Fabric networking cuts communication delays between chips to under 10 microseconds.

For customers requiring greater scale, EC2 UltraClusters 3.0 can connect thousands of UltraServers containing up to one million Trainium chips, 10 times as many as the previous generation. Matt adds: "We've gotten to a million chips in record speed, not just because we control the whole stack, but because we can optimise end-to-end how we roll it out."
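As a rough sanity check on that scale, the sketch below divides the one-million-chip ceiling by the 144 chips in a fully populated Trn3 UltraServer. It assumes every UltraServer is at maximum capacity, which real clusters may not be, and the variable names are illustrative.

```python
import math

CHIPS_PER_ULTRASERVER = 144    # maximum per Trn3 UltraServer, per the article
MAX_CLUSTER_CHIPS = 1_000_000  # UltraClusters 3.0 ceiling, per the article

# Minimum number of fully populated UltraServers needed to reach the ceiling.
ultraservers_needed = math.ceil(MAX_CLUSTER_CHIPS / CHIPS_PER_ULTRASERVER)

# A tenfold increase implies a previous-generation ceiling of about this many chips.
implied_previous_ceiling = MAX_CLUSTER_CHIPS // 10

print(ultraservers_needed)       # 6945
print(implied_previous_ceiling)  # 100000
```

On these assumptions, reaching the ceiling takes just under 7,000 UltraServers, which is consistent with the "thousands" described.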

Rapid deployment and future development

Through Project Rainier, AWS collaborated with Anthropic to connect more than 500,000 Trainium2 chips into what the company describes as the world's largest AI compute cluster. David says: "We spoke about Project Rainier at re:Invent last year, so early December 2024, and by October, less than a year later – 10 months later – we were deploying it."

He adds that the company expects to scale Trainium3 faster than Trainium2: "I would actually expect us to scale Trainium3 even faster than Trainium2. One thing we constantly learn is it's not only about how quickly you can make the silicon – and you have to make it well, because any mistake in the silicon, you're going to lose way too much time."

Customers including Anthropic, Karakuri, Metagenomics, Neto.ai, Ricoh and Splashmusic are reducing training and inference costs by up to 50% with Trainium technology. Decart, an AI laboratory specialising in efficient generative AI video and image models, is achieving four times faster frame generation at half the cost of graphics processing units.

AWS also announced that it is developing Trainium4, which is designed to deliver at least six times the processing performance in FP4 precision, three times the FP8 performance and four times the memory bandwidth. FP8 is an industry-standard precision format that balances model accuracy with computational efficiency for AI workloads.

Trainium4 is being designed to support Nvidia's NVLink Fusion high-speed chip interconnect technology, which will allow Trainium4, Graviton and Elastic Fabric Adapter to work together within common MGX racks, providing rack-scale AI infrastructure that supports both graphics processing unit and Trainium servers.

Amazon EC2 Trn3 UltraServers are available now.






