Introduction
The evolution of networking technology has been marked by continuous advances toward faster speeds, lower latency, and more robust data handling. Ultra Ethernet is the latest development in this line, and it promises to revolutionize high-performance networking for contemporary data-centric applications like artificial intelligence (AI), machine learning (ML), and large-scale data analytics. More than a hundred member companies have joined the Ultra Ethernet Consortium since it began in 2023. Version 1.0 of the specification has been released, but actual development at the application, OS, and silicon levels is still ongoing.
Ethernet difficulties
Typical business traffic is extremely diverse and flexible in its requirements. AI/ML networks, on the other hand, have consistent but stringent requirements. They are highly intolerant of latency and unreliability, they have a small number of powerful endpoints, and no part of a job is finished until every part of the job is finished.
What traditional Ethernet hasn’t done well:
Load balancing: Ethernet doesn’t really load balance; rather, it does “statistical load distribution” based on the standard 5-tuple (source and destination IP addresses, source and destination ports, and protocol). With so little information to differentiate extra-large, long-lived flows from one another, Ethernet can (and does) try to cram too much data down one pipe while leaving the rest unused.
Ordering: Ethernet does not guarantee in-order delivery, which HPC requires, and relies on the transport and application layers to provide it.
Reliability: In the same way, Ethernet does not promise that a datagram sent will be delivered. RDMA operations address this with a Go-Back-N strategy: if anything is dropped, rewind to the “last known good” message and restart from there. A small number of drops can have a significant impact (more than 60 percent) on the time required to complete a task.
DCQCN: Scaling, tuning, and adapting the existing PFC/ECN/DCQCN congestion management systems is challenging.
InfiniBand has been the preferred RDMA transport for the past 20 years, and the Ultra Ethernet Consortium (UEC) was established in response to this.
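To make the Go-Back-N cost described under Reliability more concrete, here is a minimal sketch that counts how many packets go on the wire under Go-Back-N versus a selective retransmit scheme. It is a deliberately simplified model, not taken from any RDMA implementation: the function names, the window-at-a-time sending pattern, and the assumption that each dropped packet is lost only on its first transmission are all illustrative. It counts packets only and does not model timeouts or completion time.

```python
def go_back_n_sends(num_packets, window, dropped):
    """Count packets sent under a simplified Go-Back-N model.

    Assumption: every sequence number in `dropped` is lost on its first
    transmission only, and the sender transmits one full window at a time.
    """
    pending_drops = set(dropped)
    sends = 0
    base = 0  # first packet not yet acknowledged
    while base < num_packets:
        end = min(base + window, num_packets)
        sends += end - base  # the whole window goes on the wire
        lost = next((seq for seq in range(base, end) if seq in pending_drops), None)
        if lost is None:
            base = end  # everything in the window was delivered
        else:
            pending_drops.discard(lost)  # the retransmission will succeed
            base = lost  # rewind: the lost packet and everything after it is resent
    return sends


def selective_retransmit_sends(num_packets, dropped):
    """Count packets sent when only the lost packets are resent."""
    lost = {seq for seq in dropped if 0 <= seq < num_packets}
    return num_packets + len(lost)


if __name__ == "__main__":
    drops = {10, 500}
    print("Go-Back-N sends:", go_back_n_sends(1000, 100, drops))
    print("Selective sends:", selective_retransmit_sends(1000, drops))
```

In this model the waste per drop grows with the number of packets already in flight behind the lost one, which is why large windows on fast links make Go-Back-N expensive.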
The goals of the UEC are simple:
Match/Exceed InfiniBand performance
Retain the scale and flexibility of Ethernet
Adapt RDMA to the current era.
In keeping with this, Ultra Ethernet introduces a refined packet structure designed specifically for workloads requiring high performance. The following are important aspects of the Ultra Ethernet specification:
Extended header fields: The packet headers feature extended fields for advanced metadata, enabling more precise flow classification and improved routing decisions. These fields also support enhanced quality of service (QoS) and security tagging.
Traffic profiles: Ultra Ethernet provides three fundamental profiles for distinct types of traffic.
AI Base: Focuses on core “common” AI use cases, including distributed inference and training.
AI Full: Adds features such as Deferrable Send, exact-matching tagging, and extended atomic operations.
HPC: Full rendezvous support for HPC workloads beyond AI, more atomic operations, and advanced tagging semantics.
In-network collectives: Ultra Ethernet packets contain markers for operations such as reductions and transformations that make in-network processing easier. This is an evolution and application of FPGA (Field Programmable Gate Array) technology and is particularly valuable for AI/ML workloads where intermediate computations can be performed in transit.
High-precision timestamps: Each packet contains high-precision timestamps, allowing for precise latency measurements and maintaining synchronization across distributed systems.
Congestion management: More refined mechanisms for throttling back (or opening up) throughput. This includes things like:
Ephemeral connections, to move quickly from zero to wire
Optimizations for ultra-low latency “short RTT” environments
Packet spraying, to get the most out of the available bandwidth
Quick and granular recovery: Ultra Ethernet moves some retransmit and recovery functions to the Link Layer, and packet trimming allows for granular/selective retransmits of lost data.
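As a rough illustration of why packet spraying uses bandwidth better than 5-tuple hashing, the sketch below compares per-link load under both approaches. Everything here is an assumption made for illustration: the SHA-256 based hash stands in for whatever hash a real switch uses, round-robin stands in for the path selection a real Ultra Ethernet implementation would perform, and the flow tuples are made up.

```python
import hashlib


def hashed_link(five_tuple, num_links):
    """Pick a link from the 5-tuple alone (hypothetical stand-in for a switch hash)."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_links


def hashed_load(flows, num_links):
    """Per-link packet counts when every packet of a flow follows the same link."""
    load = [0] * num_links
    for five_tuple, packets in flows:
        load[hashed_link(five_tuple, num_links)] += packets
    return load


def sprayed_load(flows, num_links):
    """Per-link packet counts when packets are spread round-robin across all links."""
    load = [0] * num_links
    next_link = 0
    for _, packets in flows:
        for _ in range(packets):
            load[next_link] += 1
            next_link = (next_link + 1) % num_links
    return load


if __name__ == "__main__":
    # Four large flows between a few endpoints: (src IP, dst IP, src port, dst port, protocol)
    flows = [
        (("10.0.0.1", "10.0.0.2", 49152 + i, 4791, "UDP"), 1000)
        for i in range(4)
    ]
    print("hashed :", hashed_load(flows, 4))
    print("sprayed:", sprayed_load(flows, 4))
```

With only a handful of large flows, the hashed result depends entirely on where each 5-tuple happens to land, so one link can carry several flows while others sit idle. Spraying evens out the load, at the cost of delivering packets out of order, which the transport then has to tolerate.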
UET protocol stack
Full-stack hardware with support for Ultra Ethernet Transport (UET), the UEC's transport protocol, is expected to be available in late 2025 or early 2026. It is important to understand that most current hardware documented as “supporting Ultra Ethernet” will accept Ultra Ethernet packets but does not yet implement the full suite of enhancements that will eventually be included in the specification.
Overview
Ultra Ethernet consists mainly of updated components at the Physical, Link, and Transport layers, along with improved congestion management strategies that can handle extremely large environments. The UET protocol stack is depicted in this graphic, which also notes the major features at each layer.