The Ultra Ethernet Consortium (UEC) has dropped its new Ultra Ethernet Specification 1.0, and it’s a game-changer for AI networking, especially for the ‘back-end’ networks that interconnect GPUs. (This post uses the term GPU to cover both GPUs and other AI accelerators.)

Here’s why it’s so critical for the future of AI infrastructure:

Given these advantages, Rivos has already adopted Ultra Ethernet (UE) for the networking interfaces built into our combination CPU+GPGPU Accelerator chips. Looking ahead, we expect to adopt the protocol extensions that support In-Network Collective (INC) operations. This feature, which will require corresponding switch capabilities, allows certain AI computations to happen directly within the network, significantly reducing data movement.

Use of Existing Switches

The UE 1.0 specification was explicitly written to work with existing Ethernet switches. This allows immediate deployment on existing infrastructure, and multi-vendor selection when building new infrastructure. This contrasts with other technologies, like UALink, NVLink and InfiniBand, which need specialized switches with single (or limited) vendor selection.

Note that the UEC was careful to understand and build on optimization features that are already common in switches, such as packet trimming and Explicit Congestion Notification (ECN).

UE achieves robust congestion control by integrating observed packet behavior with the network-provided trimming and ECN data. This approach facilitates high-performance operation even on lossy networks, eliminating the complex configuration needed to manage older backpressure mechanisms, such as Priority Flow Control.
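To make the interplay of these signals concrete, here is a minimal sketch of a sender-side congestion window that reacts to ECN marks and trimmed-packet notifications. This is an invented illustration of the signal types involved, not the actual UET congestion-control algorithm; all names and constants are assumptions.

```python
# Hypothetical sketch: a sender-side congestion window driven by ECN
# marks and packet-trimming notifications. NOT the real UET algorithm;
# constants and structure are invented for illustration.

class CongestionWindow:
    def __init__(self, init_pkts=16.0, min_pkts=1.0, max_pkts=1024.0):
        self.cwnd = init_pkts          # packets in flight allowed
        self.min = min_pkts
        self.max = max_pkts

    def on_ack(self, ecn_marked: bool):
        if ecn_marked:
            # Switch signaled building congestion: back off gently.
            self.cwnd = max(self.min, self.cwnd * 0.8)
        else:
            # Clean delivery: probe for more bandwidth.
            self.cwnd = min(self.max, self.cwnd + 1)

    def on_trim(self):
        # The switch trimmed a payload instead of silently dropping the
        # packet: a stronger congestion signal, and one that lets the
        # sender retransmit immediately rather than waiting on a timeout.
        self.cwnd = max(self.min, self.cwnd * 0.5)
```

The key point is that trimming converts what would be a silent loss (and a costly timeout) into an explicit, fast signal, which is what lets the transport run well on lossy networks without Priority Flow Control.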

Ultra Ethernet Transport

The UET layer is built for efficient hardware implementation of the bulk data movement used in AI computations. AI networks are scaling rapidly, with 32,000-GPU clusters common today and 100,000-GPU clusters under development. In such environments, any GPU may need to perform remote memory access to any other GPU. While the total number of potential communication partners is vast, over any short time period a given GPU typically communicates with only a small subset of its peers. Each communication involves a bidirectional connection.

UET streamlines connection setup and teardown. A connection is initiated by a flag in the first data packet and its acknowledgment. In the common case where the destination is ready, this process eliminates a round-trip delay, making it highly efficient. Connection teardown is equally streamlined. A hardware implementation can manage opening, maintaining and closing the active connections while still supporting the large number of total GPUs. Similarly, the data structures needed to track packet delivery, sequencing and retry are designed for easy hardware implementation.
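The setup described above can be sketched as follows: connection state is created on demand when the first data packet arrives with its opening flag set, so no separate handshake round-trip is needed. Field and method names here are invented for illustration and do not come from the UET specification.

```python
# Hypothetical sketch of connection setup piggybacked on the first
# data packet. Names are invented, not from the UET spec.
from dataclasses import dataclass

@dataclass
class Packet:
    conn_id: int
    seq: int
    payload: bytes
    start_flag: bool = False   # set only on a connection's first packet

class Endpoint:
    def __init__(self):
        self.connections = {}  # conn_id -> next expected sequence number

    def send_first(self, conn_id: int, payload: bytes) -> Packet:
        # No handshake round-trip: the opening flag rides with real data.
        return Packet(conn_id, seq=0, payload=payload, start_flag=True)

    def receive(self, pkt: Packet) -> bytes:
        if pkt.start_flag and pkt.conn_id not in self.connections:
            # Connection state is created on demand by the first packet.
            self.connections[pkt.conn_id] = 0
        expected = self.connections[pkt.conn_id]
        assert pkt.seq == expected, "out-of-order delivery not shown here"
        self.connections[pkt.conn_id] = expected + 1
        return pkt.payload
```

Because connection state is just a small table keyed by connection ID, this style of on-demand setup and teardown maps naturally onto hardware that tracks only the currently active subset of peers.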

Implementing the transport layer in straightforward hardware allows multiple Ultra Ethernet interfaces to be built into a single chip, which in turn enables low-latency data transfers to be initiated directly from code running on the GPU.

Remote Load/Store and Atomic Operations

Ultra Ethernet delivers the full suite of remote load/store and atomic operations, capabilities provided in more specialized protocols like UALink. Crucially, Ultra Ethernet achieves this using standard Ethernet protocols and readily available switches.
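As an illustration of what this operation set looks like to software, here is a sketch of a remote load/store and atomic API, using a local object as a stand-in for a peer GPU's memory reachable over the fabric. The class and method names are invented and do not come from the Ultra Ethernet specification.

```python
# Hypothetical sketch of remote load/store and atomic operations.
# A local object stands in for a peer GPU's memory over the fabric;
# names are invented, not from the Ultra Ethernet spec.

class RemoteMemory:
    """Stand-in for a remote GPU's memory window."""
    def __init__(self, size: int):
        self.mem = [0] * size

    def load(self, addr: int) -> int:
        return self.mem[addr]

    def store(self, addr: int, value: int) -> None:
        self.mem[addr] = value

    def atomic_fetch_add(self, addr: int, delta: int) -> int:
        # On real hardware this read-modify-write executes atomically at
        # the target, so concurrent updaters never lose increments.
        old = self.mem[addr]
        self.mem[addr] = old + delta
        return old
```

A typical use of the atomic is work distribution: each GPU claims the next work item with a fetch-add on a shared counter, with no lock traffic across the network.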

In contrast, UALink, despite a physical layer similar to Ethernet or PCIe Gen 7, mandates specialized switches that rely on proprietary protocols for low latency. While this might seem appealing, a more pragmatic approach leverages today’s standard low-latency ‘AI Ethernet Switches’ to achieve the latency benefits without requiring bespoke hardware. The UEC is actively developing compact, standard-compliant headers to further narrow any latency gap.

Beyond its specialized switch requirement, UALink imposes limitations on the total number of GPUs, forcing distinct ‘scale-up’ and ‘scale-out’ network architectures. Ultra Ethernet, however, provides messaging, remote DMA, load/store, and atomic operations over a single unified network. This means both scale-up and scale-out can be supported seamlessly, with domain sizes configurable by the user. In a shared cluster using standard Ethernet VLANs for isolation, different users can even define different sizes of scale-up and scale-out domains.
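The per-tenant domain layout described above can be sketched as a simple mapping: each GPU carries a VLAN tag for tenant isolation plus a scale-up domain label, all over one unified fabric. GPU names, VLAN IDs, and domain labels below are invented for illustration.

```python
# Hypothetical per-tenant domain layout over one unified Ethernet
# fabric. GPU names, VLAN tags, and domain labels are invented.
fabric = {
    # gpu_id: (vlan_tag, scale_up_domain)
    "gpu0": (100, "A0"),   # tenant A, scale-up domain 0
    "gpu1": (100, "A0"),
    "gpu2": (100, "A1"),   # tenant A, scale-up domain 1
    "gpu3": (200, "B0"),   # tenant B, isolated by VLAN
}

def can_communicate(a: str, b: str) -> bool:
    # VLAN isolation: only GPUs on the same VLAN see each other.
    return fabric[a][0] == fabric[b][0]

def same_scale_up(a: str, b: str) -> bool:
    # Scale-up peers share both the VLAN and the scale-up domain;
    # same-VLAN GPUs in different domains communicate via scale-out.
    return can_communicate(a, b) and fabric[a][1] == fabric[b][1]
```

Because both domain types ride the same network, resizing a tenant's scale-up domain is a configuration change rather than a rewiring exercise.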

Conclusion

Efficient, scalable solutions are essential to meet the demands of today’s diverse AI workloads and the next generation of emerging models. Rivos integrates Ultra Ethernet into the network interfaces of our heterogeneous AI SoCs, tightly integrating communication with CPU, GPGPU, HBM and DDR5. This enables a balanced, high-performance platform optimized for the dynamic needs of AI workloads.

Ultra Ethernet not only delivers the bandwidth and efficiency required for remote memory access and large-scale AI workloads, but also supports the reuse of existing Ethernet switch infrastructure, simplifying deployment and accelerating solution diversification. By merging the compatibility of standard Ethernet with high-performance capabilities once limited to proprietary protocols, Ultra Ethernet delivers the best of both worlds for scalable AI networking.
