AI workloads demand unprecedented levels of bandwidth, low latency, and deterministic communication across increasingly dense compute infrastructures. This work focuses on emerging network architectures tailored for AI servers, racks, and clusters—highlighting trends such as high-radix topologies, RDMA over converged Ethernet (RoCE), optical interconnects, and in-network compute. It examines how networking shapes system performance, scalability, and efficiency, and outlines architectural strategies to address bottlenecks in collective communication, model parallelism, and distributed training at hyperscale.
IOWN Technology Director, IOWN Development Office, R&D Planning Department, NTT
Masahisa Kawashima is currently leading NTT's R&D of Innovative Optical and Wireless Network (IOWN) as the IOWN Technology Director. He is also serving as the Chair of the Technology Working Group at IOWN Global Forum. He has been working as a bridge between technologies and businesses...
Mohamad El-Batal is the Seagate Enterprise Data Solutions (EDS) Chief Technologist, and part of the overall Seagate Office of the CTO team. He is given the opportunity to shape the Seagate EDS strategy and future storage product technology roadmap. In his career, Mohamad led a team...
RoCEv2 is being widely deployed due to the emerging GenAI trend, and there is a growing need to mix AI and HPC workloads to maximize infrastructure investment efficiency. RoCEv2, developed decades ago for simpler workloads, is starting to show its limitations in hyperscale deployments, which has led to the formation of the Ultra Ethernet Consortium (UEC).
MUFG Bank and NTTDATA worked with the IOWN Global Forum to create use cases and Reference Implementation Models demonstrating how APN and Optical DC networks can transform financial systems. A white paper was published detailing these innovations and their applications. MUFG Bank, NTTDATA, and NTT West tested their ideas through pilot experiments, yielding two major outcomes: real-time database synchronization reduces reliance on complex backup frameworks, and virtual instances enable seamless Data Center transitions without downtime, enhancing efficiency. This allows financial institutions to meet evolving customer demands more effectively. Research is ongoing to integrate Optical Network technology with OCP hardware to address latency, bandwidth, and prioritization issues for NIC-to-NIC communication. By combining software, hardware, and networks, NTTDATA aims to create smarter, more resilient financial systems.
FBOSS is Meta’s own Software Stack for managing Network Switches deployed in Meta’s data centers. It is one of the largest services in Meta in terms of the number of instances deployed.
Network Traffic in AI Fabric presents unique challenges such as “elephant flows” (a small number of extremely large, continuous flows), and low entropy (limited variation in flow characteristics, increasing likelihood of hash collisions).
At OCP 2024, we showcased how we evolved FBOSS to tackle these challenges. This solution is capable of building non-blocking clusters for up to 4K GPUs. However, generative AI use cases demand significantly larger non-blocking clusters. This can be solved by interconnecting multiple 4K GPU clusters into a single, larger cluster using traditional Routing and ECMP. In this design, intra-cluster traffic benefits from non-blocking I/O, but inter-cluster traffic continues to suffer from poor network performance due to the aforementioned elephant flows and low entropy.
In this talk, we will share our journey evolving FBOSS for generative AI workloads. We will discuss the hierarchical design that enables us to build significantly larger non-blocking clusters, the unique challenges we encountered in scaling both the dataplane and control plane, and the solutions we developed to overcome them. Additionally, we will highlight the SAI enhancements that were instrumental in adapting FBOSS to support the demands of generative AI.
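The low-entropy problem described above can be sketched in a few lines. The flow tuples, sizes, and link count below are illustrative assumptions, not Meta's actual traffic; the point is that with only a handful of elephant flows, hash-based ECMP can leave some links idle while others carry multiple elephants.

```python
from collections import Counter

def ecmp_pick(flow, n_links):
    # Hash the flow tuple to choose an egress link, as ECMP does.
    return hash(flow) % n_links

def link_loads(flows, n_links):
    """Bytes landing on each link after ECMP placement of each flow."""
    loads = Counter()
    for flow, nbytes in flows:
        loads[ecmp_pick(flow, n_links)] += nbytes
    return [loads.get(i, 0) for i in range(n_links)]

# Low entropy: only 4 elephant flows across 8 links. At most 4 links
# carry traffic; a hash collision stacks two elephants on one link.
elephants = [(("10.0.0.%d" % i, "10.0.1.%d" % i, 4791, 4791, "udp"), 10**9)
             for i in range(4)]
print(link_loads(elephants, 8))
```

With thousands of short flows the same hash spreads load evenly; with a few large flows the variance, and thus the chance of a congested link, is high.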
This session highlights CXL RAS Firmware First error-handling use cases implemented against the CXL specification, which is generic in nature. While developing Firmware First support, we encountered a variety of use cases that had to be solved in firmware design and implementation to address both specification gaps and customer problems. Topics include: use of the primary and secondary mailboxes to support IHV early adoption during engineering and debug; use of common error-signaling protocols for protocol errors; GUID and UUID usage in firmware on the CPU/host side, in CXL devices, and in the operating system; notifying users and the operating system of communication failures between the CPU and CXL devices at boot time and at run time; techniques for improving SMI latency; and handling error-pollution cases for protocol errors, Memory Error Firmware Notification (MEFN), and Flat2LM configurations. The CXL ecosystem comprises a multitude of component vendors, including SoC, memory, storage, and networking suppliers. The explosive growth of internet content, and the resulting data storage and computation requirements, has driven the deployment of heterogeneous and complex solutions in very large-scale data centers: warehouse-sized buildings packed with server, storage, and network hardware. In such deployments, an uncorrected fatal error detected by hardware that poses a containment risk requires the system to be reset and restarted, where possible, to enable continued operation. When such an error affects an entire CXL device, a persistent/permanent memory device is considered to have experienced a dirty shutdown.
As AI/ML clusters continue to scale, breaching the boundaries of single physical locations in both size and power, the need to scale across and interconnect multiple locations becomes ever more crucial.
When these use cases are extended to interconnected locations, several considerations must be met:
- High bandwidth used effectively between geographically dispersed locations over varying distances
- Support for lossless RDMA traffic
- A simple and condensed interconnection layer
The presentation will focus on how Broadcom’s Jericho product line allows for the implementation of such needs with innovations throughout the stack - from physical connectivity and all the way to intelligent load-balancing.
This presentation will focus on an innovative dynamic Explicit Congestion Notification (ECN) threshold testing methodology, emphasizing the design rationale for test cases and the observational analysis of experimental results. We will explore how designed test cases trigger ECN threshold changes in dynamic network environments, ensuring comprehensive and effective testing.
A key insight from our research is the critical role of qp-fairness (Queue Pair fairness) in collective benchmarking, alongside traditional metrics like algorithmic bandwidth and bus bandwidth. Through comparative analysis of real-world test data, we demonstrate how maintaining qp-fairness under dynamic conditions significantly enhances the stability of ECN mechanisms and ensures equitable allocation of network resources. By aligning theoretical insights with practical implementations, we hope to provide actionable insights for advancing research and applications in dynamic ECN technologies.
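As a rough illustration of the qp-fairness metric discussed above, one common way to quantify fairness across Queue Pairs is Jain's fairness index; the throughput numbers below are illustrative, not measured data from our tests.

```python
def jain_fairness(throughputs):
    """Jain's fairness index over per-QP throughputs.

    Returns 1.0 when all QPs get equal share, approaching 1/n when a
    single QP hogs all the bandwidth.
    """
    n = len(throughputs)
    s = sum(throughputs)
    sq = sum(x * x for x in throughputs)
    return (s * s) / (n * sq) if sq else 1.0

fair = [100.0, 100.0, 100.0, 100.0]    # equal per-QP throughput (Gbps)
unfair = [370.0, 10.0, 10.0, 10.0]     # one QP dominates after ECN misbehavior
print(jain_fairness(fair))    # 1.0
print(jain_fairness(unfair))  # well below 1.0 (~0.29)
```

Tracking an index like this alongside algorithmic and bus bandwidth makes regressions in per-QP equity visible even when aggregate throughput looks healthy.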
This session will share our practical experience in AI data center architecture planning, covering the critical interplay of computing, storage, and management. We'll delve into our comprehensive approach to designing scalable and efficient AI infrastructure.
A significant portion of our discussion will be dedicated to our innovative storage architecture, which addresses diverse AI workload demands through a dual-part strategy. First, we will present our high-performance storage solution, leveraging BeeGFS and GRAID's cutting-edge technologies with NVMe SSDs to meet the demands of intense AI computation. Second, we will explore our approach to tenant object storage, specifically utilizing GRAID's SupremeRAID alongside NVMe SSDs to provide robust and scalable data management for various user requirements.
NTT is considering Data-Centric Infrastructure (DCI) using IOWN technologies. DCI processes data efficiently by combining geographically dispersed resources. To achieve this, we’re verifying Composable Disaggregated Infrastructure (CDI) – a flexible hardware solution – and considering a multi-vendor approach. CDI consists of servers, PCIe expansion boxes, and switches, enabling software-controlled allocation of accelerators for optimal performance. Utilizing multi-vendor CDI requires an interface like OFA Sunfish to reduce operational costs. Our verification has revealed challenges in the physical operation of CDI and implementing a multi-vendor configuration. These include increased cabling costs, racking limitations, and inconsistencies in product functionalities and procedures requiring careful configuration management. This session will share these challenges and proposed solutions.
The proposed Layer 2 transparent network, bridging VM and container networks, is a software-defined network for AI service deployment. The cloud provider offers a tenant-aware, transparent network that combines VM and container networks into the same network domain. The benefits of this network are a full Layer 2 network and reduced communication overhead in a multi-tenant cloud system. A tenant can deploy its services in both VMs and containers, and communication among those VMs and containers stays within the same Layer 2 domain, reducing routing effort while isolating network traffic between tenants.
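A minimal sketch of the tenant-isolation behavior described above: each tenant's VMs and containers share one broadcast domain, and cross-tenant forwarding is refused. The endpoint names and the dictionary mapping are hypothetical, purely for illustration.

```python
# Model of a tenant-aware L2 domain: VMs and containers of the same
# tenant share one broadcast domain (e.g. one VLAN/VNI per tenant);
# traffic between tenants is dropped. All names are illustrative.
tenant_domains = {
    "tenant-a": {"vm-a1", "ctr-a1", "ctr-a2"},   # VMs and containers mixed
    "tenant-b": {"vm-b1", "ctr-b1"},
}

def same_l2_domain(src, dst):
    """Forward only if src and dst belong to the same tenant's L2 domain."""
    return any(src in members and dst in members
               for members in tenant_domains.values())

print(same_l2_domain("vm-a1", "ctr-a2"))  # True: same tenant, no routing hop
print(same_l2_domain("vm-a1", "ctr-b1"))  # False: tenants stay isolated
```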
Decades-old copper and optical interconnect technologies limit AI cluster compute efficiency. The presentation will showcase e-Tube Technology - RF data transmission over plastic waveguide - and how it breaks the barriers of these legacy technologies by providing near-zero latency and 3x better energy efficiency than optics at a cost structure similar to copper. e-Tube is an ideal replacement for copper for terabit interconnect to scale up next-generation AI clusters.
The demands of AI training and inference workloads heavily rely on efficient I/O operations. This presentation will explore how SSDs are architected to deliver superior performance across diverse I/O characteristics inherent in these AI processes. We'll demonstrate the advantages of NVMe SSDs in both training and inference environments, highlighting their strengths in handling various I/O patterns. Furthermore, we will delve into the critical aspect of checkpointing performance in AI pipelines, specifically showcasing how FDP significantly enhances checkpointing efficiency by mitigating bandwidth limitations and reducing contention in shared storage systems.
As data traffic continues to surge across AI networks, the need for higher bandwidth and efficient signal connectivity is critical. With the 200G/lambda generation well on its way to production, focus is quickly moving to the next 400G/lane SerDes, which represents a significant leap in interconnect performance. This advancement enables interconnects capable of reaching 3.2 Tbps and beyond by aggregating fewer, faster lanes, while balancing cost, power consumption, and footprint per bit. In this presentation, we delve into high-speed protocols such as Ethernet, UALink, and Ultra Ethernet, exploring the first use case where 400G/lane SerDes will potentially be deployed. We'll take a deeper look into different modulation formats along with their benefits and challenges. Special attention will be given to the adoption of optical connectivity. We aim to provide a comprehensive overview of the options available and justify their use in modern cloud service architectures.
Traditional network infrastructure observability tools fall short in AI environments, where interdependence between networking and computing layers directly impacts inference latency and throughput. Modern AI workloads—particularly large language models and computer vision pipelines—demand synchronized visibility across the data transport path (RDMA/GPU-to-GPU) and GPU execution stack to ensure performance consistency, avoid bottlenecks, and support real-time SLAs.
Our panelists will share their views and real-world learnings on the required observability paradigm shifts in open networking, spanning architecture design, the telemetry stack, the policy engine, and other components that drive closed-loop observability.
Stefan is a growth-focused and dynamic executive with extensive experience in leading all facets of technical operations. Stefan is currently serving as CTO of Dorado Software, a leading provider of Fabric Orchestration and Management for Enterprise, Cloud and Telco. Stefan's prior...
The rapid growth of AI chips has increased computational demands, and future high-performance computing (HPC) systems are expected to integrate multiple high-power chips, resulting in total power consumption of over 2.5kW and individual chip power densities exceeding 200W/cm². To tackle these challenges, advanced cooling technologies are essential to lower thermal resistance and efficiently dissipate heat. In this paper, we explore innovative structural designs for cold plates that address critical thermal management challenges for next-generation AI systems, as well as the corresponding thermal test vehicle that can generate different power densities.
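Starting from the figures in the abstract above (over 2.5 kW per system, over 200 W/cm² per chip), a back-of-envelope calculation shows the thermal-resistance budget a cold plate must meet. The per-chip power and temperature limits below are illustrative assumptions, not measured values from the paper.

```python
# Back-of-envelope cold-plate budget. Only the 200 W/cm^2 density comes
# from the abstract; the other figures are illustrative assumptions.
chip_power_w = 800.0      # one high-power chip (assumed)
t_coolant_c = 40.0        # facility coolant temperature (assumed)
t_case_max_c = 72.0       # allowed case temperature (assumed)

# Max allowed case-to-coolant thermal resistance: R = deltaT / P (K/W)
r_max = (t_case_max_c - t_coolant_c) / chip_power_w
print(f"max thermal resistance: {r_max:.3f} K/W")  # 0.040 K/W

# Die area implied by the abstract's 200 W/cm^2 heat flux:
die_area_cm2 = chip_power_w / 200.0
print(f"die area at 200 W/cm^2: {die_area_cm2:.1f} cm^2")  # 4.0 cm^2
```

A budget in the few-hundredths-of-a-K/W range is why microchannel and other structured cold-plate designs matter: plain flat plates struggle to get case-to-coolant resistance that low at these heat fluxes.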
Inference processing of large language models (LLMs) is computationally intensive, and efficient management and reuse of intermediate data, known as KV Cache, are crucial for performance improvement. In this presentation, we propose a novel architecture leveraging NTT's innovative photonics-based networking technology, "IOWN APN (All-Photonics Network)," to enable low-latency, high-bandwidth sharing of large-scale KV Cache among geographically distributed data centers. By exploiting the unique capabilities of IOWN APN, the proposed KV Cache sharing system significantly enhances inference throughput and improves power efficiency, paving the way for reduced environmental impact and more sustainable operational models for LLM inference. Through this presentation, we aim to engage with the OCP community to discuss the potential for wide-area distributed AI computing based on open standards.
In the realm of AI networks, the health of physical links is paramount to ensuring optimal performance and reliability. At Meta, we recognize that robust physical connectivity is crucial for the seamless operation of AI workloads, which demand high-speed and reliable data transmission. This presentation will delve into Meta's comprehensive strategy for maintaining healthy physical links within our AI networks.
We will explore the significance of link health in AI networks, emphasizing how it impacts overall system efficiency and performance. Meta employs advanced physical layer diagnostics, including Pseudo-Random Binary Sequence (PRBS) and Forward Error Correction (FEC) diagnostics, to rigorously test and validate link integrity before deployment into production. These diagnostics help identify potential issues, ensuring only healthy links are operational.
Furthermore, we will discuss Meta's proactive approach to managing link health in production environments. Unhealthy links are swiftly removed from service, and an automated triage pipeline is employed to facilitate effective repairs. This pipeline not only enhances the speed and accuracy of link restoration but also minimizes downtime, thereby maintaining the high reliability standards expected in AI network operations.
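The kind of pre-deployment gate implied by the PRBS/FEC diagnostics above can be sketched as a simple bit-error-ratio check. The BER limit used here is an illustrative ballpark, not Meta's actual threshold; real limits depend on the FEC in use and the operator's margin policy.

```python
def pre_fec_ber(bit_errors, bits_tested):
    """Pre-FEC bit error ratio measured over a PRBS test window."""
    return bit_errors / bits_tested

def link_healthy(bit_errors, bits_tested, ber_limit=2.4e-4):
    # ber_limit is an assumed gate; real limits depend on the FEC
    # (e.g. RS(544,514)) and the margin the operator wants to keep.
    return pre_fec_ber(bit_errors, bits_tested) < ber_limit

# 1e12 bits of PRBS traffic, 5e7 bit errors -> BER 5e-5: link passes.
print(link_healthy(5 * 10**7, 10**12))   # True
# 3e8 errors over the same window -> BER 3e-4: fail, remove from service.
print(link_healthy(3 * 10**8, 10**12))   # False
```

In production the same check runs continuously: links that drift past the gate are drained and handed to the triage pipeline rather than left to corrupt or stall AI traffic.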
This presentation delves into the challenges and opportunities AIaaS providers face in efficiently deploying and managing multi-tenant AI fabrics and clusters. Deploying SONiC AI infrastructure with optimal tuning, especially for an AIaaS provider or an enterprise supporting inference at the edge, can be a complex and daunting task. We will present the features required to simplify deployment of backend AI SONiC fabrics from a controller. Tuning fabrics that support AI must take into consideration factors such as the AI job type and its sensitivity to latency, the tier of the tenant scheduling the job, and the tuning capabilities of the underlying SONiC platforms, and then implement an adaptive solution. The presentation introduces the concept of AI tenancy and how tenancy can be considered when orchestrating and tuning the underlying infrastructure.
As AI models and data-analytics workloads continue to scale, memory bandwidth and capacity have become critical bottlenecks in modern data centers. CXL provides high-capacity, low-latency memory expansion that can be leveraged in several usage models. CXL memory expansion and pooling can significantly enhance SQL workload performance and reduce cloud TCO, particularly for in-memory databases and analytics workloads that are bandwidth- and capacity-constrained. In addition, offloading the key-value (KV) cache, which is critical for efficient autoregressive generation in LLMs, to CXL memory is emerging as an effective strategy for tackling memory bottlenecks and improving throughput in LLM inference serving.
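To see why KV-cache offload to CXL is attractive, a quick size estimate helps. The model shapes below are Llama-2-70B-like (80 layers, 8 grouped KV heads, head dimension 128) and the sequence length and batch size are illustrative assumptions; the arithmetic is the standard 2 x layers x KV heads x head_dim x tokens x bytes.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch,
                   bytes_per_elem=2):
    """Bytes of KV cache an LLM holds for a batch of in-flight requests.

    Factor of 2 covers the separate key and value tensors; bytes_per_elem=2
    assumes fp16/bf16 storage.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch * bytes_per_elem)

# Llama-2-70B-like shapes, 32K context, batch of 16 (assumed workload):
gib = kv_cache_bytes(80, 8, 128, seq_len=32768, batch=16) / 2**30
print(f"{gib:.0f} GiB of KV cache")  # 160 GiB
```

At these sizes the cache alone rivals or exceeds the HBM on a GPU, which is exactly the headroom problem that spilling colder KV entries into CXL-attached memory is meant to relieve.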
The introduction of optical circuit switches (OCSs) is considered key to cost-effectively scaling AI interconnect infrastructure. However, current AI interconnects are realized with vendor-proprietary hardware and software, so the domain lacks interoperability and openness. This can increase both capital and operational expenditure for GPU service providers. Recently, the IOWN Global Forum started an activity to define a reference implementation model for AI interconnect infrastructure. Among the several study items in that activity, this presentation introduces an open network controller framework for managing AI interconnects built with multi-vendor OCSs.
Recent advances in large-scale AI models have placed increasing pressure on the underlying compute architecture to deliver not only raw performance but also programmability and efficiency at scale. This talk introduces the Tensor Contraction Processor (TCP), a novel architecture that reconceptualizes tensor contraction as the central computational primitive, enabling a broader class of operations beyond traditional matrix multiplication. We will present the motivation behind this architectural shift, its implications for compiler design and runtime scheduling, and findings related to performance and energy efficiency. The discussion will also explore how exposing tensor contraction at the hardware level opens opportunities for more expressive and seamless execution strategies, potentially reducing data movement and improving utilization. We will share key learnings from scaling the chip across servers and racks, highlight intersections with relevant OCP Project areas, and discuss how these insights are informing our product roadmap.