AI workloads demand unprecedented levels of bandwidth, low latency, and deterministic communication across increasingly dense compute infrastructures. This work focuses on emerging network architectures tailored for AI servers, racks, and clusters—highlighting trends such as high-radix topologies, RDMA over converged Ethernet (RoCE), optical interconnects, and in-network compute. It examines how networking shapes system performance, scalability, and efficiency, and outlines architectural strategies to address bottlenecks in collective communication, model parallelism, and distributed training at hyperscale.
RoCEv2 is being widely deployed due to the emerging GenAI trend, and there is a growing need to mix AI and HPC workloads to maximize infrastructure investment efficiency. RoCEv2, which builds on RDMA technology designed decades ago for simpler workloads, is starting to show its limitations in hyperscale deployments. This has led to the formation of the UEC – Ultra Ethernet Consortium.
MUFG Bank and NTTDATA worked with the IOWN Global Forum to create use cases and Reference Implementation Models demonstrating how APN and Optical DC networks can transform financial systems. A white paper was published detailing these innovations and their applications. MUFG Bank, NTTDATA, and NTT West tested their ideas through pilot experiments, yielding two major outcomes: real-time database synchronization reduces reliance on complex backup frameworks, and virtual instances enable seamless Data Center transitions without downtime, enhancing efficiency. This allows financial institutions to meet evolving customer demands more effectively. Research is ongoing to integrate Optical Network technology with OCP hardware to address latency, bandwidth, and prioritization issues for NIC-to-NIC communication. By combining software, hardware, and networks, NTTDATA aims to create smarter, more resilient financial systems.
FBOSS is Meta’s own Software Stack for managing Network Switches deployed in Meta’s data centers. It is one of the largest services in Meta in terms of the number of instances deployed.
Network Traffic in AI Fabric presents unique challenges such as “elephant flows” (a small number of extremely large, continuous flows), and low entropy (limited variation in flow characteristics, increasing likelihood of hash collisions).
At OCP 2024, we showcased how we evolved FBOSS to tackle these challenges. This solution is capable of building non-blocking clusters for up to 4K GPUs. However, generative AI use cases demand significantly larger non-blocking clusters. This can be solved by interconnecting multiple 4K GPU clusters into a single, larger cluster using traditional Routing and ECMP. In this design, intra-cluster traffic benefits from non-blocking I/O, but inter-cluster traffic continues to suffer from poor network performance due to the aforementioned elephant flows and low entropy.
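As a minimal illustration of the low-entropy problem described above (an assumed sketch in Python, not FBOSS code; the hash function, port choices, and uplink count are arbitrary), consider how a handful of RDMA elephant flows can collapse onto the same ECMP member:

```python
# Illustrative sketch: why low-entropy AI flows collide on ECMP uplinks.
# A hypothetical switch hashes the flow 5-tuple to pick one of 8 equal-cost uplinks.
import hashlib

NUM_UPLINKS = 8

def pick_uplink(src_ip, dst_ip, src_port, dst_port, proto="udp"):
    """Hash the flow 5-tuple and map it to an uplink index."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % NUM_UPLINKS

# AI collectives: few endpoints, one RDMA QP each -> only a handful of distinct 5-tuples.
flows = [(f"10.0.0.{i}", "10.0.1.1", 4791, 4791) for i in range(4)]  # 4 elephant flows
print([pick_uplink(*f) for f in flows])  # several flows may land on the same uplink
```

With only a few distinct 5-tuples, the statistical spreading that normally balances traffic across ECMP members no longer applies, so a single uplink can end up carrying multiple elephant flows while others sit idle.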
In this talk, we will share our journey evolving FBOSS for generative AI workloads. We will discuss the hierarchical design that enables us to build significantly larger non-blocking clusters, the unique challenges we encountered in scaling both the dataplane and control plane, and the solutions we developed to overcome them. Additionally, we will highlight the SAI enhancements that were instrumental in adapting FBOSS to support the demands of generative AI.
This session highlights CXL RAS Firmware First error handling use cases implemented on top of the CXL specification, which is generic in nature. While developing CXL RAS Firmware First support, we encountered a variety of use cases that had to be solved to implement the firmware and address customer problems; resolving them required designing and implementing firmware to handle each case. Topics include: use of the primary and secondary mailboxes to support early IHV adoption during engineering and debug; use of common error signaling protocols for protocol errors; GUID and UUID usage in firmware on the CPU/host side, in CXL devices, and in the operating system; communication failures between the CPU and CXL devices at boot time and run time that must be reported to users and the operating system; techniques used to improve SMI latency; and handling of error pollution cases for protocol errors, Memory Error Firmware Notification (MEFN), and Flat2LM configurations. The CXL ecosystem comprises a multitude of component vendors spanning SoC, memory, storage, networking, and more. The explosive growth of internet content and the resulting data storage and computation requirements has led to heterogeneous, complex solutions in very large-scale data centers: warehouse-sized buildings packed with server, storage, and network hardware. In particular, if hardware detects an uncorrected fatal error that poses a containment risk, the system needs to be reset and restarted, if possible, to enable continued operation. When such an error affects the entire CXL device, a persistent/permanent memory device is considered to have experienced a dirty shutdown.
As AI/ML clusters continue to scale and breach the boundaries of a single physical location in terms of both size and power, the need to scale and interconnect different locations becomes ever more crucial.
When the challenges of interconnecting locations are extended to these use cases, a few considerations must be met:
- Allowing high bandwidth to be used effectively between geographically dispersed locations across various distances
- Support for lossless RDMA traffic
- A simple and condensed interconnection layer
The presentation will focus on how Broadcom’s Jericho product line enables the implementation of these requirements with innovations throughout the stack - from physical connectivity all the way to intelligent load balancing.
This presentation will focus on an innovative dynamic Explicit Congestion Notification (ECN) threshold testing methodology, emphasizing the design rationale for test cases and the observational analysis of experimental results. We will explore how designed test cases trigger ECN threshold changes in dynamic network environments, ensuring comprehensive and effective testing.
A key insight from our research is the critical role of qp-fairness (Queue Pair fairness) in collective benchmarking, alongside traditional metrics like algorithmic bandwidth and bus bandwidth. Through comparative analysis of real-world test data, we demonstrate how maintaining qp-fairness under dynamic conditions significantly enhances the stability of ECN mechanisms and ensures equitable allocation of network resources. By aligning theoretical insights with practical implementations, we hope to provide actionable insights for advancing research and applications in dynamic ECN technologies.
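As a point of reference, one widely used way to score fairness across Queue Pairs is Jain's fairness index; the sketch below (Python, illustrative only, and not necessarily the exact qp-fairness metric used in the study) shows how it separates an equitable allocation from a skewed one:

```python
# Jain's fairness index over per-QP throughputs: 1.0 means perfectly equal shares,
# values approaching 1/n mean one QP dominates the others.
def jain_fairness(throughputs):
    n = len(throughputs)
    total = sum(throughputs)
    return (total * total) / (n * sum(t * t for t in throughputs))

print(jain_fairness([40, 40, 40, 40]))   # 1.0  -> perfectly fair across 4 QPs
print(jain_fairness([100, 10, 10, 10]))  # ~0.41 -> one QP starves the others
```

Tracking a metric like this alongside algorithmic and bus bandwidth makes it visible when an ECN threshold change improves aggregate throughput while quietly starving individual queue pairs.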
This session will share our practical experience in AI data center architecture planning, covering the critical interplay of computing, storage, and management. We'll delve into our comprehensive approach to designing scalable and efficient AI infrastructure.
A significant portion of our discussion will be dedicated to our innovative storage architecture, which addresses diverse AI workload demands through a dual-part strategy. First, we will present our high-performance storage solution, leveraging BeeGFS and GRAID's cutting-edge technologies with NVMe SSDs to meet the demands of intense AI computation. Second, we will explore our approach to tenant object storage, specifically utilizing GRAID's SupremeRAID alongside NVMe SSDs to provide robust and scalable data management for various user requirements.
NTT is considering Data-Centric Infrastructure (DCI) using IOWN technologies. DCI processes data efficiently by combining geographically dispersed resources. To achieve this, we’re verifying Composable Disaggregated Infrastructure (CDI) – a flexible hardware solution – and considering a multi-vendor approach. CDI consists of servers, PCIe expansion boxes, and switches, enabling software-controlled allocation of accelerators for optimal performance. Utilizing multi-vendor CDI requires an interface like OFA Sunfish to reduce operational costs. Our verification has revealed challenges in the physical operation of CDI and implementing a multi-vendor configuration. These include increased cabling costs, racking limitations, and inconsistencies in product functionalities and procedures requiring careful configuration management. This session will share these challenges and proposed solutions.
The proposed Layer 2 transparent network, bridging VM and container networks, is a software-defined network for AI services deployment. The cloud provider offers a tenant-aware, transparent network that combines VM and container networks into the same network domain. The benefits of this network are full Layer 2 connectivity and reduced communication overhead in a multi-tenant cloud system. A tenant can deploy its services in VMs and containers, and communications among those VMs and containers stay in the same Layer 2 domain. This reduces routing effort and isolates network traffic among different tenants.
Decades-old copper and optical interconnect technologies limit AI cluster compute efficiency. The presentation will showcase e-Tube Technology - RF data transmission over plastic waveguide - and how it breaks the barriers of these legacy technologies by providing near-zero latency and 3x better energy efficiency than optics at a cost structure similar to copper. e-Tube is an ideal replacement for copper for terabit interconnect to scale up next-generation AI clusters.
As data traffic continues to surge across AI networks, the need for higher bandwidth and efficient signal connectivity is critical. With 200G/lambda generation well on the way to production, focus is quickly moving to the next 400G/lane SerDes, which represents a significant leap in interconnect performance. This advancement enables interconnects capable of reaching 3.2Tbps and beyond by aggregating fewer, faster lanes with the need to balance cost, power consumption and footprint per bit. In this presentation, we delve into the high-speed protocols such as Ethernet, UALink, and Ultra Ethernet – exploring the first use case where 400G/lane SerDes will potentially be deployed. We’ll take a deeper look into different modulation formats with their benefits and challenges. Special attention will be given to the adoption of optical connectivity. We aim to provide a comprehensive overview of the options available and justify their use in modern cloud service architectures.
Traditional network infrastructure observability tools fall short in AI environments, where interdependence between networking and computing layers directly impacts inference latency and throughput. Modern AI workloads—particularly large language models and computer vision pipelines—demand synchronized visibility across the data transport path (RDMA/GPU-to-GPU) and GPU execution stack to ensure performance consistency, avoid bottlenecks, and support real-time SLAs.
Our panelists will share their views and real-world learnings on the required observability paradigm shifts in open networking, in terms of architecture design, telemetry stack, policy engine, and more, that drive closed-loop observability.
The rapid growth of AI chips has increased computational demands, and future high-performance computing (HPC) systems are expected to integrate multiple high-power chips, resulting in total power consumption of over 2.5kW and individual chip power densities exceeding 200W/cm². To tackle these challenges, advanced cooling technologies are essential to lower thermal resistance and efficiently dissipate heat. In this paper, we explore innovative structural designs for cold plates that address critical thermal management challenges for next-generation AI systems, as well as the corresponding thermal test vehicle that can generate different power densities.
Inference processing of large language models (LLMs) is computationally intensive, and efficient management and reuse of intermediate data, known as KV Cache, are crucial for performance improvement. In this presentation, we propose a novel architecture leveraging NTT's innovative photonics-based networking technology, "IOWN APN (All-Photonics Network)," to enable low-latency, high-bandwidth sharing of large-scale KV Cache among geographically distributed data centers. By exploiting the unique capabilities of IOWN APN, the proposed KV Cache sharing system significantly enhances inference throughput and improves power efficiency, paving the way for reduced environmental impact and more sustainable operational models for LLM inference. Through this presentation, we aim to engage with the OCP community to discuss the potential for wide-area distributed AI computing based on open standards.
In the realm of AI networks, the health of physical links is paramount to ensuring optimal performance and reliability. At Meta, we recognize that robust physical connectivity is crucial for the seamless operation of AI workloads, which demand high-speed and reliable data transmission. This presentation will delve into Meta's comprehensive strategy for maintaining healthy physical links within our AI networks.
We will explore the significance of link health in AI networks, emphasizing how it impacts overall system efficiency and performance. Meta employs advanced physical layer diagnostics, including Pseudo-Random Binary Sequence (PRBS) and Forward Error Correction (FEC) diagnostics, to rigorously test and validate link integrity before deployment into production. These diagnostics help identify potential issues, ensuring only healthy links are operational.
Furthermore, we will discuss Meta's proactive approach to managing link health in production environments. Unhealthy links are swiftly removed from service, and an automated triage pipeline is employed to facilitate effective repairs. This pipeline not only enhances the speed and accuracy of link restoration but also minimizes downtime, thereby maintaining the high reliability standards expected in AI network operations.
The OCP Composable Security Architecture has been codified year over year, and strides have been made across data center specifications to articulate architectural building blocks, provide open-source reference implementations that OCP members can productize, and define compliance guidelines. In this presentation the speaker will summarize the various initiatives and share how AMD has been adopting these building blocks in its datacenter products.
This presentation delves into challenges and opportunities for AIaaS providers to efficiently deploy and manage multi-tenant AI fabrics and clusters. Deployment of SONiC AI infrastructure with optimal tuning, especially for an AIaaS provider or an enterprise supporting inference at the edge, can be a complex and daunting task. We will present the features required to simplify deployment of backend AI SONiC fabrics in a controller. Tuning fabrics that support AI must take into consideration factors such as the AI job type and its sensitivity to latency, the tier of the tenant scheduling the job, and the tuning capabilities of the underlying SONiC platforms, and implement an adaptive solution. The presentation introduces the concept of AI tenancy and how tenancy can be considered when orchestrating and tuning the underlying infrastructure.
As AI models and data analytics workloads continue to scale, memory bandwidth and capacity have become critical bottlenecks in modern data centers. CXL provides high-capacity, low-latency memory expansion that can be leveraged in different usage models. CXL memory expansion and pooling can significantly enhance SQL workload performance and reduce cloud TCO, particularly for in-memory databases and analytics workloads that are bandwidth and capacity constrained. Offloading the key-value (KV) cache, which is critical for efficient autoregressive generation in LLMs, to Compute Express Link (CXL) memory is also emerging as an effective strategy to tackle memory bottlenecks and improve throughput in large language model (LLM) inference serving.
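To make the KV-cache offload idea concrete, the following Python sketch models a two-tier cache in which hot blocks stay in local DRAM/HBM and colder blocks spill to a CXL-attached capacity tier; the class and eviction policy are illustrative assumptions, not a specific vendor implementation:

```python
# Illustrative two-tier KV cache: hot blocks in the fast local tier, cold blocks
# spilled to a CXL capacity tier (modeled here simply as a second dictionary).
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, local_capacity_blocks):
        self.local = OrderedDict()   # fast tier (HBM/DRAM), kept in LRU order
        self.cxl = {}                # capacity tier (CXL-attached memory pool)
        self.capacity = local_capacity_blocks

    def put(self, seq_id, layer, kv_block):
        key = (seq_id, layer)
        self.local[key] = kv_block
        self.local.move_to_end(key)
        while len(self.local) > self.capacity:        # evict coldest block to CXL
            cold_key, cold_block = self.local.popitem(last=False)
            self.cxl[cold_key] = cold_block

    def get(self, seq_id, layer):
        key = (seq_id, layer)
        if key in self.local:
            self.local.move_to_end(key)               # mark as recently used
            return self.local[key]
        block = self.cxl.pop(key)                     # slower fetch from CXL tier
        self.put(seq_id, layer, block)                # promote back on reuse
        return block
```

The point of the tiering is that long or idle sequences no longer force expensive recomputation or cache eviction: their KV blocks survive in the larger CXL tier and are promoted back only when the sequence resumes decoding.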
The introduction of optical circuit switches (OCSs) has been considered key to cost-effectively scaling the AI interconnect infrastructure. However, current AI interconnects are realized with vendor-proprietary hardware and software solutions, so we lack interoperability and openness in this domain. This can increase both capital and operational expenditure for GPU service providers. Recently, the IOWN Global Forum started an activity on defining a reference implementation model for the AI interconnect infrastructure. Among several study items in that activity, this presentation introduces an open network controller framework for managing the AI interconnect with multi-vendor OCSs.
Recent advances in large-scale AI models have placed increasing pressure on the underlying compute architecture to deliver not only raw performance but also programmability and efficiency at scale. This talk introduces the Tensor Contraction Processor (TCP), a novel architecture that reconceptualizes tensor contraction as the central computational primitive, enabling a broader class of operations beyond traditional matrix multiplication. We will present the motivation behind this architectural shift, its implications for compiler design and runtime scheduling, and findings related to performance and energy efficiency. The discussion will also explore how exposing tensor contraction at the hardware level opens opportunities for more expressive and seamless execution strategies, potentially reducing data movement and improving utilization. We will share key learnings from scaling the chip across servers and racks, highlight intersections with relevant OCP Project areas, and discuss how these insights are informing our product roadmap.
AI workloads are reshaping the architecture and demands of modern data centers, calling for high-performance, scalable, and energy-efficient infrastructure. This presentation explores how AI-driven transformation is impacting data center design and operations, and highlights how Delta leverages its expertise in power and thermal solutions to meet these demands. Delta’s integrated systems play a crucial role in ensuring reliable, intelligent, and sustainable operations in the age of AI.
Wednesday August 6, 2025 9:10am - 9:30am PDT TaiNEX2 - 701 G
Nowadays, data centers use dielectric fluids as a coolant to prevent damage and downtime when leakage occurs. However, dielectric fluids typically have high viscosity and low specific heat, resulting in poor cooling performance. To improve performance while retaining the benefits of dielectric fluids, Superfluid technology has emerged and been investigated. Superfluid technology introduces air into the coolant, forming bubbles that reduce the frictional resistance in the movement of the coolant. This results in a lower boundary layer thickness and enhances the heat convection coefficient. When using a specific dielectric fluid with superfluid technology, the heat transfer capacity can achieve 66% of that of water (compared to 55% with dielectric fluid alone). This paper implemented superfluid technology on an AI server with a cold plate solution as a test platform and explored the improvements brought by superfluid technology.
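For context, single-phase cold-plate heat removal follows Newton's law of cooling, so the coolant-side thermal resistance scales inversely with the convection coefficient that the bubble-thinned boundary layer improves (the numbers below simply restate the ratios quoted above, under the assumption that heat transfer capacity scales with h):

$$ q = h A \,(T_\mathrm{wall} - T_\mathrm{fluid}), \qquad R_\mathrm{conv} = \frac{1}{hA} $$

Moving from 55% to 66% of water's effective performance therefore cuts the coolant-side resistance by a factor of about 0.55/0.66, i.e. roughly 17%, without giving up the dielectric fluid's leak tolerance.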
The growing scale and specialization of AI workloads are reshaping infrastructure design. The Arm Chiplet System Architecture enables custom silicon and chiplets to meet market-specific needs. In this talk, we explore how chiplet-based designs optimize performance and lower total cost of ownership. Learn how standards, compute subsystems, and a maturing ecosystem are reshaping the datacenter at scale.
I will go through the topic from single-server-node design to data-centre-level design under economies of scale, covering mechanical, thermal, power, and related aspects in which we can differentiate ourselves.
The talk explored Azure’s purpose-built infrastructure—featuring advanced accelerators, scalable networking, and robust orchestration—and its journey of innovation through partnerships, including the Mount Diablo project with Meta/Google. Emphasis was placed on overcoming challenges in power delivery, cooling, and energy efficiency, with a call to reimagine system architecture and embrace high-voltage DC solutions to sustainably scale next-generation AI workloads.
Liteon will share its latest advancements in power solutions for AI infrastructure, focusing on high-efficiency, high-density designs for GPU-centric systems. This session will explore how Liteon's integrated architectures support scalable deployment in modern data centers, addressing the growing demands of performance and energy optimization.
Wednesday August 6, 2025 9:30am - 9:50am PDT TaiNEX2 - 701 G
As data centers evolve to meet increasing demands for energy efficiency, operational safety, and environmental sustainability, cooling technologies play a pivotal role in enabling this transformation. This presentation explores how synthetic ester coolants offer a versatile and eco-friendly solution to address the diverse thermal management needs of modern data centers.
- Meta's latest AI/ML rack design, Catalina (GB200), features a compute tray that houses the primary CPU+GPU components. To expedite time-to-market, we leveraged industry solutions while implementing targeted customizations to optimize integration within Meta's infrastructure.
- The increasing power density of AI hardware poses significant challenges, including the need for liquid cooling, which introduces complexities in leak detection, system response, reliability, and safety. With multiple hardware platforms in rapid development, there is a pressing need for adaptable hardware that can manage these new interfaces and controls.
- Our solution, the RMC (Rack Management Controller) tray, addresses these challenges by providing a 1OU device that handles all leak detection and hardware response to leaks. The RMC offers flexible integration into upcoming AI platforms and interfaces with various systems, including Air-Assisted Liquid Cooling (AALC), Facility Liquid Cooling (FLC), and all leak sensors. The RMC provides a robust and reliable solution for managing liquid cooling across Meta’s multiple platforms.
Data center operators and silicon providers are aligning on a durable coolant temperature of 30℃ to meet long-term roadmaps. There is also interest in supporting higher coolant temperatures for heat reuse and lower temperatures for the extreme density required by AI workloads. To understand coolant temperature requirements, the thermal resistance from silicon to the environment will be discussed. In addition, areas of thermal performance to be investigated by the industry will be reviewed.
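A simple series resistance model (with illustrative values, not vendor data) shows how the coolant temperature trades off against thermal margin:

$$ T_j = T_\mathrm{coolant} + P \left( R_\mathrm{jc} + R_\mathrm{TIM} + R_\mathrm{cold\,plate} + R_\mathrm{fluid} \right) $$

For example, a 1000 W device with a 0.045 K/W silicon-to-coolant resistance stack runs at about a 75 °C junction temperature on 30 °C coolant; raising the coolant to 40 °C for heat reuse pushes that to roughly 85 °C, while denser AI parts may need colder coolant to stay within the same junction limit.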
As technological progress and innovation continue to shape server products, Wiwynn introduces a reinforced chassis with a novel embossed pattern design to reduce material consumption and carbon footprint. This paper presents the development process, from pattern optimization using Finite Element Analysis (FEA) to real-world static and dynamic mechanical testing for verification. Through this approach, Wiwynn successfully developed an embossed pattern, enabling the replacement of the original heavy chassis with a thinner and lighter design. In current applications, this innovation has reduced material usage by at least 16.7% and lowered carbon emissions by approximately 15.9%, while achieving a 4.2% cost reduction. This lightweight, cost-effective, and sustainable chassis design reinforces Wiwynn’s commitment to sustainable server solutions and offers potential for further development.
As AI servers rapidly scale in performance and density, traditional data centers face increasing challenges in meeting low PUE (Power Usage Effectiveness) targets and cooling high-TDP (Thermal Design Power) components due to infrastructure limitations. Cold plate liquid cooling has emerged as a mainstream solution with its high thermal efficiency. However, the risk of coolant leakage — potentially damaging AI systems — remains a significant concern. While existing mechanisms (e.g., leak detection) offer a partial safeguard, they do not address the root cause. To resolve this, Intel introduces a game-changing approach by replacing conventional coolants with dielectric fluids, inherently eliminating the threat of electrical damage from leaks. Recognizing the thermal performance limitations of dielectric fluids compared to water, Intel integrates superfluid technology into the CDU to dramatically enhance heat dissipation capabilities. This innovation not only fortifies cold plate cooling systems but also paves the way for extending the benefits to single-phase immersion cooling, redefining the technical boundaries of liquid cooling in data centers.
Large Language Models (LLMs) have demonstrated exceptional performance across numerous generative AI applications, but require large model parameter sizes. These parameters range from several billion to trillions, leading to significant computational demands for both AI training and inference. The growth rate of these computational requirements significantly outpaces advancements in semiconductor process technology. Consequently, innovative IC and system design techniques are essential to address challenges related to computing power, memory, bandwidth, energy consumption, and thermal management to meet AI computing needs.
In this talk, we will explore the evolution of LLMs in the generative AI era and their influence on AI computing design trends. For AI computing in data centers, both scale-up and scale-out strategies are employed to deliver the huge computational power required by LLMs. Conversely, even smaller LLM models for edge devices demand more resources than previous generations without LLMs. Moreover, edge devices may also act as orchestrators in device-cloud collaboration. These emerging trends will significantly shape the design of future computing architectures and influence the advancement of circuit and system designs.
Nepal is in the early stages of digitisation. Since the political reforms, infrastructure, digitisation, and e-governance projects have been scaling widely; however, the infrastructure that was supposed to scale has not kept pace with demand. Sustainable solutions and scalability have not been a priority, as awareness is still being built, yet basic data center design around fintech and healthcare technology is scaling widely. Working closely with ministries and government projects, we see clear requirements within the government, but the right direction and roadmap are needed; a national-level blueprint document covering AI, healthcare, fintech, and a national interoperability project is in the pipeline. The interoperability layer requires significant resources to build: a health-related interoperability layer has been built, and we are now working toward the national IOL layer. OpenMRS, OpenStack, OpenHIM, OpenHIE, Ubuntu, Nutanix, and Dell are key players.
The relentless demand for AI is driving hyperscalers to deploy ever-increasing clusters of GPUs and custom accelerators. As these deployments scale, system architectures must evolve to balance cost, performance, power, and reliability. A critical aspect of this evolution is the high-speed signaling that connects the various components. This presentation delves into the high-speed protocols such as PCIe, CXL, UALink, Ethernet, and Ultra Ethernet – exploring their intended use cases and evaluating where these protocols are complementary or competitive. Additionally, the presentation will address the evolving Scale-Up and Scale-Out architecture, highlighting their respective protocols and interconnect solutions. Special attention will be given to the adoption of Ethernet as a problem-solving technology in AI-driven environments. Through this discussion, we aim to provide a comprehensive overview of the options available and justify their use in modern cloud service architectures.
With the rise of AI computing, traditional air cooling methods are no longer sufficient to handle the thermal challenges in high-performance computing (HPC) systems. Liquid cooling has emerged as a reliable and efficient alternative to dissipate heat at kilowatt levels. In this presentation, we will introduce the liquid cooling technologies developed by TAIWAN MICROLOOPS, including the Cooling Distribution Unit (CDU) and various types of cold plates. Standard and customized CDUs are designed to meet refrigeration capacity demands ranging from several kilowatts to hundreds of kilowatts. We will also demonstrate both single-phase and two-phase cold plates. These solutions are designed to enhance thermal management efficiency and meet the increasing demands of AI-driven data centers.
This presentation outlines the evolving requirements and technical considerations for next-generation Open Rack V3 (ORv3) Power Supply Units (PSUs) and power shelves, with a focus on the transition from ORv3 to High Power Rack (HPR) and HPR2 architectures. It highlights significant advancements such as increased power density from 33kW to 72kW and enhanced support for AI-driven pulse load demand. An HVDC architecture is also introduced for quick adaptation, addressing the bus bar challenge as power demand from AI continues to increase.
Wednesday August 6, 2025 10:45am - 11:00am PDT TaiNEX2 - 701 G
The future of artificial intelligence (AI) places continuously growing demands on performance, efficiency, and scalability in modern data centers. As a designer of advanced server CPUs and specialized AI accelerators, AMD plays a crucial role in addressing these priorities. AMD delivers leading high-performance computing solutions, from advanced chiplet architecture and server design to rack and data center infrastructure, to meet AI market demands.
FuriosaAI's technology demonstrates to infrastructure and data center AI deployment professionals that ever more powerful GPU advancements are great for hyperscalers but poorly matched to typical data centers ("Leveraging OCP for Sovereign AI Plans", presented by Supermicro, shows that over 70% of data centers are 50 kW - 0.5 MW). The ability to openly choose compute projects designed to make computing more sustainable is a cornerstone of the OCP.
We will introduce the Tensor Contraction Processor (TCP), a novel architecture that reconceptualizes tensor contraction as the central computational primitive, enabling a broader class of operations beyond traditional matrix multiplication, and show how it unlocks AI inference chips that achieve the performance, programmability, and power-efficiency trifecta for data centers.
Given the power constraints of data centers and the wide variation in rack power capacities, we are learning to evaluate total token generation throughput across AI accelerators within the same rack power budget, which is a metric that resonates strongly with our early enterprise and AI compute provider partners.
Energy efficiency is one of the main contributors to reaching the Paris Agreement. By optimizing the world’s energy consumption, and being able to produce more from less, we can meet our increased energy demand and reduce CO2 emissions at the same time. In fact, according to the International Energy Agency, increased efficiency could account for more than 40% of emissions reductions in the next 20 years. As much as 50% of data center potential for energy saving comes from the waste heat recovery, and 30% can be achieved in data center buildings. And the solutions to enable these energy efficiency improvements already exist! We have decades of experience developing plate heat exchanger technologies that support our customers to optimize energy use in their processes. Our unique thermal solutions make it possible to save dramatic amounts of energy and electric power and thereby reduce carbon emissions!
The shift to +/-400V DC power systems is crucial to meet the rising power demands of AI/ML applications, supporting rack densities of >140 kW. This transition introduces significant challenges for power distribution within datacenters. Critical components like bus bars, connectors, and cables must meet stringent requirements for power handling, thermal management, reliability, safety, and density. This paper explores design solutions for electromechanical interconnects in these high-power environments, drawing parallels with mature ecosystems in industries like Electric Vehicles. Innovative approaches to bus bar design and connector technology offer the performance and space savings needed for next-gen AI/ML infrastructure. The discussion addresses crucial safety aspects, including arc flash mitigation, insulation systems, and touch-safe designs. By overcoming these challenges, the industry can accelerate the transition to higher voltages, unlocking AI/ML platforms' full potential.
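A quick, illustrative calculation (assuming the ±400 V bus delivers 800 V between poles) shows why the higher distribution voltage matters so much for bus bars, connectors, and cables:

$$ I = \frac{P}{V}, \qquad P_\mathrm{loss} = I^2 R $$

A 140 kW rack draws about 175 A at 800 V versus roughly 2,900 A at 48 V; for the same conductor resistance, the I²R loss falls by a factor of nearly 280, which is what makes the higher voltage attractive despite the added insulation, creepage, and arc-flash requirements discussed above.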
Wednesday August 6, 2025 11:00am - 11:15am PDT TaiNEX2 - 701 G
As the number of CPU cores grows significantly, the demand for hardware partitioning has become evident. Hardware partitioning can improve the security, multi-tasking ability, and resource efficiency of each CPU. In this paper, we’d like to share Wiwynn’s concept of a Hardware Partitioning (HPAR) architecture, which can be implemented in a multi-CPU system with a single DC-SCM. With the help of an assistant BMC, the BMC has access to each CPU, and a dual-socket system can boot up as either a single node or dual nodes. The HPAR method creates strict boundaries between sockets, which reduces the risk of unauthorized access or data leakage between partitions. Also, each partition can perform different tasks on one system simultaneously, optimizing hardware utilization by segmenting workloads.
When planning and operating an Internet Data Center (IDC), PUE (Power Usage Effectiveness) is a critical metric for licensing and energy performance. While technologies like direct liquid cooling and immersion cooling are effective, they often require high capital investments.
We propose an efficient and scalable solution: Turbo Blowers + Free Cooling + Heat Reuse System
- Introduce outdoor air via high-efficiency turbo blowers to remove heat from hot aisles.
- Capture and reuse the exhausted heat for drying, building heating, or hot water systems.
- Proven performance: Microsoft applied free cooling with a PUE around 1.22 in 2021.
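For reference, PUE is simply the ratio of total facility power to IT power:

$$ \mathrm{PUE} = \frac{P_\mathrm{total\ facility}}{P_\mathrm{IT}} $$

so a PUE of about 1.22 means that for every 1 kW delivered to IT equipment, roughly 0.22 kW goes to cooling, power conversion, and other facility overhead.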
Trans-Inductor Voltage Regulator (TLVR) technology is a new onboard xPU power delivery solution proposed by Google at IEEE APEC 2020.
■ TLVR is an innovative fast-transient onboard voltage regulator (VR) solution for xPUs. This VR topology provides increased VR bandwidth, faster transient response, and a potential reduction in decoupling capacitors.
■ TLVR has been widely used in recent years since it offers good transient performance with reduced equivalent output transient inductance. However, existing TLVR has not been optimized for power efficiency and density.
■ One of the limitations is that each trans-inductor has to be designed for the peak load current in terms of magnetic core saturation.
■ Zero Bias TLVR was introduced to address this limitation. It moves one phase from the primary side to the secondary side.
■ By doing so, the secondary-side phase is able to drive the TLVR secondary winding with equal magnitude and opposite direction to the primary winding current for both DC and transient conditions.
Wednesday August 6, 2025 11:15am - 11:30am PDT TaiNEX2 - 701 G
This session delves into the critical design considerations and emerging challenges associated with immersion cooling for high-speed signals in data centers. Key topics include the electrical characterization of cooling liquids, the performance benefits of improved thermal environments, and the impact of immersion fluids on high-speed interconnects—from individual components to entire signal channels. The discussion also covers design optimization strategies tailored for submerged environments. Finally, the session highlights the current state of industry readiness and the technical hurdles that must be addressed to ensure reliable high-speed signaling under immersion cooling conditions.
In this presentation, we will showcase a new data center architecture based on the OCP Rack with liquid-cooling equipment. For such a new AI cluster, we show how to collect, store, analyze, and visualize data to give data center managers the ability to effectively manage this new architecture. We also provide a mechanism to cooperate with the existing operations support system to seamlessly integrate the new AI cluster architecture into legacy data center management. Going further, we will propose a new approach for using AI methodology to manage AI clusters as Wiwynn's future work.
As the MHS standards continue to grow, the need to complete the remaining elements of the solution becomes critical. Intel and UNEEC have been following the Edge-MHS standardization and working on developing off-the-shelf chassis solutions that can easily enable the Edge-MHS building blocks.
Inference tasks vary widely in complexity, data size, latency requirements, and parallelism, and each workload type interacts differently with CPU capabilities. Understanding this relationship allows for more effective hardware selection and optimization strategies tailored to specific use cases.
Key Learning Areas
- AI Model Architecture
- Types of Inference Workloads
- Quantization: Balancing Accuracy and Efficiency (see the sketch below)
- Data Throughput and Bandwidth
- Benchmarking Inference Performance
- Frameworks and Libraries Impact Performance
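As a concrete example of the quantization topic listed above, the short Python sketch below applies symmetric post-training INT8 quantization to a weight tensor (a generic technique sketch using NumPy, not code from any specific framework covered in the session):

```python
# Symmetric per-tensor INT8 quantization: store weights as int8 plus one scale factor.
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0                      # map largest weight to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())                # per-weight rounding error stays small
```

The trade-off is visible directly: INT8 storage and arithmetic cut memory traffic roughly 4x versus FP32, at the cost of a small, bounded rounding error per weight, which is why quantization sits at the heart of CPU inference efficiency.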
With the rapid development of AI, the demand for performance in data centers and computing infrastructure continues to rise, bringing significant challenges in energy consumption and heat dissipation. This paper discusses the application of AI in infrastructure and thermal management solutions, focusing on how Auras products integrate advanced intelligent cooling systems and temperature control technologies. By leveraging AI-driven monitoring and control, energy efficiency is significantly improved. Looking ahead, as AI technology advances, intelligent infrastructure and innovative thermal management will become key drivers for high-performance computing and green energy saving.
As memory capacity and bandwidth demands continue to rise, system designs are pushing toward higher memory density—particularly in dual-socket server platforms. This session will explore the thermal design challenges and considerations involved in supporting a 2-socket, 32-DIMM configuration on the latest Intel® Xeon® platform within a standard 19-inch rack chassis. In such configurations, DIMM pitch is constrained to 0.25"–0.27", significantly increasing the complexity of memory cooling. We will present thermal evaluation results based on Intel-developed CPU and DDR5 Thermal Test Vehicles (TTVs), which simulate real-world heat profiles and airflow interactions.
The recently published Ultra Ethernet 1.0 specification is an ambitious effort to tune the Ethernet stack to accommodate AI and HPC workloads. It covers everything from the physical layer to software APIs. What makes it different? How does it work? This session explains the whys, hows, and whats of UEC 1.0 and describes the future of high-performance networking.
The rapid evolution of semiconductor technology and the growing demand for heterogeneous integration have positioned advanced packaging as a critical enabler of next-generation electronic systems. As devices become more compact and functionally dense, traditional single-die analysis methods are no longer sufficient. Instead, a system-level approach—spanning from silicon to full system integration—is essential to ensure performance and reliability. This talk explores how advanced packaging technologies such as 2.5D/3D IC and chiplets serve as the foundation for silicon-to-system multiphysics analysis. We delve into the multi-scale, multi-domain simulation challenges—including thermal, mechanical, electrical, and optical interactions—and examine how state-of-the-art simulation tools and methodologies are bridging the gap between design abstraction levels. Finally, we present an AI-driven thermal analysis that illustrates how complex chiplet designs influence floorplanning decisions. The proposed approach accelerates design space exploration, enhances prediction accuracy, and enables optimization of packaging architectures—from chiplet interconnects to full-system integration.
The Universal Quick-Disconnect (UQD) has played a significant role in the cooling ecosystem for GPUs and genAI. In order to scale, and to further enable the adoption of liquid throughout the ecosystem, a workstream was established at the end of 2024 to develop a UQD Version 2. The purpose of this workstream is to update the UQD/UQDB v1 specification such that gaps in requirements and performance are resolved, ambiguity is removed, and true interoperability is defined and achievable. Key deliverables include unification of the UQD and UQDB as a singular specification, defined performance and interoperability testing requirements, and realization of a new mating configuration. Progress updates with relevant performance attributes and technical detail of the v2 proposal will be discussed, as well as plans for official release and deployment.
Beth Langer is the Lead Technical Engineer in the Thermal Management Business Unit at CPC, where all connectors manufactured for liquid cooling applications meet or exceed established criteria.
For decades the motherboard ecosystem has toiled in the service of the steady tick/tock beat of server processor roadmaps. That was then - this is now! Today there are multiple processor lines, each within a larger set of processor makers than ever before in the server industry. The complexity of server processor complexes has skyrocketed, increasing board layers, design rules, and all manner of motherboard attributes.
The DC-MHS standards come at the right time. Motherboards (transformed now to HPMs) can be much more efficiently produced when originated by the processor manufacturers. The advent of the HPM reduces costs, increases diversity of systems and generally allows the ecosystem to innovate around the processor complex including baseboard management. This comes at exactly the time when the design aperture seemed to be closing on server system vendors. DC-MHS standards have created a whole new opportunity to build thriving horizontal ecosystems.
As AI workloads push rack power demands well beyond the ~30 kW limits of Open Rack v3, the industry has defined a High-Power Rack (HPR) standard that delivers over 200 kW per rack. This talk explains how liquid-cooled vertical busbars integrate coolant channels around copper conductors to dramatically improve heat removal and reduce I²R losses, all while fitting into existing ORv3 form factors. It also covers modular power-whip assemblies for simplified maintenance, upgraded high-voltage PSUs and battery backup units for resilience, and how OCP member companies collaborate on safety, interoperability, and scalability. Together, these innovations form an end-to-end ecosystem enabling next-generation AI data centers to meet extreme power, thermal, and reliability requirements.
Wednesday August 6, 2025 1:10pm - 1:30pm PDT TaiNEX2 - 701 G
Listeners will gain a clear understanding of the differences between single-phase (1P) and two-phase (2P) direct liquid cooling (DLC) technologies, including the thermal mechanisms, benefits, and limitations of each. The paper offers practical insights into real-world challenges of implementing 2P DLC, such as pressure drop effects, series vs. parallel configurations, and flow imbalance. A new method for calculating thermal resistance in 2P systems is introduced, enabling fair comparison to 1P systems. Listeners will also learn about economic and operational barriers to 2P adoption, including refrigerant costs and high system pressure. By the end, they will understand why 1P DLC is currently more viable for mass deployment and what advancements are needed for 2P DLC to become practical for data centers.
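For orientation, a common baseline formulation (illustrative, and not necessarily the new method proposed in the paper) defines cold-plate thermal resistance against the inlet fluid temperature in single-phase loops and against the saturation temperature in two-phase loops:

$$ R_{th,1\phi} = \frac{T_\mathrm{case} - T_\mathrm{fluid,in}}{Q}, \qquad R_{th,2\phi} = \frac{T_\mathrm{case} - T_\mathrm{sat}}{Q} $$

Referencing the saturation temperature is what enables a like-for-like comparison, since a boiling coolant holds a nearly constant temperature along the cold plate while a single-phase coolant warms up as it absorbs power.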
This talk presents the integration of OpenBMC with Arm Fixed Virtual Platforms (FVP) to prototype manageability features aligned with SBMR compliance. It showcases lessons from virtual platform development, sensor telemetry, and Redfish-based remote management, enabling early-stage validation without physical hardware.
As compute densities soar and chip thermal loads rise, data centers are under pressure to deliver efficient, scalable cooling without extensive retrofits. 2-phase liquid cooling, integrated into modular sidecar systems, offers a high-performance, energy-efficient solution that meets this need while maintaining compatibility with existing infrastructure. The presentation will dive into how sidecar architectures—deployed alongside standard racks—leverage the latent heat of vaporization to manage extreme heat loads with minimal coolant flow. By maintaining constant temperatures across cold plates, 2-phase cooling ensures thermal uniformity for processors with varying power profiles, preventing hot spots and throttling. Key takeaways will include how 2-phase sidecars enable efficient, localized cooling without facility water, deployment strategies for retrofitting existing data centers without major disruption, and environmental benefits such as reduced energy use and a lower carbon footprint.
As AI and ML power demands increase, driving rack power levels to 140 kW and necessitating higher voltages like +/-400V DC, optimizing bus bar systems becomes crucial for efficient, reliable power delivery. Bus bars, ideal for high-current applications, face unique challenges in high-density AI/ML racks, including thermal management, space optimization, structural rigidity, and safety. This paper explores advanced design techniques for future AI/ML power architectures, covering material selection (e.g., copper, aluminum), cross-section optimization, insulation strategies, and terminal methods. Thermal and mechanical simulations ensure performance and durability. Critical safety features, such as touch protection and creepage distances, are integrated. These solutions aim to develop robust power infrastructure for next-gen AI/ML data centers.
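As a back-of-the-envelope companion to the design techniques above, the Python sketch below estimates per-meter conduction loss and current density for a candidate copper cross-section (dimensions, materials, and limits are illustrative assumptions, not recommendations from the paper):

```python
# First-order bus bar sizing checks: I^2*R loss per meter and current density.
RHO_CU = 1.72e-8   # ohm*m, copper resistivity at 20 degC
RHO_AL = 2.82e-8   # ohm*m, aluminium: ~64% more resistive but far lighter per volume

def loss_per_meter(current_a, width_m, thickness_m, rho=RHO_CU):
    """I^2*R dissipation per meter of one conductor; DC, skin effect ignored."""
    return current_a ** 2 * rho / (width_m * thickness_m)

def current_density(current_a, width_m, thickness_m):
    """A/mm^2 - a first-order check against a thermal design limit."""
    return current_a / (width_m * thickness_m * 1e6)

# 140 kW rack on a +/-400 V bus (~175 A per pole) with a 50 mm x 6 mm copper bar:
print(loss_per_meter(175, 0.05, 0.006))      # ~1.8 W per meter
print(current_density(175, 0.05, 0.006))     # ~0.58 A/mm^2 - comfortable margin
```

The same helpers make the copper-versus-aluminium trade explicit: aluminium's higher resistivity demands a larger cross-section for the same loss and temperature rise, which is exactly the kind of material and geometry trade the thermal and mechanical simulations described above are meant to settle.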
Wednesday August 6, 2025 1:30pm - 1:50pm PDT TaiNEX2 - 701 G
This presentation will showcase the current two-phase CDU design and the two-phase cold plate samples. The test results of the two-phase cold plate samples are compared with those of the same samples filled with PG25 as the working fluid. Based on the comparison, the potential of the two-phase cold plates can be discovered. Without significantly altering the existing single-phase architecture, the two-phase coolant can be distributed freely to various racks, providing a solution for the chips with locally higher heat flux. Lastly, the future role of the pumped two-phase solution in the cooling environment, and the forthcoming business model will be discussed.
As data center power densities surge, traditional air cooling increasingly fails to meet thermal demands efficiently. This presentation explores the evolution of Direct Liquid Cooling (DLC), tracing its progression from single-phase to two-phase technologies. We begin by examining single-phase DLC, where coolant absorbs heat without phase change, offering reliable yet limited performance. We then transition to two-phase DLC, where phase change enables significantly higher heat flux dissipation through latent heat transfer. Key distinctions in efficiency, system complexity, and deployment readiness are analyzed. The session concludes with emerging trends such as low-GWP dielectric fluids and 3D chip cooling that position two-phase DLC as a critical enabler for next-generation high-performance computing and AI workloads.
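The efficiency distinction comes down to sensible versus latent heat absorption; with illustrative property values (not from the presentation), the difference per kilogram of coolant is roughly:

$$ q_{1\phi} = \dot{m}\, c_p\, \Delta T, \qquad q_{2\phi} = \dot{m}\, h_{fg}\, x $$

A water-glycol loop with c_p ≈ 3.8 kJ/(kg·K) and a 10 K temperature rise absorbs about 38 kJ per kilogram, while a dielectric refrigerant with h_fg ≈ 130 kJ/kg absorbs roughly 100 kJ per kilogram at 80% exit vapor quality, and does so at a nearly constant saturation temperature, which is what enables the higher heat-flux dissipation noted above.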
Google contributed the Advanced PCIe Enclosure Compatible (APEC) Form Factor to OCP in 2024. APEC is an electromechanical interface standard intended to advance the PCIe CEM standard with increased PCIe lane count, bandwidth, power, and management capability for use cases that need more advanced capabilities. This session will go deeper into the progress we have made, including the test methodology and challenges, and our next steps to keep this moving forward. To make this happen, Google has developed end-to-end testing modules to qualify the signals at both the PCIe root complex and the endpoint based on APEC. We will guide you through how the test module was designed, from SI and layout routing considerations toward the goal of test efficiency and automation.
■ This talk traces the evolution of 48V power delivery architectures for datacenter applications, commencing with Google's introduction of a tray-level, two-stage approach at OCP in 2016.
■ Subsequent advancements in topologies and ecosystems have paved the way for collaborative standardization efforts.
■ In 2024, Google, Microsoft, and Meta jointly presented an updated 48V Onboard Power Specification and Qualification Framework, leading to the formation of an OCP workstream aimed at finalizing and implementing comprehensive 48V power module solutions and qualification protocols.
■ This talk will outline critical design principles to mitigate challenges associated with 48V two-stage power delivery, encompassing power failure mechanisms in complex 48V environments, and will explore the challenges of high power density and physical limitations, providing detailed electrical specifications and qualification requirements for data center applications.
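One number that anchors the two-stage discussion is the multiplicative efficiency of cascaded conversion stages (the stage efficiencies below are illustrative assumptions, not figures from the specification):

$$ \eta_\mathrm{total} = \eta_\mathrm{stage1} \times \eta_\mathrm{stage2} $$

For example, a 98% efficient 48 V-to-intermediate-bus stage followed by a 92% efficient point-of-load stage yields about 90.2% end to end, so roughly 98 W of every kilowatt drawn from the 48 V bus is dissipated on the board before reaching the xPU, which is why the qualification framework pays so much attention to stage efficiency, density, and failure mechanisms.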
Wednesday August 6, 2025 1:50pm - 2:10pm PDT TaiNEX2 - 701 G
The dimensions of Intel's next platform have increased compared to the preceding ones, primarily due to the growth in pin count to increase the signal-to-noise ratio for both PCI Express 6.0 and DDR5. This alteration creates difficulties in arranging two processors, each with 16 DDR5 channels, on a standard 19-inch rack. In response to this issue, Intel has embarked on a strategic initiative to accommodate this challenge, which involves a proposal to reduce the distance between DDR5 connectors (a.k.a. DIMM pitch) as well as the processor’s keep-out zone. To increase the DDR routing space underneath the DIMM connector’s pin-field area after shrinking the DIMM-to-DIMM pitch, VIPPO (Via In Pad Plated Over) PCB (Printed Circuit Board) technology is used. These technologies significantly enhance signal quality when embracing the next-generation MCRDIMM (Multiplexer Combined Ranks DIMM).
As the demand for AI compute grows, Cloud Service Providers, System OEMs, and IP/Silicon Providers need an efficient, optimized solution to address scale-up AI challenges. UALink defines a low-latency, high-bandwidth interconnect for communication between accelerators and switches in AI computing pods.
During this session, a UALink panel of experts will highlight how UALink increases performance, power and cost efficiency while introducing supply chain diversity and interoperability to enable next-generation AI/ML applications. Attendees will have the unique opportunity to explore the applications and use cases for UALink technology and ask questions about the future of the UALink ecosystem.
Wiwynn collaborates with Intel through the Open IP program to integrate a 1OU computing server into Intel’s single-phase immersion cooling tank, following the OCP ORv3 standard. The system uses Perstorp’s Synmerse DC synthetic ester coolant to thoroughly evaluate thermal performance under high-power workloads. In this study, CPUs are stressed up to 550W TDP, while researchers examine how variables such as CDU pumping frequency, inlet coolant temperature, and different heatsink types impact cooling effectiveness. Results are compared to those of traditional air cooling systems under similar operating conditions. The goal of this analysis is to optimize immersion cooling approaches, providing valuable insights for improving thermal management in high-performance computing and modern data centers.
AI/HPC solutions are being addressed by heterogeneous 2.5D and 3D packaging. Chiplets and HBM stacks are finding their way into products to realise development more quickly and optimise for the required performance. Until now, most chiplet-based integration has been homogeneous (the same kind of chiplet designed within one company, with only the HBM stack sourced from a third-party vendor), but the industry has started to move to heterogeneous integration (different chiplets from different vendors). Complex package design with differing CTE (Coefficient of Thermal Expansion) becomes a key aspect to address in material selection and design, as the physical phenomena can impact the physical and electrical behaviour of the device and hence final test yield, reliability, and field returns. A well-thought-out "Design For Test" to final ATE test strategy (wafer and package) is required to optimise test cost, performance, and product reliability, as defects in even a single chiplet can lead to costly failures at the system level.
According to the development trend of power consumption and heat dissipation in general servers and AI servers, cooling solutions have evolved from air cooling to hybrid cooling and then to full liquid cooling. In response to this trend, we propose an integrated liquid cooling solution for the building blocks of AI clusters, including the AI IT rack, High Power Cabinet, and Cooling Cabinet.
As system complexity grows, ensuring reliability, power efficiency, and performance is critical. proteanTecs, a leader in electronics monitoring, has integrated its deep data monitoring with Arm’s System Monitoring Control Framework (SMCF), enhancing Arm Neoverse CSS solutions with predictive analytics and lifecycle insights. SMCF offers a modular framework for telemetry, diagnostics, and control. By embedding proteanTecs' in-chip agents and software, the integration boosts system visibility, enabling optimized power/performance, improved reliability, and faster diagnostics. This collaboration empowers semiconductor manufacturers and system operators to meet evolving demands with scalable, architecture-agnostic solutions. The presentation will highlight key applications such as predictive maintenance, defect detection, and power optimization for next-gen high-performance compute environments.
The key focus of this presentation is the safety requirements for liquid cooling systems, particularly regarding pressurized liquid-filled components (LFCs), as addressed in Annex G.15 of IEC 62368-1. By analyzing the construction and testing requirements specified in the standard, this presentation offers insights into designing safe and reliable liquid cooling solutions aimed at mitigating risks associated with leaks, preventing hardware damage, and ensuring global regulatory compliance in AI and ML-driven data centers.
As the power consumption of each high-density AI server rack goes higher and higher, the design of the cabinet can no longer consider only a single AI server rack, but must also take the power cabinet and even the cooling cabinet into consideration. This presentation will introduce a rack architecture that integrates the AI server rack with the power loop and cooling loop.
Wednesday August 6, 2025 2:45pm - 3:00pm PDT TaiNEX2 - 701 G
Enabling Direct Liquid Cooled (DLC) IT solutions in data center environments requires a comprehensive understanding of the facility design, Coolant Distribution Units (CDU), and the IT solutions. There are many interdependencies and design considerations when integrating and commissioning DLC solutions in data center environments. The Open Compute Project (OCP) Community has many workgroups which are addressing various aspects of the DLC solution enablement.
The ORV3 OCP ecosystem currently lacks robust protection for the rack-loaded lifecycle in ship-loadable packaging. This presentation will highlight the innovative packaging solution developed to ensure safe transport of a fully-loaded ORV3 system. We will delve into the design considerations that maintain both rack protection and cost-efficiency. Additionally, we will provide an overview of the extensive testing conducted to validate the system’s resilience and ensure the protection of the rack and equipment from transportation-related impacts.
Wednesday August 6, 2025 3:00pm - 3:15pm PDT TaiNEX2 - 701 G
This study investigates galvanic corrosion in heterogeneous metal materials utilized in cold plate assemblies for single-phase liquid cooling systems. The galvanic corrosion behavior (Tafel plot) of pure copper, stainless steel 304, stainless steel 316, and nickel-based brazing fillers (BNi2 and BNi6) immersed in PG25 working fluid was measured on days 0, 7, and 60. Furthermore, accelerated aeration experiments were conducted on PG25 to assess its chemical stability, and its electrochemical properties were subsequently analyzed after 30 days of aeration using electrochemical methods.
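For readers less familiar with Tafel plots, the underlying relation (standard electrochemistry, not a result of this study) ties the measured overpotential to current density; the corrosion current, and hence the corrosion rate of each metal pair, is read from where the anodic and cathodic branches extrapolate back to the corrosion potential:

$$ \eta = \pm\, b \,\log_{10}\!\left(\frac{i}{i_\mathrm{corr}}\right) $$

where b is the Tafel slope of the respective branch, i is the measured current density, and i_corr is the corrosion current density at the corrosion potential E_corr; tracking how i_corr shifts between day 0, day 7, and day 60 indicates whether galvanic attack on the cold plate materials is accelerating in the PG25 loop.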
This presentation offers a comprehensive overview of key accessories in the ORv3 ecosystem, highlighting two main areas: the 19” adapter and cabling & airflow management solutions. We will introduce essential components, including the 19” adapter rail, cable management arm, blanking panels, side skirts, and side expanders, detailing their design and benefits for the community. Additionally, the session will explore the extensive testing conducted on these accessories. These solutions are crucial for modern data centres, offering flexible, efficient, and organized approaches to infrastructure management.
Wednesday August 6, 2025 3:15pm - 3:30pm PDT TaiNEX2 - 701 G
Exploring liquid-cooled bus bars addresses the increasing power demands in modern data centers, particularly those exceeding 150kW per rack with AI and HPC workloads. Traditional bus bar designs struggle with current limitations, hindering efficient power management. Liquid-cooled bus bars integrate cooling channels to enhance heat dissipation, maintaining optimal temperatures and improving system safety and reliability. This approach mitigates thermal runaway risks and ensures compliance with industry standards, while boosting efficiency by minimizing energy losses associated with high current densities. Implementing liquid-cooled bus bars signifies a significant advancement in data center infrastructure, enabling higher power densities, superior thermal management, and overall improved performance.
We cool data centers in a very energy-efficient way, and we recover and reuse the excess heat produced within the data centers. This is what we consider green digitalization!
ChatGPT marked AI's watershed moment, triggering a tectonic shift in IT infrastructure and a race of extraordinary, lasting commitments to the AI Factory. Many governments and enterprises alike are making enormous capital and people investments so as not to be left behind in the AI boom. Corporate boardrooms are evaluating purposeful infrastructure plans. What is the best architectural decision - retrofitting, building from scratch, or adopting a wait-and-see approach? This fork in the road has given pause and decision paralysis to some infrastructure decision makers. Our talk examines the AI Factory Spectrum to identify solutions that advance the infrastructure challenge sustainably.
This study explores the long-term stability of immersion cooling fluids through accelerated aging experiments designed to emulate more severe operational conditions. As immersion cooling becomes a vital solution in high-performance and data-intensive systems, understanding fluid deterioration behavior under thermal and metal-induced decay is essential for ensuring system reliability. By subjecting the fluids to sustained thermal stress over time in the presence of metal, we continuously monitor key aging indicators such as flash point decrease, dielectric constant and tangent loss shift, viscosity change, acid number increase, and oxide accumulation. These metrics are then used to construct predictive models that define a fluid's "stability window" under real-world conditions. The resulting approach enables manufacturers and system integrators to determine quality assurance periods more accurately, facilitating better maintenance planning and formulation design.
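Accelerated-aging data of this kind are commonly extrapolated to service conditions with an Arrhenius-type acceleration factor (a generic model stated here for context, not necessarily the predictive model used in this study):

$$ AF = \exp\!\left[\frac{E_a}{k_B}\left(\frac{1}{T_\mathrm{use}} - \frac{1}{T_\mathrm{stress}}\right)\right] $$

where E_a is the activation energy of the dominant degradation reaction, k_B is Boltzmann's constant, and temperatures are in kelvin; for instance, with E_a = 0.7 eV, stressing at 85 °C corresponds to roughly 17 times faster aging than service at 45 °C, which is how a short high-temperature test can stand in for a much longer stability window.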