Mohamad El-Batal is the Seagate Enterprise Data Solutions (EDS) Chief Technologist and part of the overall Seagate Office of the CTO team. He shapes the Seagate EDS strategy and future storage product technology roadmap. In his career Mohamad led a team...
The explosive growth of internet content, and the data storage and computation it requires, has driven the deployment of heterogeneous and complex solutions in very-large-scale data centers: warehouse-sized buildings packed with server, storage, and network hardware. The CXL ecosystem behind these deployments comprises a multitude of component vendors spanning SoC, memory, storage, networking, and more. This session highlights the CXL RAS Firmware First error-handling use cases we implemented on top of the CXL specification, which is generic in nature. While developing Firmware First support, we encountered a range of use cases, driven by both the specification and customer problems, whose resolution required dedicated firmware design and implementation: ■ Use of the primary and secondary mailboxes to support IHV early-adoption engineering and debug efforts. ■ Use of common error-signaling protocols for protocol errors. ■ GUID and UUID usage in firmware on the CPU/host side, on CXL devices, and in the operating system. ■ Communication failures between the CPU and CXL devices at boot time and run time that must be reported to users and the operating system. ■ Techniques used to improve SMI latency. ■ Error-pollution handling for protocol errors, Memory Error Firmware Notification (MEFN), and Flat2LM configurations. Particular attention is given to uncorrected fatal errors detected by hardware that pose a containment risk: the system must be reset and restarted, where possible, to enable continued operation. When such an error affects an entire CXL device, a persistent/permanent memory device is considered to have experienced a dirty shutdown.
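The GUID-based agreement between firmware and the operating system can be sketched as a dispatch table keyed by section-type GUID: firmware tags each error record with a GUID, and the OS routes it to the matching handler. This is a minimal OS-side illustration; the GUID values and handler names below are hypothetical placeholders, not the real CPER section types defined in the UEFI specification.

```python
import uuid

# Hypothetical section-type GUIDs for illustration only; real CPER
# section-type GUIDs are defined in the UEFI specification.
CXL_PROTOCOL_ERR = uuid.UUID("00000000-0000-0000-0000-000000000001")
CXL_COMPONENT_ERR = uuid.UUID("00000000-0000-0000-0000-000000000002")

def handle_protocol_error(payload: bytes) -> str:
    # A real OS handler would decode the CXL error record here.
    return "protocol"

def handle_component_error(payload: bytes) -> str:
    return "component"

# Firmware and OS agree on the GUIDs, so dispatch is a table lookup.
HANDLERS = {
    CXL_PROTOCOL_ERR: handle_protocol_error,
    CXL_COMPONENT_ERR: handle_component_error,
}

def dispatch_error_record(section_guid: uuid.UUID, payload: bytes) -> str:
    handler = HANDLERS.get(section_guid)
    if handler is None:
        return "unknown"  # unrecognized section type: log and continue
    return handler(payload)
```

Because both sides key off the same GUIDs, a new error type can be added by registering one more table entry, without changing the dispatch path.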
This session will share our practical experience in AI data center architecture planning, covering the critical interplay of computing, storage, and management. We'll delve into our comprehensive approach to designing scalable and efficient AI infrastructure.
A significant portion of our discussion will be dedicated to our innovative storage architecture, which addresses diverse AI workload demands through a dual-part strategy. First, we will present our high-performance storage solution, leveraging BeeGFS and GRAID's cutting-edge technologies with NVMe SSDs to meet the demands of intense AI computation. Second, we will explore our approach to tenant object storage, specifically utilizing GRAID's SupremeRAID alongside NVMe SSDs to provide robust and scalable data management for various user requirements.
AI training and inference workloads rely heavily on efficient I/O operations. This presentation will explore how SSDs are architected to deliver superior performance across the diverse I/O characteristics inherent in these AI processes. We'll demonstrate the advantages of NVMe SSDs in both training and inference environments, highlighting their strengths in handling various I/O patterns. Furthermore, we will delve into the critical aspect of checkpointing performance in AI pipelines, specifically showcasing how Flexible Data Placement (FDP) significantly enhances checkpointing efficiency by mitigating bandwidth limitations and reducing contention in shared storage systems.
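The contention reduction FDP provides comes from steering each checkpoint stream to its own placement unit, so streams do not interleave on the media and dropping an old checkpoint frees whole units. A toy model of that idea, assuming a hypothetical `write_with_placement` interface standing in for a real NVMe FDP driver path:

```python
from collections import defaultdict

class FdpDevice:
    """Toy model: data tagged with the same placement_id lands in the
    same reclaim unit, so deleting one checkpoint frees whole units
    instead of fragmenting blocks shared with other tenants."""
    def __init__(self):
        self.reclaim_units = defaultdict(list)

    def write_with_placement(self, placement_id: int, data: bytes) -> None:
        # Hypothetical stand-in for an NVMe FDP placement directive.
        self.reclaim_units[placement_id].append(data)

    def invalidate(self, placement_id: int) -> int:
        # Dropping an old checkpoint frees its units with no copy-back.
        return len(self.reclaim_units.pop(placement_id, []))

dev = FdpDevice()
# Each training job writes its checkpoint under its own placement id.
for chunk in (b"layer0", b"layer1"):
    dev.write_with_placement(placement_id=7, data=chunk)
dev.write_with_placement(placement_id=9, data=b"other-job")

freed = dev.invalidate(7)  # old checkpoint gone, other job untouched
```

Without placement separation, both jobs' data would share blocks, and reclaiming one checkpoint would force garbage-collection copy-back of the other's data; the point of the sketch is that per-stream placement avoids exactly that.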
As AI models and data analytics workloads continue to scale, memory bandwidth and capacity have become critical bottlenecks in modern data centers. CXL provides high-capacity, low-latency memory expansion that can be leveraged in different usage models. CXL memory expansion and pooling can significantly enhance SQL workload performance and reduce cloud TCO, particularly for in-memory databases and analytics workloads that are bandwidth- and capacity-constrained. Offloading the key-value (KV) cache to Compute Express Link (CXL) memory is also emerging as an effective strategy to tackle memory bottlenecks and improve throughput in large language model (LLM) inference serving, since the KV cache is critical for efficient autoregressive generation in LLMs.
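The KV-cache offload pattern can be sketched as a two-tier cache: a small, fast "DRAM" tier holds hot entries, and overflow spills to a larger "CXL" tier instead of being evicted, so long contexts can be reused without recomputation. This is a minimal sketch with plain dictionaries standing in for the two memory tiers; the class and its sizing are illustrative assumptions, not a real inference server's design.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier KV cache: a bounded 'DRAM' tier for hot entries,
    with LRU overflow demoted to a larger 'CXL' tier rather than
    dropped, so a hit there avoids recomputing attention state."""
    def __init__(self, dram_capacity: int):
        self.dram = OrderedDict()   # hot tier (fast, small)
        self.cxl = {}               # cold tier (larger, slower)
        self.dram_capacity = dram_capacity

    def put(self, token_pos: int, kv: bytes) -> None:
        self.dram[token_pos] = kv
        self.dram.move_to_end(token_pos)
        while len(self.dram) > self.dram_capacity:
            cold_pos, cold_kv = self.dram.popitem(last=False)
            self.cxl[cold_pos] = cold_kv   # offload, don't recompute

    def get(self, token_pos: int):
        if token_pos in self.dram:
            self.dram.move_to_end(token_pos)
            return self.dram[token_pos]
        if token_pos in self.cxl:
            kv = self.cxl.pop(token_pos)   # promote back to hot tier
            self.put(token_pos, kv)
            return kv
        return None                        # true miss: recompute needed

cache = TieredKVCache(dram_capacity=2)
for pos in range(4):
    cache.put(pos, b"kv%d" % pos)
# Positions 0 and 1 have been demoted to the CXL tier, not lost.
```

The design choice the sketch illustrates: because the cold tier is still byte-addressable memory, a "miss" in DRAM costs one slower load rather than a full prefill recompute.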
AI workloads are reshaping the architecture and demands of modern data centers, calling for high-performance, scalable, and energy-efficient infrastructure. This presentation explores how AI-driven transformation is impacting data center design and operations, and highlights how Delta leverages its expertise in power and thermal solutions to meet these demands. Delta’s integrated systems play a crucial role in ensuring reliable, intelligent, and sustainable operations in the age of AI.
Liteon will share its latest advancements in power solutions for AI infrastructure, focusing on high-efficiency, high-density designs for GPU-centric systems. This session will explore how Liteon's integrated architectures support scalable deployment in modern data centers, addressing the growing demands of performance and energy optimization.
This presentation outlines the evolving requirements and technical considerations for next-generation Open Rack V3 (ORv3) Power Supply Units (PSUs) and power shelves, with a focus on the transition from ORv3 to High Power Rack (HPR) and HPR2 architectures. It highlights significant advancements, such as increased power density from 33 kW to 72 kW and enhanced support for AI-driven pulse-load demand. An HVDC architecture is also introduced as a quickly adoptable way to relieve the bus bar challenge as power demand from AI continues to increase.
The shift to +/-400V DC power systems is crucial to meet the rising power demands of AI/ML applications, supporting rack densities of >140 kW. This transition introduces significant challenges for power distribution within datacenters. Critical components like bus bars, connectors, and cables must meet stringent requirements for power handling, thermal management, reliability, safety, and density. This paper explores design solutions for electromechanical interconnects in these high-power environments, drawing parallels with mature ecosystems in industries like Electric Vehicles. Innovative approaches to bus bar design and connector technology offer the performance and space savings needed for next-gen AI/ML infrastructure. The discussion addresses crucial safety aspects, including arc flash mitigation, insulation systems, and touch-safe designs. By overcoming these challenges, the industry can accelerate the transition to higher voltages, unlocking AI/ML platforms' full potential.
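The case for the higher voltage is straightforward Ohm's-law arithmetic: for the same rack power, current falls in proportion to voltage, and conduction loss falls with the square of current. A back-of-the-envelope comparison for a 140 kW rack (the 48 V baseline is chosen for illustration):

```python
rack_power_w = 140_000           # >140 kW rack from the abstract

# Legacy 48 V distribution vs +/-400 V (800 V pole-to-pole).
i_48v = rack_power_w / 48        # ~2917 A
i_800v = rack_power_w / 800      # 175 A

# Conduction loss is I^2 * R, so for the same bus bar resistance the
# 800 V system dissipates (2917/175)^2 ~ 278x less in the copper.
loss_ratio = (i_48v / i_800v) ** 2
```

This is why bus bars, connectors, and cables sized for lower-voltage racks cannot simply be reused: the interconnect problem shifts from carrying extreme current to managing insulation, creepage, and arc-flash safety at 800 V.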
Trans-Inductor Voltage Regulator (TLVR) Technology is an onboard xPU power delivery solution proposed by Google at IEEE APEC 2020. ■ TLVR is an innovative fast-transient onboard voltage regulator (VR) solution for xPUs. This VR topology provides increased VR bandwidth, faster transient response, and a potential reduction in decoupling capacitors. ■ TLVR has been widely used in recent years since it offers good transient performance with reduced equivalent output transient inductance. However, existing TLVR has not been optimized for power efficiency and density. ■ One limitation is that each trans-inductor must be designed for the peak load current in terms of magnetic core saturation. ■ Zero Bias TLVR was introduced to address this limitation. It moves one phase from the primary side to the secondary side. ■ By doing so, the secondary-side phase is able to drive the TLVR secondary winding with equal magnitude and opposite direction to the primary winding current, for both DC and transient conditions.
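The core-saturation relief that Zero Bias TLVR claims can be seen in a toy ampere-turn balance: in conventional TLVR each trans-inductor core carries the full per-phase DC bias, while the secondary-side phase's equal-and-opposite drive cancels it. The phase count, load current, and 1:1 turns ratio below are illustrative assumptions, not figures from the talk.

```python
# Toy ampere-turn balance per trans-inductor core (1:1 turns assumed).
n_phases = 6
load_current_a = 600.0
i_primary_dc = load_current_a / n_phases       # 100 A DC per phase

# Conventional TLVR: each core sees the full primary DC bias, so the
# magnetics must be sized for peak load current without saturating.
conventional_bias_at = i_primary_dc * 1        # ampere-turns

# Zero Bias TLVR: the secondary-side phase drives the secondary winding
# with equal magnitude and opposite direction, so net DC bias cancels.
i_secondary_dc = -i_primary_dc
zero_bias_at = i_primary_dc + i_secondary_dc   # 0 ampere-turns
```

With the DC bias cancelled, the core no longer has to be sized for peak-load saturation, which is the efficiency/density headroom the abstract refers to.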
As AI workloads push rack power demands well beyond the ~30 kW limits of Open Rack v3, the industry has defined a High-Power Rack (HPR) standard that delivers over 200 kW per rack. This talk explains how liquid-cooled vertical busbars integrate coolant channels around copper conductors to dramatically improve heat removal and reduce I²R losses, all while fitting into existing ORv3 form factors. It also covers modular power-whip assemblies for simplified maintenance, upgraded high-voltage PSUs and battery backup units for resilience, and how OCP member companies collaborate on safety, interoperability, and scalability. Together, these innovations form an end-to-end ecosystem enabling next-generation AI data centers to meet extreme power, thermal, and reliability requirements.
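Part of the I²R reduction from liquid-cooled busbars comes from copper's positive temperature coefficient: a cooler conductor has lower resistance at the same current. A quick estimate, with the two operating temperatures chosen purely for illustration:

```python
alpha_cu = 0.00393        # copper temperature coefficient, per degC
r20 = 1.0                 # normalized bus bar resistance at 20 degC

def r_at(temp_c: float) -> float:
    # Linear resistance-vs-temperature model for copper.
    return r20 * (1 + alpha_cu * (temp_c - 20))

# Illustrative operating points: an air-cooled bar running hot vs a
# liquid-cooled bar held near coolant temperature.
r_hot = r_at(90)
r_cool = r_at(45)
loss_saving = 1 - r_cool / r_hot   # same current -> ~14% lower I^2*R loss
```

The resistance effect compounds with the direct thermal benefit: removing heat at the conductor also lets the same busbar cross-section carry more current within its temperature limit.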
As AI and ML power demands increase, driving rack power levels to 140 kW and necessitating higher voltages like +/-400V DC, optimizing bus bar systems becomes crucial for efficient, reliable power delivery. Bus bars, ideal for high-current applications, face unique challenges in high-density AI/ML racks, including thermal management, space optimization, structural rigidity, and safety. This paper explores advanced design techniques for future AI/ML power architectures, covering material selection (e.g., copper, aluminum), cross-section optimization, insulation strategies, and termination methods. Thermal and mechanical simulations ensure performance and durability. Critical safety features, such as touch protection and creepage distances, are integrated. These solutions aim to develop robust power infrastructure for next-gen AI/ML data centers.
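The cross-section optimization mentioned above reduces, at first order, to Ohm's law plus material resistivity: pick a voltage-drop budget and solve for the copper area. A sketch with illustrative numbers (the current, run length, and drop budget are assumptions, not values from the paper):

```python
rho_cu = 1.72e-8        # copper resistivity at 20 degC, ohm*m
current_a = 1500.0      # illustrative branch current
length_m = 2.0          # bus bar run length
v_drop_budget = 0.5     # allowed voltage drop over the run, volts

# R = rho * L / A and V = I * R, so the minimum cross-section is
# A = rho * L * I / V.
area_m2 = rho_cu * length_m * current_a / v_drop_budget
area_mm2 = area_m2 * 1e6   # ~103 mm^2 of copper
```

A real design would then iterate on this first-order answer with the thermal and mechanical simulations the abstract describes, since skin effect, ambient temperature, and allowable surface temperature all push the required area up.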
■ This talk traces the evolution of 48V power delivery architectures for datacenter applications, beginning with Google's introduction of a tray-level, two-stage approach at OCP in 2016. ■ Subsequent advancements in topologies and ecosystems have paved the way for collaborative standardization efforts. ■ In 2024, Google, Microsoft, and Meta jointly presented an updated 48V Onboard Power Specification and Qualification Framework, leading to the formation of an OCP workstream aimed at finalizing and implementing comprehensive 48V power module solutions and qualification protocols. ■ This talk will outline critical design principles to mitigate the challenges associated with 48V two-stage power delivery, encompassing power failure mechanisms in complex 48V environments; explore the challenges of high power density and physical limitations; and provide detailed electrical specifications and qualification requirements for data center applications.
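In a two-stage 48V architecture, overall efficiency is the product of the stage efficiencies, so losses compound: even two good converters in series dissipate noticeably more than either alone. A quick illustration (the per-stage efficiencies are assumed round numbers, not figures from the specification):

```python
eta_stage1 = 0.97    # 48 V -> intermediate bus (illustrative)
eta_stage2 = 0.92    # intermediate bus -> xPU core voltage (illustrative)
eta_total = eta_stage1 * eta_stage2   # ~0.892 end to end

core_power_w = 1000.0
input_power_w = core_power_w / eta_total
loss_w = input_power_w - core_power_w  # ~121 W lost in conversion per kW
```

This compounding is one reason the workstream's qualification framework scrutinizes each stage's efficiency curve across load: a percentage point recovered in either stage is a percentage point off the total.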
In this presentation, we will look at the requirements of next-generation higher-power ORv3 power supplies and HVDC power shelves, which will help increase rack payload and power density yet again, while supporting key design requirements ranging from hot-swappability to battery backup. Among the topics covered during the session will be an update on key design specifications and design considerations, as well as the most recent ORv3 technologies, including power supplies, power shelves, shelf controllers, and battery backup solutions. We will also explore the next-generation Rack and Power roadmaps.
As the power consumption of each high-density AI server rack climbs higher and higher, cabinet design can no longer consider a single AI server rack in isolation; it must also take the power cabinet, and even the cooling cabinet, into account. This presentation will introduce a rack architecture that integrates the AI server rack with the power loop and the cooling loop.
The ORV3 OCP ecosystem currently lacks robust protection for the rack-loaded lifecycle in ship-loadable packaging. This presentation will highlight the innovative packaging solution developed to ensure safe transport of a fully-loaded ORV3 system. We will delve into the design considerations that maintain both rack protection and cost-efficiency. Additionally, we will provide an overview of the extensive testing conducted to validate the system’s resilience and ensure the protection of the rack and equipment from transportation-related impacts.
This presentation offers a comprehensive overview of key accessories in the ORv3 ecosystem, highlighting two main areas: the 19” adapter and cabling & airflow management solutions. We will introduce essential components, including the 19” adapter rail, cable management arm, blanking panels, side skirts, and side expanders, detailing their design and benefits for the community. Additionally, the session will explore the extensive testing conducted on these accessories. These solutions are crucial for modern data centres, offering flexible, efficient, and organized approaches to infrastructure management.