As data traffic continues to surge across AI networks, the need for higher bandwidth and efficient signal connectivity is critical. With the 200G/lambda generation well on its way to production, focus is quickly shifting to the next step, 400G/lane SerDes, which represents a significant leap in interconnect performance. This advancement enables interconnects that reach 3.2 Tbps and beyond by aggregating fewer, faster lanes, while balancing cost, power consumption, and footprint per bit. In this presentation, we examine high-speed protocols such as Ethernet, UALink, and Ultra Ethernet, exploring the first use cases where 400G/lane SerDes is likely to be deployed. We will take a closer look at the candidate modulation formats, along with their benefits and challenges. Special attention will be given to the adoption of optical connectivity. We aim to provide a comprehensive overview of the options available and justify their use in modern cloud service architectures.
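As a minimal sketch of the lane-aggregation arithmetic behind the 3.2 Tbps figure (the baud rates and modulation choices below are assumed round numbers for illustration, not figures from the presentation):

```python
# Illustrative sketch: how per-lane rate and lane count combine into
# aggregate interconnect bandwidth, and why fewer, faster lanes still
# reach 3.2 Tbps. Numbers are assumptions for illustration only.

def lane_rate_gbps(baud_gbd: float, bits_per_symbol: float) -> float:
    """Raw per-lane data rate before FEC/encoding overhead."""
    return baud_gbd * bits_per_symbol

def aggregate_tbps(lane_gbps: float, lanes: int) -> float:
    """Total link bandwidth in Tbps for a given lane rate and lane count."""
    return lane_gbps * lanes / 1000.0

# 200G/lane today vs. a 400G/lane next step
print(aggregate_tbps(200, 16))  # 3.2 Tbps with 16 lanes
print(aggregate_tbps(400, 8))   # 3.2 Tbps with half the lanes

# PAM4 carries 2 bits/symbol, so 400G/lane implies roughly 200 GBd of
# raw signaling; higher-order modulation trades baud rate for SNR margin.
print(lane_rate_gbps(200, 2))   # ~400 Gbps raw with PAM4
```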
The rapid growth of AI chips has driven up computational demands, and future high-performance computing (HPC) systems are expected to integrate multiple high-power chips, resulting in total power consumption exceeding 2.5 kW and individual chip power densities above 200 W/cm². To meet these challenges, advanced cooling technologies are essential to lower thermal resistance and dissipate heat efficiently. In this paper, we explore innovative structural designs for cold plates that address critical thermal management challenges in next-generation AI systems, together with a corresponding thermal test vehicle capable of generating a range of power densities.
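A minimal sketch of why lower thermal resistance matters at these power densities, assuming a simple steady-state model where the temperature rise is heat load times thermal resistance (all numeric values are illustrative, not from the paper):

```python
# Steady-state thermal sketch (assumed model, illustrative numbers):
# temperature rise from coolant to junction = Q * R_theta.

def junction_rise_c(power_density_w_cm2: float, area_cm2: float,
                    r_theta_c_per_w: float) -> float:
    """Coolant-to-junction temperature rise in degrees C."""
    q_watts = power_density_w_cm2 * area_cm2
    return q_watts * r_theta_c_per_w

# A 200 W/cm^2 die of 4 cm^2 dissipates 800 W; to keep the rise under
# 40 C, the cold plate + interface stack must stay below 0.05 C/W.
print(junction_rise_c(200, 4.0, 0.05))   # 40.0 C
print(junction_rise_c(200, 4.0, 0.10))   # 80.0 C -- too hot; lower R_theta
```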
Recent advances in large-scale AI models have placed increasing pressure on the underlying compute architecture to deliver not only raw performance but also programmability and efficiency at scale. This talk introduces the Tensor Contraction Processor (TCP), a novel architecture that makes tensor contraction the central computational primitive, enabling a broader class of operations than traditional matrix multiplication. We will present the motivation behind this architectural shift, its implications for compiler design and runtime scheduling, and results on performance and energy efficiency. The discussion will also explore how exposing tensor contraction at the hardware level opens opportunities for more expressive and flexible execution strategies, potentially reducing data movement and improving utilization. We will share key lessons from scaling the chip across servers and racks, highlight intersections with relevant OCP Project areas, and discuss how these insights are informing our product roadmap.
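To illustrate the idea that tensor contraction subsumes matrix multiplication, here is a small NumPy sketch of the concept; it reflects the general notion of contraction, not the TCP's actual programming interface or instruction set:

```python
# Illustrative only: tensor contraction as a primitive that generalizes
# matrix multiplication. NumPy einsum stands in for the hardware primitive.
import numpy as np

A = np.random.rand(4, 8)        # matrix
B = np.random.rand(8, 5)
W = np.random.rand(3, 8, 6)     # higher-order tensor

# Ordinary matmul is the special case of contracting one shared index.
matmul = np.einsum("ik,kj->ij", A, B)
assert np.allclose(matmul, A @ B)

# The same primitive expresses richer contractions -- e.g. contracting A
# against one mode of a 3-D tensor W -- without lowering everything to a
# sequence of matmuls, which is where reduced data movement can come from.
richer = np.einsum("ik,bkm->bim", A, W)   # shape (3, 4, 6)
print(matmul.shape, richer.shape)
```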