AI workloads demand unprecedented levels of bandwidth, low latency, and deterministic communication across increasingly dense compute infrastructures. This work focuses on emerging network architectures tailored for AI servers, racks, and clusters—highlighting trends such as high-radix topologies, RDMA over converged Ethernet (RoCE), optical interconnects, and in-network compute. It examines how networking shapes system performance, scalability, and efficiency, and outlines architectural strategies to address bottlenecks in collective communication, model parallelism, and distributed training at hyperscale.
IOWN Technology Director, IOWN Development Office, R&D Planning Department, NTT
Masahisa Kawashima is currently leading NTT's R&D of Innovative Optical and Wireless Network (IOWN) as the IOWN Technology Director. He is also serving as the Chair of the Technology Working Group at IOWN Global Forum. He has been working as a bridge between technologies and businesses...
Mohamad El-Batal is the Seagate Enterprise Data Solutions (EDS) Chief Technologist, and part of the overall Seagate Office of the CTO team. He is given the opportunity to shape the Seagate EDS strategy and future storage product technology roadmap. In his career, Mohamad led a team...
RoCEv2 is being widely deployed due to the emerging GenAI trend, and there is a growing need to mix AI and HPC workloads to maximize infrastructure investment efficiency. RoCEv2, developed decades ago for simpler workloads, is starting to show its limitations in hyperscale deployments, which has led to the formation of the Ultra Ethernet Consortium (UEC).
MUFG Bank and NTTDATA worked with the IOWN Global Forum to create use cases and Reference Implementation Models demonstrating how APN and Optical DC networks can transform financial systems. A white paper was published detailing these innovations and their applications. MUFG Bank, NTTDATA, and NTT West tested their ideas through pilot experiments, yielding two major outcomes: real-time database synchronization reduces reliance on complex backup frameworks, and virtual instances enable seamless Data Center transitions without downtime, enhancing efficiency. This allows financial institutions to meet evolving customer demands more effectively. Research is ongoing to integrate Optical Network technology with OCP hardware to address latency, bandwidth, and prioritization issues for NIC-to-NIC communication. By combining software, hardware, and networks, NTTDATA aims to create smarter, more resilient financial systems.
FBOSS is Meta’s own Software Stack for managing Network Switches deployed in Meta’s data centers. It is one of the largest services in Meta in terms of the number of instances deployed.
Network Traffic in AI Fabric presents unique challenges such as “elephant flows” (a small number of extremely large, continuous flows), and low entropy (limited variation in flow characteristics, increasing likelihood of hash collisions).
At OCP 2024, we showcased how we evolved FBOSS to tackle these challenges. This solution is capable of building non-blocking clusters for up to 4K GPUs. However, generative AI use cases demand significantly larger non-blocking clusters. This can be solved by interconnecting multiple 4K GPU clusters into a single, larger cluster using traditional Routing and ECMP. In this design, intra-cluster traffic benefits from non-blocking I/O, but inter-cluster traffic continues to suffer from poor network performance due to the aforementioned elephant flows and low entropy.
In this talk, we will share our journey evolving FBOSS for generative AI workloads. We will discuss the hierarchical design that enables us to build significantly larger non-blocking clusters, the unique challenges we encountered in scaling both the dataplane and control plane, and the solutions we developed to overcome them. Additionally, we will highlight the SAI enhancements that were instrumental in adapting FBOSS to support the demands of generative AI.
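The low-entropy problem described above can be sketched in a few lines. The flow tuples, sizes, and link count below are illustrative assumptions, not Meta's actual traffic; the point is that with only a handful of elephant flows, hash-based ECMP can leave some links idle while others carry multiple elephants.

```python
from collections import Counter

def ecmp_pick(flow, n_links):
    # Hash the flow tuple to choose an egress link, as ECMP does.
    return hash(flow) % n_links

def link_loads(flows, n_links):
    """Bytes landing on each link after ECMP placement of each flow."""
    loads = Counter()
    for flow, nbytes in flows:
        loads[ecmp_pick(flow, n_links)] += nbytes
    return [loads.get(i, 0) for i in range(n_links)]

# Low entropy: only 4 elephant flows across 8 links. At most 4 links
# carry traffic; a hash collision stacks two elephants on one link.
elephants = [(("10.0.0.%d" % i, "10.0.1.%d" % i, 4791, 4791, "udp"), 10**9)
             for i in range(4)]
print(link_loads(elephants, 8))
```

With thousands of short flows the same hash spreads load evenly; with a few large flows the variance, and thus the chance of a congested link, is high.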
This session highlights CXL RAS Firmware First error-handling use cases implemented against the CXL specification, which is generic in nature. While developing Firmware First support, we encountered a variety of use cases that had to be solved in firmware design and implementation to address both specification gaps and customer problems. Topics include: use of the primary and secondary mailboxes to support IHV early adoption during engineering and debug; use of common error-signaling protocols for protocol errors; GUID and UUID usage in firmware on the CPU/host side, in CXL devices, and in the operating system; notifying users and the operating system of communication failures between the CPU and CXL devices at boot time and at run time; techniques for improving SMI latency; and handling error-pollution cases for protocol errors, Memory Error Firmware Notification (MEFN), and Flat2LM configurations. The CXL ecosystem comprises a multitude of component vendors, including SoC, memory, storage, and networking suppliers. The explosive growth of internet content, and the resulting data storage and computation requirements, has driven the deployment of heterogeneous and complex solutions in very large-scale data centers: warehouse-sized buildings packed with server, storage, and network hardware. In such deployments, an uncorrected fatal error detected by hardware that poses a containment risk requires the system to be reset and restarted, where possible, to enable continued operation. When such an error affects an entire CXL device, a persistent/permanent memory device is considered to have experienced a dirty shutdown.
As AI/ML clusters continue to scale, breaching the boundaries of single physical locations in both size and power, the need to scale across and interconnect multiple locations becomes ever more crucial.
When these use cases are extended to interconnected locations, several considerations must be met:
- High bandwidth used effectively between geographically dispersed locations over varying distances
- Support for lossless RDMA traffic
- A simple and condensed interconnection layer
The presentation will focus on how Broadcom’s Jericho product line allows for the implementation of such needs with innovations throughout the stack - from physical connectivity and all the way to intelligent load-balancing.
This presentation will focus on an innovative dynamic Explicit Congestion Notification (ECN) threshold testing methodology, emphasizing the design rationale for test cases and the observational analysis of experimental results. We will explore how designed test cases trigger ECN threshold changes in dynamic network environments, ensuring comprehensive and effective testing.
A key insight from our research is the critical role of qp-fairness (Queue Pair fairness) in collective benchmarking, alongside traditional metrics like algorithmic bandwidth and bus bandwidth. Through comparative analysis of real-world test data, we demonstrate how maintaining qp-fairness under dynamic conditions significantly enhances the stability of ECN mechanisms and ensures equitable allocation of network resources. By aligning theoretical insights with practical implementations, we hope to provide actionable insights for advancing research and applications in dynamic ECN technologies.
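As a rough illustration of the qp-fairness metric discussed above, one common way to quantify fairness across Queue Pairs is Jain's fairness index; the throughput numbers below are illustrative, not measured data from our tests.

```python
def jain_fairness(throughputs):
    """Jain's fairness index over per-QP throughputs.

    Returns 1.0 when all QPs get equal share, approaching 1/n when a
    single QP hogs all the bandwidth.
    """
    n = len(throughputs)
    s = sum(throughputs)
    sq = sum(x * x for x in throughputs)
    return (s * s) / (n * sq) if sq else 1.0

fair = [100.0, 100.0, 100.0, 100.0]    # equal per-QP throughput (Gbps)
unfair = [370.0, 10.0, 10.0, 10.0]     # one QP dominates after ECN misbehavior
print(jain_fairness(fair))    # 1.0
print(jain_fairness(unfair))  # well below 1.0 (~0.29)
```

Tracking an index like this alongside algorithmic and bus bandwidth makes regressions in per-QP equity visible even when aggregate throughput looks healthy.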
This session will share our practical experience in AI data center architecture planning, covering the critical interplay of computing, storage, and management. We'll delve into our comprehensive approach to designing scalable and efficient AI infrastructure.
A significant portion of our discussion will be dedicated to our innovative storage architecture, which addresses diverse AI workload demands through a dual-part strategy. First, we will present our high-performance storage solution, leveraging BeeGFS and GRAID's cutting-edge technologies with NVMe SSDs to meet the demands of intense AI computation. Second, we will explore our approach to tenant object storage, specifically utilizing GRAID's SupremeRAID alongside NVMe SSDs to provide robust and scalable data management for various user requirements.
NTT is considering Data-Centric Infrastructure (DCI) using IOWN technologies. DCI processes data efficiently by combining geographically dispersed resources. To achieve this, we’re verifying Composable Disaggregated Infrastructure (CDI) – a flexible hardware solution – and considering a multi-vendor approach. CDI consists of servers, PCIe expansion boxes, and switches, enabling software-controlled allocation of accelerators for optimal performance. Utilizing multi-vendor CDI requires an interface like OFA Sunfish to reduce operational costs. Our verification has revealed challenges in the physical operation of CDI and implementing a multi-vendor configuration. These include increased cabling costs, racking limitations, and inconsistencies in product functionalities and procedures requiring careful configuration management. This session will share these challenges and proposed solutions.
The proposed Layer 2 transparent network, bridging VM and container networks, is a software-defined network for AI service deployment. The cloud provider offers a tenant-aware, transparent network that combines VM and container networks into the same network domain. The benefits of this network are a full Layer 2 network and reduced communication overhead in a multi-tenant cloud system. A tenant can deploy its services in both VMs and containers, and communication among those VMs and containers stays within the same Layer 2 domain, reducing routing effort while isolating network traffic between tenants.
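A minimal sketch of the tenant-isolation behavior described above: each tenant's VMs and containers share one broadcast domain, and cross-tenant forwarding is refused. The endpoint names and the dictionary mapping are hypothetical, purely for illustration.

```python
# Model of a tenant-aware L2 domain: VMs and containers of the same
# tenant share one broadcast domain (e.g. one VLAN/VNI per tenant);
# traffic between tenants is dropped. All names are illustrative.
tenant_domains = {
    "tenant-a": {"vm-a1", "ctr-a1", "ctr-a2"},   # VMs and containers mixed
    "tenant-b": {"vm-b1", "ctr-b1"},
}

def same_l2_domain(src, dst):
    """Forward only if src and dst belong to the same tenant's L2 domain."""
    return any(src in members and dst in members
               for members in tenant_domains.values())

print(same_l2_domain("vm-a1", "ctr-a2"))  # True: same tenant, no routing hop
print(same_l2_domain("vm-a1", "ctr-b1"))  # False: tenants stay isolated
```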
Decades-old copper and optical interconnect technologies limit AI cluster compute efficiency. The presentation will showcase e-Tube Technology - RF data transmission over plastic waveguide - and how it breaks the barriers of these legacy technologies by providing near-zero latency and 3x better energy efficiency than optics at a cost structure similar to copper. e-Tube is an ideal replacement for copper for terabit interconnect to scale up next-generation AI clusters.
The demands of AI training and inference workloads heavily rely on efficient I/O operations. This presentation will explore how SSDs are architected to deliver superior performance across diverse I/O characteristics inherent in these AI processes. We'll demonstrate the advantages of NVMe SSDs in both training and inference environments, highlighting their strengths in handling various I/O patterns. Furthermore, we will delve into the critical aspect of checkpointing performance in AI pipelines, specifically showcasing how FDP significantly enhances checkpointing efficiency by mitigating bandwidth limitations and reducing contention in shared storage systems.
As data traffic continues to surge across AI networks, the need for higher bandwidth and efficient signal connectivity is critical. With the 200G/lambda generation well on its way to production, focus is quickly moving to the next 400G/lane SerDes, which represents a significant leap in interconnect performance. This advancement enables interconnects capable of reaching 3.2 Tbps and beyond by aggregating fewer, faster lanes, while balancing cost, power consumption, and footprint per bit. In this presentation, we delve into high-speed protocols such as Ethernet, UALink, and Ultra Ethernet, exploring the first use case where 400G/lane SerDes will potentially be deployed. We'll take a deeper look into different modulation formats along with their benefits and challenges. Special attention will be given to the adoption of optical connectivity. We aim to provide a comprehensive overview of the options available and justify their use in modern cloud service architectures.
Traditional network infrastructure observability tools fall short in AI environments, where interdependence between networking and computing layers directly impacts inference latency and throughput. Modern AI workloads—particularly large language models and computer vision pipelines—demand synchronized visibility across the data transport path (RDMA/GPU-to-GPU) and GPU execution stack to ensure performance consistency, avoid bottlenecks, and support real-time SLAs.
Our panelists will share their views and real-world learnings on the required observability paradigm shifts in open networking, spanning architecture design, the telemetry stack, the policy engine, and other components that drive closed-loop observability.
Stefan is a growth-focused and dynamic executive with extensive experience in leading all facets of technical operations. Stefan is currently serving as CTO of Dorado Software, a leading provider of Fabric Orchestration and Management for Enterprise, Cloud and Telco. Stefan's prior...
The rapid growth of AI chips has increased computational demands, and future high-performance computing (HPC) systems are expected to integrate multiple high-power chips, resulting in total power consumption of over 2.5kW and individual chip power densities exceeding 200W/cm². To tackle these challenges, advanced cooling technologies are essential to lower thermal resistance and efficiently dissipate heat. In this paper, we explore innovative structural designs for cold plates that address critical thermal management challenges for next-generation AI systems, as well as the corresponding thermal test vehicle that can generate different power densities.
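Starting from the figures in the abstract above (over 2.5 kW per system, over 200 W/cm² per chip), a back-of-envelope calculation shows the thermal-resistance budget a cold plate must meet. The per-chip power and temperature limits below are illustrative assumptions, not measured values from the paper.

```python
# Back-of-envelope cold-plate budget. Only the 200 W/cm^2 density comes
# from the abstract; the other figures are illustrative assumptions.
chip_power_w = 800.0      # one high-power chip (assumed)
t_coolant_c = 40.0        # facility coolant temperature (assumed)
t_case_max_c = 72.0       # allowed case temperature (assumed)

# Max allowed case-to-coolant thermal resistance: R = deltaT / P (K/W)
r_max = (t_case_max_c - t_coolant_c) / chip_power_w
print(f"max thermal resistance: {r_max:.3f} K/W")  # 0.040 K/W

# Die area implied by the abstract's 200 W/cm^2 heat flux:
die_area_cm2 = chip_power_w / 200.0
print(f"die area at 200 W/cm^2: {die_area_cm2:.1f} cm^2")  # 4.0 cm^2
```

A budget in the few-hundredths-of-a-K/W range is why microchannel and other structured cold-plate designs matter: plain flat plates struggle to get case-to-coolant resistance that low at these heat fluxes.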
Inference processing of large language models (LLMs) is computationally intensive, and efficient management and reuse of intermediate data, known as KV Cache, are crucial for performance improvement. In this presentation, we propose a novel architecture leveraging NTT's innovative photonics-based networking technology, "IOWN APN (All-Photonics Network)," to enable low-latency, high-bandwidth sharing of large-scale KV Cache among geographically distributed data centers. By exploiting the unique capabilities of IOWN APN, the proposed KV Cache sharing system significantly enhances inference throughput and improves power efficiency, paving the way for reduced environmental impact and more sustainable operational models for LLM inference. Through this presentation, we aim to engage with the OCP community to discuss the potential for wide-area distributed AI computing based on open standards.
In the realm of AI networks, the health of physical links is paramount to ensuring optimal performance and reliability. At Meta, we recognize that robust physical connectivity is crucial for the seamless operation of AI workloads, which demand high-speed and reliable data transmission. This presentation will delve into Meta's comprehensive strategy for maintaining healthy physical links within our AI networks.
We will explore the significance of link health in AI networks, emphasizing how it impacts overall system efficiency and performance. Meta employs advanced physical layer diagnostics, including Pseudo-Random Binary Sequence (PRBS) and Forward Error Correction (FEC) diagnostics, to rigorously test and validate link integrity before deployment into production. These diagnostics help identify potential issues, ensuring only healthy links are operational.
Furthermore, we will discuss Meta's proactive approach to managing link health in production environments. Unhealthy links are swiftly removed from service, and an automated triage pipeline is employed to facilitate effective repairs. This pipeline not only enhances the speed and accuracy of link restoration but also minimizes downtime, thereby maintaining the high reliability standards expected in AI network operations.
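The kind of pre-deployment gate implied by the PRBS/FEC diagnostics above can be sketched as a simple bit-error-ratio check. The BER limit used here is an illustrative ballpark, not Meta's actual threshold; real limits depend on the FEC in use and the operator's margin policy.

```python
def pre_fec_ber(bit_errors, bits_tested):
    """Pre-FEC bit error ratio measured over a PRBS test window."""
    return bit_errors / bits_tested

def link_healthy(bit_errors, bits_tested, ber_limit=2.4e-4):
    # ber_limit is an assumed gate; real limits depend on the FEC
    # (e.g. RS(544,514)) and the margin the operator wants to keep.
    return pre_fec_ber(bit_errors, bits_tested) < ber_limit

# 1e12 bits of PRBS traffic, 5e7 bit errors -> BER 5e-5: link passes.
print(link_healthy(5 * 10**7, 10**12))   # True
# 3e8 errors over the same window -> BER 3e-4: fail, remove from service.
print(link_healthy(3 * 10**8, 10**12))   # False
```

In production the same check runs continuously: links that drift past the gate are drained and handed to the triage pipeline rather than left to corrupt or stall AI traffic.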
This presentation delves into the challenges and opportunities AIaaS providers face in efficiently deploying and managing multi-tenant AI fabrics and clusters. Deploying SONiC AI infrastructure with optimal tuning, especially for an AIaaS provider or an enterprise supporting inference at the edge, can be a complex and daunting task. We will present the features required to simplify deployment of backend AI SONiC fabrics from a controller. Tuning fabrics that support AI must take into consideration factors such as the AI job type and its sensitivity to latency, the tier of the tenant scheduling the job, and the tuning capabilities of the underlying SONiC platforms, and then implement an adaptive solution. The presentation introduces the concept of AI tenancy and how tenancy can be considered when orchestrating and tuning the underlying infrastructure.
As AI models and data-analytics workloads continue to scale, memory bandwidth and capacity have become critical bottlenecks in modern data centers. CXL provides high-capacity, low-latency memory expansion that can be leveraged in several usage models. CXL memory expansion and pooling can significantly enhance SQL workload performance and reduce cloud TCO, particularly for in-memory databases and analytics workloads that are bandwidth- and capacity-constrained. In addition, offloading the key-value (KV) cache, which is critical for efficient autoregressive generation in LLMs, to CXL memory is emerging as an effective strategy for tackling memory bottlenecks and improving throughput in LLM inference serving.
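To see why KV-cache offload to CXL is attractive, a quick size estimate helps. The model shapes below are Llama-2-70B-like (80 layers, 8 grouped KV heads, head dimension 128) and the sequence length and batch size are illustrative assumptions; the arithmetic is the standard 2 x layers x KV heads x head_dim x tokens x bytes.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch,
                   bytes_per_elem=2):
    """Bytes of KV cache an LLM holds for a batch of in-flight requests.

    Factor of 2 covers the separate key and value tensors; bytes_per_elem=2
    assumes fp16/bf16 storage.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch * bytes_per_elem)

# Llama-2-70B-like shapes, 32K context, batch of 16 (assumed workload):
gib = kv_cache_bytes(80, 8, 128, seq_len=32768, batch=16) / 2**30
print(f"{gib:.0f} GiB of KV cache")  # 160 GiB
```

At these sizes the cache alone rivals or exceeds the HBM on a GPU, which is exactly the headroom problem that spilling colder KV entries into CXL-attached memory is meant to relieve.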
The introduction of optical circuit switches (OCSs) is considered key to cost-effectively scaling AI interconnect infrastructure. However, current AI interconnects are realized with vendor-proprietary hardware and software, so the domain lacks interoperability and openness. This can increase both capital and operational expenditure for GPU service providers. Recently, the IOWN Global Forum started an activity to define a reference implementation model for AI interconnect infrastructure. Among the several study items in that activity, this presentation introduces an open network controller framework for managing AI interconnects built with multi-vendor OCSs.
Recent advances in large-scale AI models have placed increasing pressure on the underlying compute architecture to deliver not only raw performance but also programmability and efficiency at scale. This talk introduces the Tensor Contraction Processor (TCP), a novel architecture that reconceptualizes tensor contraction as the central computational primitive, enabling a broader class of operations beyond traditional matrix multiplication. We will present the motivation behind this architectural shift, its implications for compiler design and runtime scheduling, and findings related to performance and energy efficiency. The discussion will also explore how exposing tensor contraction at the hardware level opens opportunities for more expressive and seamless execution strategies, potentially reducing data movement and improving utilization. We will share key learnings from scaling the chip across servers and racks, highlight intersections with relevant OCP Project areas, and discuss how these insights are informing our product roadmap.