
OCS All-Optical Switching: Trends and Technologies in Optical Interconnection Networks for Smart Computing Centers in the AIGC Era

Jul 11, 2024

The explosive growth of AIGC has placed unprecedented demands on transmission bandwidth and power consumption. Intelligent computing centers, the core of information processing, rely heavily on optical interconnection networks. All-optical switching in particular offers high bandwidth, low power consumption, and minimal latency, which makes it a key enabling technology for large-scale data interconnection. At OptiNet China 2024, Dr. Zhang Hua, Director of Solutions at LUSTER's Fiber Optic Components and Instruments Division, presented recent advances in optical interconnection network technologies for smart computing centers in the AIGC era, along with potential OCS applications in these environments.



Trends and Challenges of AIGC Data Centers

In the AIGC era, data center optical interconnections face a challenge that can be summarized as "two highs and two lows": high bandwidth and high reliability, together with low power consumption and low latency.

•    High Bandwidth and High Reliability: As AI models grow in size and complexity, the demand for data transmission rates has soared. Traditional network architectures struggle to meet these requirements, whereas optical interconnection networks can provide the bandwidth needed for efficient data transmission. During AI training, added network latency or packet loss can significantly affect training outcomes, so optical interconnection networks must also be highly reliable to keep data transmission stable and accurate.

•    Low Power Consumption and Low Latency: AI clusters have immense energy demands. During large-scale training tasks in particular, the power consumption of network devices such as optical modules and switches rises significantly, so new technologies and architectures are needed to reduce the energy consumption of optical interconnection networks. Low latency is equally important: in large-scale parallel computing tasks, any additional delay can lower the overall performance of the cluster.




[Figure: Evolution of AI Large Model Parameters]

AI clusters increasingly demand both scale and flexibility. Traditional fixed connections at the L1 layer (physical layer) cannot meet these evolving needs, whereas a reconfigurable optical interconnection network, built by introducing optical switches, allows an AI cluster to be adjusted dynamically and expanded flexibly. For instance, Google's PaLM model was trained across two TPU supercomputers, each with roughly 4,000 TPU chips, in a run lasting more than 50 days. Over a run of that length, any hardware failure could mean extended downtime for troubleshooting and repair. A reconfigurable optical interconnection network instead enables millisecond-level failover, which significantly improves system stability and reliability.
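
The failover idea can be illustrated with a minimal sketch. It treats an optical switch as nothing more than a one-to-one map from input ports to output ports, so recovering from a failure amounts to re-pointing one entry at a spare port. The class name, method names, and port numbers are hypothetical and chosen for illustration only; they do not describe any vendor's control interface.

```python
class OpticalCircuitSwitch:
    """Toy model of an OCS: a one-to-one map from input ports to output ports."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.cross_connects = {}  # input port -> output port

    def connect(self, in_port, out_port):
        """Set up a light path, refusing to share an output port."""
        for port in (in_port, out_port):
            if not 0 <= port < self.num_ports:
                raise ValueError(f"port {port} out of range")
        if out_port in self.cross_connects.values():
            raise ValueError(f"output port {out_port} already in use")
        self.cross_connects[in_port] = out_port

    def fail_over(self, failed_out_port, spare_out_port):
        """Re-point the circuit that ends on a failed port to a spare port."""
        if spare_out_port in self.cross_connects.values():
            raise ValueError(f"spare port {spare_out_port} already in use")
        moved = [i for i, o in self.cross_connects.items() if o == failed_out_port]
        for in_port in moved:
            self.cross_connects[in_port] = spare_out_port
        return moved


# Hypothetical example: two servers attached, port 7 held as a spare.
ocs = OpticalCircuitSwitch(num_ports=8)
ocs.connect(0, 4)
ocs.connect(1, 5)
moved_inputs = ocs.fail_over(failed_out_port=5, spare_out_port=7)
print(moved_inputs, ocs.cross_connects)
```

In this model the recovery step is a single table update and no cabling is touched, which is the property that lets a physical-layer switch shorten fault recovery.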


OCS All-Optical Switching in AIGC Data Centers

OCS (Optical Circuit Switching) technology has gained significant attention in recent years, largely because of Google's advocacy, and its use in data centers is steadily increasing. Compared with traditional electrical switching, OCS offers low latency and low power consumption, and it is optically transparent: signals are switched without optical-electrical conversion, so the same switch can carry successive generations of line rates. This allows smooth transitions across multiple rate upgrades and reduces operational costs. OCS also makes the physical layer reconfigurable, so the topology can be matched to the demands of different training tasks while network reliability improves.

For instance, NVIDIA has introduced OCS between its AI servers and Leaf nodes to achieve fault protection and recovery, significantly reducing fault recovery time. Google has also adopted OCS technology in its TPU v4 and TPU v5 networks, improving performance and availability through topological reconfiguration. According to Google's research, deploying OCS in large-scale clusters not only boosts system availability but also optimizes the performance of training tasks.
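
A back-of-the-envelope model shows why reconfigurability can raise availability. It is a simplified illustration rather than a reproduction of Google's analysis. It assumes that compute blocks fail independently, that a job needs a fixed number of healthy blocks, and that an OCS can wire together any healthy blocks, whereas a statically cabled job depends on one specific set. The function names and the numbers in the example are hypothetical.

```python
from math import comb


def availability_fixed(p_up, blocks_needed):
    """Static wiring: the job needs one specific set of blocks, all healthy."""
    return p_up ** blocks_needed


def availability_with_ocs(p_up, blocks_needed, blocks_total):
    """Reconfigurable wiring: any `blocks_needed` healthy blocks out of
    `blocks_total` will do (binomial tail probability)."""
    return sum(
        comb(blocks_total, k) * p_up**k * (1 - p_up) ** (blocks_total - k)
        for k in range(blocks_needed, blocks_total + 1)
    )


# Hypothetical numbers: each block is up 90% of the time, the job needs 4 blocks,
# and the reconfigurable cluster holds one extra block as a spare.
fixed = availability_fixed(0.9, 4)
flexible = availability_with_ocs(0.9, 4, 5)
print(f"fixed wiring: {fixed:.4f}, with OCS and one spare: {flexible:.4f}")
```

Under these assumptions a single spare block that the switch can patch in on demand raises the chance of having a usable slice considerably. The real gain depends on actual failure rates and on how many spares are provisioned.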



[Figure: NVIDIA L1-Layer Dynamic Reconfigurability Significantly Reduces Fault Convergence Time: From Hours to Seconds]


[Figure: Google TPU v4 OCS Interconnection Solution]



Key Technologies and Application Prospects of OCS

Current commercial OCS solutions are based primarily on two technologies: DirectLight DBS (direct beam steering) and MEMS. MEMS-based OCS with small-to-medium switching matrices has already been deployed in data center optical switching networks. However, as AI clusters expand from thousands of cards to tens of thousands or more, OCS with larger switching matrices is required, which places greater demands on OCS yield and reliability. DirectLight DBS technology is based on beam deflection control: it switches optical signals by dynamically adjusting the optical path, and it has demonstrated high reliability and stability as port counts scale up. It has already been applied in large-scale AI cluster smart computing centers and has promising prospects.


[Figure: The DirectLight™ beam-steering technology]


In closing, Dr. Zhang Hua noted that as HPC systems and data centers continue to scale, the demands for lower power consumption, reduced latency, and enhanced reliability keep rising. OCS all-optical switching is well suited to these requirements and has already been deployed successfully in smart computing centers and data centers, with Google as the leading example. Looking ahead, larger cluster sizes will call for OCS with higher port counts, combined with OEO (Optical-Electrical-Optical) switching in a hybrid switching architecture. In addition, as OCS moves further down the data center hierarchy (from Spine to Leaf), faster switching speeds and cost-effective OCS with smaller port counts will be needed to further improve data center efficiency and performance.
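
One way to picture a hybrid architecture is as a placement decision: which traffic gets a dedicated optical circuit and which stays on the OEO packet fabric. The sketch below uses a simple size-based rule, which is an assumption made for illustration and was not described in the talk. The `Flow` type, the threshold, and the circuit budget are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    src: str
    dst: str
    gigabytes: float


def assign_fabric(flows, circuit_threshold_gb=10.0, max_circuits=2):
    """Give the largest flows an optical circuit; send the rest to the packet fabric."""
    plan = {}
    circuits_used = 0
    # Consider the biggest transfers first, since they benefit most from a circuit.
    for flow in sorted(flows, key=lambda f: f.gigabytes, reverse=True):
        wants_circuit = flow.gigabytes >= circuit_threshold_gb
        if wants_circuit and circuits_used < max_circuits:
            plan[(flow.src, flow.dst)] = "ocs"
            circuits_used += 1
        else:
            plan[(flow.src, flow.dst)] = "packet"
    return plan


# Hypothetical traffic between four racks.
flows = [
    Flow("rack-a", "rack-b", 50.0),
    Flow("rack-a", "rack-c", 0.1),
    Flow("rack-b", "rack-c", 20.0),
    Flow("rack-c", "rack-d", 15.0),
]
plan = assign_fabric(flows)
for pair, fabric in plan.items():
    print(pair, fabric)
```

The circuit budget stands in for the limited number of OCS ports. Once it is used up, even large flows fall back to the packet fabric, which is why higher port counts matter as clusters grow.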