Releasing New AI Storage Products for the Era of Large Models

Posted: 2023-07-25 11:41:22


Today, we are releasing new AI storage products for the era of large models, providing optimal storage solutions for foundation model training, industry model training, and segmented-scenario model training and inference, unleashing new AI momentum.


Enterprises face four major challenges in the development and implementation of large model applications:


First, data preparation takes too long: data sources are scattered and collection is slow, and preprocessing 100 TB of data takes about 10 days. Second, multimodal large models train on massive sets of text and images, but the current loading speed for massive numbers of small files is under 100 MB/s, making training-set loading inefficient. Third, large-model parameters are tuned frequently and training platforms are unstable: on average a training run is interrupted every two days, a checkpoint mechanism is needed to resume training, and fault recovery takes more than a day. Finally, the barrier to deploying large models is high: system construction is complex, resource scheduling is difficult, and GPU utilization is typically below 40%.
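The checkpoint mechanism mentioned above can be sketched as a periodic save-and-resume loop. The following is a minimal, framework-agnostic illustration, not code from any Huawei product; the file name, step interval, and `train()` structure are assumptions for the sketch:

```python
import os
import pickle

CKPT_PATH = "train.ckpt"  # hypothetical checkpoint file name

def save_checkpoint(step, state, path=CKPT_PATH):
    """Persist training progress so a crash loses at most one interval."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # atomic rename: never leaves a half-written checkpoint

def load_checkpoint(path=CKPT_PATH):
    """Return (step, state) from the last checkpoint, or a fresh start."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            ckpt = pickle.load(f)
        return ckpt["step"], ckpt["state"]
    return 0, {"loss": None}

def train(total_steps=10, ckpt_every=3):
    step, state = load_checkpoint()   # resume where the last run stopped
    while step < total_steps:
        step += 1
        state["loss"] = 1.0 / step    # stand-in for one real training step
        if step % ckpt_every == 0 or step == total_steps:
            save_checkpoint(step, state)
    return step, state
```

If the process dies mid-run, the next invocation of `train()` restarts from the last saved step rather than from zero, which is why checkpoint read/write bandwidth directly determines how long fault recovery takes.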


Following the AI development trend in the era of large models, Huawei has launched OceanStor A310 deep learning data lake storage and the FusionCube A3000 training/inference hyper-converged appliance for large-model applications across different industries and scenarios.


OceanStor A310 deep learning data lake storage targets data lake scenarios for foundation and industry large models, managing massive data across the full AI pipeline, from data collection and preprocessing to model training, inference, and application. A single 5U OceanStor A310 chassis delivers the industry's highest bandwidth of 400 GB/s and up to 12 million IOPS, scales linearly to 4,096 nodes, and provides lossless multi-protocol interoperability. Its global file system (GFS) enables cross-region intelligent data weaving, simplifying data collection, while near-storage computing performs preprocessing close to the data, reducing data movement and improving preprocessing efficiency by 30%.


The FusionCube A3000 training/inference hyper-converged appliance targets training and inference scenarios for industry large models. For 10-billion-parameter-class model applications, it integrates OceanStor A300 high-performance storage nodes, training/inference nodes, switching equipment, AI platform software, and management and O&M software, giving large-model partners a turnkey deployment experience with one-stop delivery. It works out of the box, with deployment completed within two hours. Training/inference nodes and storage nodes can be scaled out independently to match models of different sizes. FusionCube A3000 also shares GPUs across multiple model training and inference tasks through high-performance containers, raising resource utilization from 40% to over 70%. It supports two flexible business models: Huawei's Ascend one-stop solution, and a third-party partner one-stop solution with open computing, networking, and AI platform software.

