LLM Training and Finetuning
Training Platform Architecture
Moore Threads' large-model training platform is fully compatible with CUDA and PyTorch and supports distributed training frameworks such as Megatron-LM, DeepSpeed, FSDP, and Colossal-AI. Combining this compatibility with high performance and flexibility, the platform trains mainstream large models such as GPT, LLaMA, and GLM with one click. On the KUAE thousand-GPU computing cluster, large-model training achieves a linear speedup ratio above 91%. Beyond development, the platform also provides training monitoring, automatic restart, and resumption of interrupted runs.
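As an illustration of what this compatibility means in practice, here is a minimal sketch of a distributed training entry point written against stock PyTorch. It assumes the platform exposes the standard torch.distributed and CUDA interfaces described above; the tiny Linear model is a stand-in for a real LLM, not Moore Threads code.

```python
# Minimal distributed-training sketch in plain PyTorch (illustrative only).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # On a CUDA-compatible stack this standard call works unchanged;
    # the actual backend name on Moore Threads hardware is an assumption.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda()   # stand-in for an LLM block
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(8, 4096, device="cuda")
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()   # gradients are all-reduced across cards
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, for example, torchrun --nproc_per_node=8 train.py, the same script scales from a single card to a multi-node cluster without code changes.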
Training and Finetuning Examples
The MTT S4000 features 128 Tensor Cores, 48 GB of memory, and ultra-fast inter-card communication via MTLink. Through Moore Threads' training platform, it supports training for a range of LLMs, including LLaMA, ChatGPT, ChatGLM, Qwen, and Baichuan. With single-node and multi-node distributed training strategies, it accelerates the training and finetuning of LLMs with 60 to 100 billion parameters.
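As a concrete example of one such strategy, below is a hedged sketch of sharded finetuning with PyTorch FSDP, one of the frameworks named above. The model dimensions are illustrative; a real 60-100 billion-parameter run would combine sharding with tensor and pipeline parallelism.

```python
# FSDP sharding sketch (illustrative sizes; launch with torchrun).
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")  # backend name is an assumption
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

layers = torch.nn.Sequential(
    *[torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16) for _ in range(4)]
).cuda()

# FSDP shards parameters, gradients, and optimizer state across cards,
# which is what lets tens-of-billions of parameters fit in 48 GB per card.
model = FSDP(layers)
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)

x = torch.randn(2, 128, 1024, device="cuda")
loss = model(x).mean()
loss.backward()
opt.step()
```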
Cluster Scaling Efficiency
The Moore Threads KUAE Computing Cluster platform, designed for billion-parameter-scale model pretraining, finetuning, and inference, achieves a 91% linear speedup ratio in a thousand-card cluster. The platform is optimized end to end across applications, distributed systems, training frameworks, communication libraries, firmware, operators, and hardware. MTLink, a proprietary interconnect technology built into the MTT S4000 architecture, supports MTLink Bridge connections across two, four, and eight cards and delivers inter-card bandwidth of up to 240 GB/s. This accelerates training in clusters of 64 to 1,024 cards and improves the linearity of multi-card scaling.
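For clarity, the 91% figure can be read as a scaling-efficiency ratio. The formula below is the conventional definition, which the page does not spell out, and the numbers are illustrative rather than measured results.

```python
# Linear speedup ratio: measured N-card throughput divided by N times
# the single-card throughput (assumed definition; not from the source).
def linear_speedup_ratio(throughput_n_cards: float,
                         throughput_1_card: float,
                         n: int) -> float:
    return throughput_n_cards / (n * throughput_1_card)

# e.g. a 1024-card cluster delivering 932x single-card throughput:
print(linear_speedup_ratio(932.0, 1.0, 1024))  # ~0.91, i.e. the quoted 91%
```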
Large Model Inference Service Platform
The MTT S4000, equipped with 128 Tensor Cores and 48 GB of memory, effectively supports inference for mainstream LLMs such as LLaMA, ChatGLM, Qwen, and Baichuan.
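The snippet below sketches what inference looks like from the user's side, assuming standard Hugging Face Transformers running on the platform's PyTorch compatibility layer. The model ID is a placeholder, and any serving-platform-specific API is omitted.

```python
# Hedged inference sketch via Hugging Face Transformers (illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-7B-Chat"  # placeholder; any supported LLM checkpoint
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
).cuda()

inputs = tok("What is MTLink?", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```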
From Chips to Clusters
Accelerating the Scale-Up of Chinese Computing Power
Supports the KUAE Series
A server optimized for large-model training clusters, providing excellent support for the Moore Threads Full-Stack Solution for AI Data Centers.
Supports the KUAE AIDC Software Stack
Supports More than Just Large Models
MCCX D800
Specifications
FP32 compute: 200 TFLOPS
FP16 compute: 800 TFLOPS
Data drives: 4 × 3.84 TB PCIe Gen 4 NVMe SSDs
Networking: 2 × dual-port 25G fiber network cards
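As a back-of-envelope check, the node-level figures above imply the per-card throughput below, under the assumption (not stated in the spec list) that an MCCX D800 node carries eight MTT S4000 GPUs.

```python
# Per-card throughput implied by the node totals (8-GPU node is assumed).
node_fp32_tflops = 200
node_fp16_tflops = 800
gpus_per_node = 8  # assumption, not from the spec sheet

print(node_fp32_tflops / gpus_per_node)  # 25.0 TFLOPS FP32 per card
print(node_fp16_tflops / gpus_per_node)  # 100.0 TFLOPS FP16 per card
```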