LLM Training and Finetuning
Training Platform Architecture
Moore Threads' large-model training platform is fully compatible with CUDA and PyTorch and supports distributed training frameworks such as Megatron-LM, DeepSpeed, FSDP, and Colossal-AI. Combining this compatibility with high performance and flexibility, the platform trains mainstream large models such as GPT, LLaMA, and GLM with one click. On the KUAE thousand-GPU computing cluster, large-model training achieves a linear speedup ratio above 91%. Beyond development, the platform also provides training monitoring, automatic restart, and resumption of interrupted runs.
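As an illustration of what this compatibility means in practice, here is a minimal sketch of a distributed training entry point written against stock PyTorch. It assumes the platform exposes the standard torch.distributed and CUDA interfaces described above; the tiny Linear model is a stand-in for a real LLM, not Moore Threads code.

```python
# Minimal distributed-training sketch in plain PyTorch (illustrative only).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # On a CUDA-compatible stack this standard call works unchanged;
    # the actual backend name on Moore Threads hardware is an assumption.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda()   # stand-in for an LLM block
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(8, 4096, device="cuda")
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()   # gradients are all-reduced across cards
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, for example, torchrun --nproc_per_node=8 train.py, the same script scales from a single card to a multi-node cluster without code changes.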
Training and Finetuning Examples
The MTT S4000 features 128 Tensor Cores, 48 GB of memory, and ultra-fast inter-card communication via MTLink. Through Moore Threads' training platform, it supports training for a range of LLMs, including LLaMA, ChatGPT, ChatGLM, Qwen, and Baichuan. With single-node and multi-node distributed training strategies, it accelerates the training and finetuning of LLMs with 60 to 100 billion parameters.
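As a concrete example of one such strategy, below is a hedged sketch of sharded finetuning with PyTorch FSDP, one of the frameworks named above. The model dimensions are illustrative; a real 60-100 billion-parameter run would combine sharding with tensor and pipeline parallelism.

```python
# FSDP sharding sketch (illustrative sizes; launch with torchrun).
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")  # backend name is an assumption
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

layers = torch.nn.Sequential(
    *[torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16) for _ in range(4)]
).cuda()

# FSDP shards parameters, gradients, and optimizer state across cards,
# which is what lets tens-of-billions of parameters fit in 48 GB per card.
model = FSDP(layers)
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)

x = torch.randn(2, 128, 1024, device="cuda")
loss = model(x).mean()
loss.backward()
opt.step()
```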
Cluster Scaling Efficiency
The Moore Threads KUAE Computing Cluster platform, designed for billion-parameter-scale model pretraining, finetuning, and inference, achieves a 91% linear speedup ratio in a thousand-card cluster. The platform is optimized end to end across applications, distributed systems, training frameworks, communication libraries, firmware, operators, and hardware. MTLink, a proprietary interconnect technology built into the MTT S4000 architecture, supports MTLink Bridge connections across two, four, and eight cards and delivers inter-card bandwidth of up to 240 GB/s. This accelerates training in clusters of 64 to 1,024 cards and improves the linearity of multi-card scaling.
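For clarity, the 91% figure can be read as a scaling-efficiency ratio. The formula below is the conventional definition, which the page does not spell out, and the numbers are illustrative rather than measured results.

```python
# Linear speedup ratio: measured N-card throughput divided by N times
# the single-card throughput (assumed definition; not from the source).
def linear_speedup_ratio(throughput_n_cards: float,
                         throughput_1_card: float,
                         n: int) -> float:
    return throughput_n_cards / (n * throughput_1_card)

# e.g. a 1024-card cluster delivering 932x single-card throughput:
print(linear_speedup_ratio(932.0, 1.0, 1024))  # ~0.91, i.e. the quoted 91%
```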
Large Model Inference Service Platform
The MTT S4000, equipped with 128 Tensor Cores and 48 GB of memory, effectively supports inference for mainstream LLMs such as LLaMA, ChatGLM, Qwen, and Baichuan.
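The snippet below sketches what inference looks like from the user's side, assuming standard Hugging Face Transformers running on the platform's PyTorch compatibility layer. The model ID is a placeholder, and any serving-platform-specific API is omitted.

```python
# Hedged inference sketch via Hugging Face Transformers (illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-7B-Chat"  # placeholder; any supported LLM checkpoint
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
).cuda()

inputs = tok("What is MTLink?", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```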
From Chips to Clusters
Accelerating the Scale-Up of Chinese Computing Power
Supports the KUAE Series
A server optimized for large-model training clusters, providing excellent support for the Moore Threads Full-Stack Solution for AI Data Centers.
Supports the KUAE AIDC Software Stack
Supports More than Just Large Models
MCCX D800
Specifications
FP32 compute: 200 TFLOPS
FP16 compute: 800 TFLOPS
Data drives: 4 × 3.84 TB PCIe Gen 4 NVMe SSDs
Networking: 2 × dual-port 25G fiber network cards
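As a back-of-envelope check, the node-level figures above imply the per-card throughput below, under the assumption (not stated in the spec list) that an MCCX D800 node carries eight MTT S4000 GPUs.

```python
# Per-card throughput implied by the node totals (8-GPU node is assumed).
node_fp32_tflops = 200
node_fp16_tflops = 800
gpus_per_node = 8  # assumption, not from the spec sheet

print(node_fp32_tflops / gpus_per_node)  # 25.0 TFLOPS FP32 per card
print(node_fp16_tflops / gpus_per_node)  # 100.0 TFLOPS FP16 per card
```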