MCCX D800

LLM Training and Finetuning


Training Platform Architecture

Moore Threads' large model training platform is fully compatible with CUDA and PyTorch and supports distributed training frameworks such as Megatron-LM, DeepSpeed, FSDP, and Colossal-AI. This compatibility, combined with high performance and flexibility, makes the platform well suited to one-click training of mainstream large models such as GPT, LLaMa, and GLM. On the KUAE thousand-GPU computing cluster, large model training achieves a linear speedup ratio above 91%. Beyond development, the platform also provides training supervision, automatic restart, and resumption of interrupted runs.
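To make the compatibility claim concrete, here is a minimal sketch of one distributed training step using PyTorch FSDP, one of the frameworks listed above. It assumes the platform's CUDA/PyTorch compatibility lets stock torch.distributed code run unmodified on MTT GPUs; the model, batch shapes, and the "nccl" backend name are illustrative placeholders, not Moore Threads specifics.

```python
import torch
import torch.distributed as dist
from torch import nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # Vendor stacks typically expose an NCCL-compatible backend; "nccl" is an assumption here.
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # Placeholder network standing in for a GPT/LLaMa/GLM-style model.
    model = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True),
        num_layers=24,
    ).cuda()
    model = FSDP(model)  # shard parameters, gradients, and optimizer state across ranks

    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x = torch.randn(8, 512, 1024, device="cuda")  # synthetic batch
    loss = model(x).float().pow(2).mean()          # dummy objective for the sketch
    loss.backward()
    optim.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched in the usual way, e.g. `torchrun --nproc_per_node=8 train.py` on one 8-card node, the same script scales to multi-node runs by adding the standard rendezvous arguments.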

Training and Finetuning Examples

MTT S4000 features 128 Tensor Cores, 48 GB of memory, and ultra-fast inter-card communication over MTLink. Through Moore Threads' training platform, it supports training of various LLMs, including LLaMa, ChatGPT, ChatGLM, Qwen, and Baichuan. With distributed training strategies for single-node and multi-node systems, it accelerates the training and finetuning of LLMs with 60 to 100 billion parameters.
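As a rough check on the 60-to-100-billion-parameter claim, the sketch below estimates how many 48 GB cards full parameter sharding (ZeRO-3/FSDP) would need for mixed-precision Adam finetuning. The 16-bytes-per-parameter rule of thumb and the 1.3x activation/fragmentation margin are common heuristics, not Moore Threads figures.

```python
import math

# Rule of thumb for mixed-precision Adam: fp16 weights (2 B) + fp16 grads (2 B)
# + fp32 master weights (4 B) + fp32 Adam moments m and v (4 B + 4 B).
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4   # = 16 bytes per parameter
CARD_MEMORY_GB = 48                   # MTT S4000 memory, per the spec sheet below

def min_cards(params_billion: float, margin: float = 1.3) -> int:
    """Minimum card count under full ZeRO-3/FSDP sharding of all training state."""
    total_gb = params_billion * BYTES_PER_PARAM * margin  # 1e9 params * bytes = GB
    return math.ceil(total_gb / CARD_MEMORY_GB)

for p in (60, 100):
    print(f"{p}B parameters -> at least {min_cards(p)} cards")
# 60B -> 26 cards; 100B -> 44 cards (before parallelism-strategy overheads)
```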

Cluster Scaling Efficiency

The Moore Threads KUAE computing cluster platform, designed for pretraining, finetuning, and inference of billion-parameter models, achieves a 91% linear speedup ratio on a thousand-card cluster. The platform is optimized across applications, distributed systems, training frameworks, communication libraries, firmware, operators, and hardware. Featuring MTLink, a proprietary interconnect technology built on the MTT S4000 architecture, the platform supports MTLink Bridge connections across two, four, and eight cards, with inter-card bandwidth of up to 240 GB/s. This accelerates training in clusters of 64 to 1024 cards and improves the linearity of multi-card interconnects.
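The linear speedup ratio quoted above is simply measured N-card throughput divided by N times single-card throughput. A minimal sketch, using hypothetical token rates rather than published MTT S4000 measurements:

```python
def linear_speedup(cluster_tps: float, single_tps: float, n_cards: int) -> float:
    """Measured N-card throughput over the ideal of N times single-card throughput."""
    return cluster_tps / (n_cards * single_tps)

single = 1_000.0        # tokens/s on one card (hypothetical figure)
cluster = 931_840.0     # tokens/s measured on 1024 cards (hypothetical figure)
print(f"{linear_speedup(cluster, single, 1024):.1%}")   # -> 91.0%
```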

Large Model Inference Service Platform

MTT S4000, equipped with 128 Tensor Cores and 48 GB of memory, effectively supports inference for mainstream LLMs, such as LLaMa, ChatGLM, Qwen, and Baichuan.
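A minimal inference sketch, assuming the CUDA compatibility described above lets stock Hugging Face Transformers treat the MTT S4000 as an ordinary CUDA device; the checkpoint name is a placeholder for whichever supported LLaMa/ChatGLM/Qwen/Baichuan weights you deploy.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-7B-Chat"  # placeholder checkpoint, not a platform default
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16  # fp16 to fit within the 48 GB card
).to("cuda")

inputs = tok("Explain MTLink in one sentence.", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```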

From Chips to Clusters
Accelerating the Scale-Up of Chinese Computing Power


Supports the KUAE Series

A server optimized for large-model training clusters, providing strong support for the Moore Threads Full-Stack Solution for AI Data Centers.

Supports the KUAE AIDC Software Stack


Supports More than Just Large Models

 

MCCX D800 Specifications

Form Factor: 4U server
CPU: 2 × Intel® Xeon® Gold 6430 (2.1 GHz / 32 cores / 60 MB cache / 270 W)
GPU: 8 × MTT S4000 (PCIe Gen 5, 48 GB VRAM each); FP32: 200 TFLOPS; FP16: 800 TFLOPS
Memory: 16 × 64 GB DDR5-4800 RDIMMs (1 TB total)
Storage: 2 × 480 GB SATA SSDs (system); 4 × 3.84 TB PCIe Gen 4 NVMe SSDs (data)
Inter-Card Connectivity: MTLink 1.0 + PCIe Gen 5 P2P
Network Cards: 2 × single-port 400G InfiniBand NDR / ConnectX-7 Ethernet adapters; 2 × dual-port 25G fiber NICs
Power Supply: 4 × 2,400 W hot-swappable redundant (N+M), Platinum efficiency
Rails: Standard rails
Rated Power: 6,000 W