MiniMax-M2.1 quantized to NVFP4 with NVIDIA ModelOpt

Not yet extensively tested, but does appear to work fine on 2x and 4x RTX 6000 Pro Blackwell via vLLM.

If you see "No available shared memory broadcast block found in 60 seconds." in the logs, be patient; this warning is expected during the long startup and model load.

Sample docker run (you will want to mount your Hugging Face cache directory so the model isn't re-downloaded on every start; see the note after the command):

docker run -d \
  --gpus all \
  --ipc host \
  --shm-size 32g \
  --ulimit memlock=-1 \
  --ulimit nofile=1048576 \
  -v /dev/shm:/dev/shm \
  -e NCCL_IB_DISABLE=1 \
  -e NCCL_NVLS_ENABLE=0 \
  -e NCCL_P2P_DISABLE=0 \
  -e NCCL_SHM_DISABLE=0 \
  -e VLLM_USE_V1=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e OMP_NUM_THREADS=8 \
  -e SAFETENSORS_FAST_GPU=1 \
  -p 0.0.0.0:8000:8000 \
  vllm/vllm-openai:nightly-96142f209453a381fcaf9d9d010bbf8711119a77 \
  lukealonso/MiniMax-M2.1-NVFP4 \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name "MiniMax-M2.1" \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.95 \
  --all2all-backend pplx \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 16
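
To avoid re-downloading the weights each time the container starts, mount your Hugging Face cache into the container. A minimal sketch, assuming the default cache location of ~/.cache/huggingface on the host; add it alongside the other -v mount above:

  -v ~/.cache/huggingface:/root/.cache/huggingface \

For the 4-GPU setup, set --tensor-parallel-size to 4 to match the GPU count.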
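
Once the server is up, a quick smoke test against the OpenAI-compatible endpoint (the model name matches --served-model-name above):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMax-M2.1",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 64
  }'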