MiniMax-M2.1 quantized to NVFP4 with NVIDIA ModelOpt

Not yet extensively tested, but does appear to work fine on 2x and 4x RTX 6000 Pro Blackwell via vLLM.

If you see "No available shared memory broadcast block found in 60 seconds." in the logs, be patient; this warning is expected during the long startup and model load.

Sample docker run (you will want to mount your Hugging Face cache directory so the model isn't re-downloaded on every start; see the note after the command):

docker run -d \
  --gpus all \
  --ipc host \
  --shm-size 32g \
  --ulimit memlock=-1 \
  --ulimit nofile=1048576 \
  -v /dev/shm:/dev/shm \
  -e NCCL_IB_DISABLE=1 \
  -e NCCL_NVLS_ENABLE=0 \
  -e NCCL_P2P_DISABLE=0 \
  -e NCCL_SHM_DISABLE=0 \
  -e VLLM_USE_V1=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e OMP_NUM_THREADS=8 \
  -e SAFETENSORS_FAST_GPU=1 \
  -p 0.0.0.0:8000:8000 \
  vllm/vllm-openai:nightly-96142f209453a381fcaf9d9d010bbf8711119a77 \
  lukealonso/MiniMax-M2.1-NVFP4 \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name "MiniMax-M2.1" \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.95 \
  --all2all-backend pplx \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 16
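
To avoid re-downloading the weights each time the container starts, mount your Hugging Face cache into the container. A minimal sketch, assuming the default cache location of ~/.cache/huggingface on the host; add it alongside the other -v mount above:

  -v ~/.cache/huggingface:/root/.cache/huggingface \

For the 4-GPU setup, set --tensor-parallel-size to 4 to match the GPU count.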
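
Once the server is up, a quick smoke test against the OpenAI-compatible endpoint (the model name matches --served-model-name above):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMax-M2.1",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 64
  }'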