NVFP4 quantization of MiniMax-M2.1, produced with NVIDIA Model Optimizer (modelopt).
Not yet extensively tested, but it appears to work fine on 2x and 4x RTX 6000 Pro Blackwell GPUs via vLLM.
If you see "No available shared memory broadcast block found in 60 seconds.", be patient.
Sample docker run (you will want to mount your Hugging Face cache directory into the container so the model isn't re-downloaded on every start; see the example flag after the command):
docker run -d \
--gpus all \
--ipc host \
--shm-size 32g \
--ulimit memlock=-1 \
--ulimit nofile=1048576 \
-v /dev/shm:/dev/shm \
-e NCCL_IB_DISABLE=1 \
-e NCCL_NVLS_ENABLE=0 \
-e NCCL_P2P_DISABLE=0 \
-e NCCL_SHM_DISABLE=0 \
-e VLLM_USE_V1=1 \
-e VLLM_USE_FLASHINFER_MOE_FP4=1 \
-e OMP_NUM_THREADS=8 \
-e SAFETENSORS_FAST_GPU=1 \
-p 0.0.0.0:8000:8000 \
vllm/vllm-openai:nightly-96142f209453a381fcaf9d9d010bbf8711119a77 \
lukealonso/MiniMax-M2.1-NVFP4 \
--host 0.0.0.0 \
--port 8000 \
--served-model-name "MiniMax-M2.1" \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--trust-remote-code \
--tensor-parallel-size 2 \
--enable-expert-parallel \
--dtype auto \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.95 \
--all2all-backend pplx \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 16384 \
--max-num-seqs 16
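
For example, adding a bind mount like the one below lets the container reuse weights already present in the host's Hugging Face cache. This assumes the default ~/.cache/huggingface location on the host; adjust the host path if you have HF_HOME pointed somewhere else.

# extra docker run flag: reuse the host's HF cache inside the container
-v ~/.cache/huggingface:/root/.cache/huggingface \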
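
Once the server is up, a quick sanity check against the OpenAI-compatible endpoint looks like the sketch below. The model name must match --served-model-name above; the prompt and max_tokens are arbitrary.

# simple chat completion request against the running server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMax-M2.1",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'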
Base model: MiniMaxAI/MiniMax-M2.1