# MiniMax-M2.7 TurboQuant+ Config-I (MLX)
93.5% MMLU at 87 GB. 61 tok/s decode. PPL 4.604. 228B-parameter MoE compressed 62% with Config-I mixed-precision quantization. Standard MLX format; works with stock mlx_lm and mlx-swift-lm. No custom loaders required.
Config-I quantization of MiniMaxAI/MiniMax-M2.7 (228.7B total, ~1.4B active per token). The policy applies aggressive 2-bit compression to expert MLPs (where MoE is most tolerant), protects attention at 4-bit, and shields boundary layers and routing at full precision. See the Config-I paper for the policy derivation.
## Compression
| | Size |
|---|---|
| FP8 source | 230 GB |
| Config-I (3.25 bpw) | 87 GB |
| Reduction | 62% |
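The headline size can be sanity-checked from the average bit-width reported in the compatibility table below (228.7B total parameters at 3.249 bpw), reading the 87 GB figure as GiB:

```python
# Back-of-envelope size check: total parameters x average bits per weight.
total_params = 228.7e9  # MiniMax-M2.7 total parameter count
avg_bpw = 3.249         # average bits per weight under the Config-I mix

size_bytes = total_params * avg_bpw / 8
size_gib = size_bytes / 2**30
reduction = 1 - size_gib / 230  # vs the 230 GB FP8 source

print(f"{size_gib:.1f} GiB, {reduction:.0%} reduction")
```

This lands at roughly 86.5 GiB and a 62% reduction, consistent with the table.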
## Quality

Perplexity: 4.604 ± 0.042 (wikitext, 50 samples, 2048-token sequences, with turbo4v2 KV compression)

MMLU (200 questions, single pass, reasoning ON):
| Subject | Score |
|---|---|
| Abstract Algebra | 18/20 |
| Anatomy | 19/20 |
| Astronomy | 19/20 |
| College CS | 18/20 |
| College Physics | 19/20 |
| HS Biology | 20/20 |
| HS Chemistry | 17/20 |
| HS Math | 20/20 |
| Logical Fallacies | 19/20 |
| World Religions | 18/20 |
| TOTAL | 187/200 (93.5%) |
Methodology: single-pass, 200 questions (10 MMLU subjects x 20), reasoning enabled, no retries, no few-shot, evaluated with mlx_lm on Apple M5 Max 128 GB.
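The grading step in a single-pass run like this is mechanical. A hypothetical sketch of the scoring logic — the "Answer: X" extraction convention is an assumption for illustration, not necessarily how this run parsed outputs:

```python
import re

def extract_choice(completion: str):
    """Pull the final A/B/C/D answer from a reasoning-style completion.

    Assumes the model ends with something like 'Answer: C'; this
    parsing convention is illustrative only.
    """
    matches = re.findall(r"[Aa]nswer\s*[:\-]?\s*\(?([ABCD])\)?", completion)
    return matches[-1] if matches else None

def score(completions, gold):
    # Single pass, no retries: an unparseable answer simply counts as wrong.
    correct = sum(extract_choice(c) == g for c, g in zip(completions, gold))
    return correct, len(gold)
```

Under this scheme 187 correct out of 200 yields the reported 93.5%.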
NIAH (Needle in a Haystack): 12/12 (100%)
| Context | 10% depth | 50% depth | 90% depth |
|---|---|---|---|
| 1.4K | ✓ | ✓ | ✓ |
| 2.4K | ✓ | ✓ | ✓ |
| 4.4K | ✓ | ✓ | ✓ |
| 8.3K | ✓ | ✓ | ✓ |
## Speed (Apple M5 Max 128 GB)
All benchmarks with turbo4v2 KV compression enabled. Measured with ekryski/mlx-swift-lm (ek/tom-eric-moe-tuning branch).
### Prefill

The "Bridge" column uses a native C++ prefill path that bypasses Swift overhead for 5-48% faster prompt processing; gains peak at +48% for 16K-token prompts, with +26-32% in the 512-1024 range.
| Context | Bridge + turbo4v2 | Swift + turbo4v2 | Swift vanilla | Bridge vs Swift (turbo4v2) |
|---|---|---|---|---|
| 128 | 199 t/s | 185 t/s | 185 t/s | +8% |
| 256 | 281 t/s | 267 t/s | 267 t/s | +5% |
| 512 | 368 t/s | 293 t/s | 293 t/s | +26% |
| 1024 | 462 t/s | 351 t/s | 351 t/s | +32% |
| 2048 | 510 t/s | 430 t/s | 430 t/s | +19% |
| 4096 | 514 t/s | 468 t/s | 468 t/s | +10% |
| 8192 | 477 t/s | 436 t/s | 436 t/s | +9% |
| 16384 | 396 t/s | 267 t/s | 267 t/s | +48% |
Note: turbo4v2 adds zero prefill overhead; Swift turbo4v2 and Swift vanilla prefill are identical.
### Decode
| Context | Bridge + turbo4v2 | Swift + turbo4v2 |
|---|---|---|
| 128 | 59.2 t/s | 61.1 t/s |
| 256 | 58.7 t/s | 60.5 t/s |
| 512 | 56.6 t/s | 58.5 t/s |
| 1024 | 54.7 t/s | 57.4 t/s |
| 2048 | 53.4 t/s | 50.0 t/s |
| 4096 | 50.0 t/s | 52.1 t/s |
| 8192 | 44.4 t/s | 45.4 t/s |
| 16384 | 37.3 t/s | 36.9 t/s |
Decode is comparable between Bridge and Swift; both paths reach ~60 tok/s at short context and degrade gracefully to ~37 tok/s at 16K.
## TurboQuant KV Cache Compression
With Config-I, the model weights are only 87 GB, leaving ~36 GB free on a 128 GB Mac. At that point the KV cache, not the model, is the bottleneck. An 8K conversation in bf16 eats 7.9 GB of that headroom; turbo4v2 compresses it to 1.5 GB (5.3x, 81% saved), turning the remaining memory into usable context instead of wasted KV overhead. This is where stacking Config-I with turbo4v2 matters most: the smaller the model, the more context you can reclaim.
| Context | bf16 KV | turbo4v2 KV | Saved |
|---|---|---|---|
| 8K | 7.9 GB | 1.5 GB | 6.4 GB |
| 16K | 15.8 GB | 3.0 GB | 12.8 GB |
| 32K | 31.6 GB | 6.0 GB | 25.6 GB |
| 64K | 63.2 GB | 11.9 GB | 51.3 GB |
| 128K | 126.4 GB | 23.9 GB | 102.5 GB |
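The per-token KV rates implied by the table make the trade-off concrete; a quick derivation from the 8K row:

```python
# Per-token KV footprint, derived from the table's 8K row.
bf16_gb_8k, turbo_gb_8k = 7.9, 1.5
tokens = 8 * 1024

bf16_mb_per_tok = bf16_gb_8k * 1024 / tokens    # ~0.99 MB per token in bf16
turbo_mb_per_tok = turbo_gb_8k * 1024 / tokens  # ~0.19 MB per token compressed
ratio = bf16_gb_8k / turbo_gb_8k                # ~5.3x compression
```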
Max context on a 128 GB M5 Max (87 GB model, ~36 GB free), using the per-token rates from the table above:
- bf16 KV: ~37K tokens
- turbo4v2 KV: ~196K tokens (5.3x more context)
The full package: Config-I weights (62% smaller) + turbo4v2 KV (81% smaller) + Bridge prefill (5-48% faster). The PPL of 4.604 was measured with everything stacked; no additional quality penalty.
## Config-I Policy (MiniMax M2.7 Adaptation)
| Component | Bits | Layers | Rationale |
|---|---|---|---|
| Expert MLP gate/up | 2-bit | middle 58 | 98%+ of params, MoE-tolerant |
| Expert MLP down | 3-bit | middle 58 | Write-back sensitivity (Config-I finding) |
| Attention Q/K/V/O | 4-bit | middle 58 | Uniform per layer |
| Boundary (all tensors) | 8-bit | first 2 + last 2 | Boundary layer protection |
| MoE router | f16 | all | Routing precision critical |
| Embeddings + lm_head | 8-bit | - | Protected |
Uniform MLX quantization produces broken output on MiniMax (~25% MMLU, i.e. random guessing) at every bit level, because it compresses attention and routing to the same width as the expert MLPs. Config-I solves this by protecting the components that control coherence while compressing the ~98% of parameters that tolerate it.
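The policy table above amounts to a name-based routing of tensors to bit-widths. A sketch of what such a mapping could look like — the tensor-path substrings are assumptions based on common MoE naming conventions, not the exact MiniMax parameter names:

```python
def config_i_bits(name: str, layer: int, num_layers: int = 62):
    """Map a tensor to its Config-I bit-width ('f16' = left unquantized).

    num_layers=62 follows from the policy table: 2 + 58 + 2.
    Path substrings here are illustrative; real MiniMax parameter
    names may differ.
    """
    if "router" in name:
        return "f16"                  # routing precision is critical
    if "embed" in name or "lm_head" in name:
        return 8                      # embeddings and head protected
    if layer < 2 or layer >= num_layers - 2:
        return 8                      # boundary layer protection
    if "experts" in name:
        # Write-back (down) projections are more sensitive: 3-bit.
        return 3 if "down_proj" in name else 2
    if any(p in name for p in ("q_proj", "k_proj", "v_proj", "o_proj")):
        return 4                      # attention uniform 4-bit
    return 4                          # conservative default
```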
## Compatibility
| Field | Value |
|---|---|
| Format | MLX safetensors (standard) |
| Avg bits | 3.249 bpw |
| Runtime | mlx_lm (Python), mlx-swift-lm (Swift) |
| Platform | Apple Silicon (M-series Pro/Max/Ultra with 96 GB+ recommended) |
| Quantized on | 2026-04-12 |
No custom loader needed. This is standard MLX per-layer quantization. Any tool that reads MLX safetensors with config.json quantization metadata will work.
## How to Run

### Python (mlx_lm)

```bash
pip install mlx-lm
python -m mlx_lm.generate --model thetom-ai/MiniMax-M2.7-ConfigI-MLX --prompt "Hello"
```

Or from Python:

```python
from mlx_lm import load, generate

model, tokenizer = load("thetom-ai/MiniMax-M2.7-ConfigI-MLX")
print(generate(model, tokenizer, prompt="Hello", max_tokens=256, temp=1.0, top_p=0.95))
```
### Swift (mlx-swift-lm) with TurboQuant KV compression

Note: Agent connectors (Hermes, opencode, Droid) are still in progress. The Swift runtime, server, and TurboQuant KV compression all work.

For the speed and KV compression results above, use ekryski/mlx-swift-lm.

In code:

```swift
import MLXLLM

let container = try await LLMModelFactory.shared.loadContainer(
    configuration: ModelConfiguration(id: "thetom-ai/MiniMax-M2.7-ConfigI-MLX"))
let result = try await container.generate(
    input: .init(text: .init(tokens: tokenArray)),
    parameters: GenerateParameters(temperature: 1.0))
```
As an OpenAI-compatible server:

```bash
git clone https://github.com/ekryski/mlx-swift-lm.git
cd mlx-swift-lm
git checkout ek/tom-eric-moe-tuning
swift build -c release

# Download the model
hf download thetom-ai/MiniMax-M2.7-ConfigI-MLX --local-dir ~/models/MiniMax-M2.7-ConfigI-MLX

# Run the server
.build/release/MLXServer --model ~/models/MiniMax-M2.7-ConfigI-MLX --port 8080

# Test
curl http://localhost:8080/v1/chat/completions -X POST -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"Hello"}],"max_tokens":256,"temperature":1.0}'
```
Important: MiniMax M2.7 is an always-reasoning model. Use temperature=1.0; greedy decoding (temperature=0) causes infinite thinking loops.
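With the server up, any OpenAI-compatible client works. A minimal stdlib-only client sketch, mirroring the curl example's endpoint and payload (assumes the server above is listening on port 8080):

```python
import json
import urllib.request

def build_payload(prompt, max_tokens=256, temperature=1.0):
    # temperature=1.0 matters here: M2.7 is always-reasoning, and
    # greedy decoding (temperature=0) can loop in its thinking phase.
    return {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat(prompt, url="http://localhost:8080/v1/chat/completions"):
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (requires the MLXServer running locally):
# reply = chat("Hello")
```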
### Hermes AI Agent

With the MLXServer running on port 8080, add this to ~/.hermes/config.yaml:

```yaml
model:
  default: local
  provider: custom
  base_url: http://localhost:8080/v1
  context_length: 196608
```

Then run hermes. It will use whatever model is loaded on the server.
## What is Config-I?
Config-I is a tensor-role-aware weight compression policy from TurboQuant+. Through systematic A/B isolation, it was discovered that attention tensors, FFN read projections (gate/up), FFN write-back projections (down), and boundary layers have dramatically different compression sensitivity. The key insight: compression policy matters more than compression math; what counts is which tensors to compress, which to protect, and how aggressively.
Config-I achieves 27-38% size reduction at +1.0-3.9% PPL across Qwen and Phi model families (1.5B to 72B), validated by independent third-party implementations.
For MoE models like MiniMax M2.7, expert MLPs dominate parameter count but tolerate aggressive compression because only 8 of 256 experts are active per token. Config-I exploits this by compressing expert MLPs to 2-3 bit while protecting attention and routing at higher precision.
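The sparsity argument is easy to quantify; a back-of-envelope look at how little of the expert pool any one token touches:

```python
# With 8 of 256 experts active, each token routes through only a
# small fraction of expert weights, diluting the impact of 2-bit
# quantization error in any single expert.
experts_total, experts_active = 256, 8
active_fraction = experts_active / experts_total
print(f"{active_fraction:.1%} of experts touched per token")
```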
Quantized by @thetom-ai | GitHub | X | Sponsor
Base model: MiniMaxAI/MiniMax-M2.7