MiniMax-M2.7 TurboQuant+ Config-I (MLX)

93.5% MMLU at 87 GB. 61 tok/s decode. PPL 4.604. A 228B-parameter MoE compressed 62% with Config-I mixed-precision quantization. Standard MLX format: works with stock mlx_lm and mlx-swift-lm. No custom loaders required.

Config-I quantization of MiniMaxAI/MiniMax-M2.7 (228.7B total, ~1.4B active per token). The policy applies aggressive 2-bit compression to expert MLPs (where MoE is most tolerant), protects attention at 4-bit, and shields boundary layers and routing at full precision. See the Config-I paper for the policy derivation.

Compression

| | Size |
|---|---|
| FP8 source | 230 GB |
| Config-I (3.25 bpw) | 87 GB |
| Reduction | 62% |

Quality

Perplexity: 4.604 ± 0.042 (wikitext, 50 samples, 2048 seq length, with turbo4v2 KV compression)

MMLU (200q, single-pass, reasoning ON):

| Subject | Score |
|---|---|
| Abstract Algebra | 18/20 |
| Anatomy | 19/20 |
| Astronomy | 19/20 |
| College CS | 18/20 |
| College Physics | 19/20 |
| HS Biology | 20/20 |
| HS Chemistry | 17/20 |
| HS Math | 20/20 |
| Logical Fallacies | 19/20 |
| World Religions | 18/20 |
| **TOTAL** | **187/200 (93.5%)** |

Methodology: single-pass, 200 questions (10 MMLU subjects x 20), reasoning enabled, no retries, no few-shot, evaluated with mlx_lm on Apple M5 Max 128 GB.

NIAH (Needle in a Haystack): 12/12 (100%)

| Context | 10% depth | 50% depth | 90% depth |
|---|---|---|---|
| 1.4K | ✓ | ✓ | ✓ |
| 2.4K | ✓ | ✓ | ✓ |
| 4.4K | ✓ | ✓ | ✓ |
| 8.3K | ✓ | ✓ | ✓ |

Speed (Apple M5 Max 128 GB)

All benchmarks with turbo4v2 KV compression enabled. Measured with ekryski/mlx-swift-lm (ek/tom-eric-moe-tuning branch).

Prefill

The "Bridge" column uses a native C++ prefill path that bypasses Swift overhead for 5-48% faster prompt processing, with the biggest gains at 512-1024 token prompts and again at 16K.

| Context | Bridge + turbo4v2 | Swift + turbo4v2 | Swift vanilla | Bridge vs Swift |
|---|---|---|---|---|
| 128 | 199 t/s | 185 t/s | 185 t/s | +8% |
| 256 | 281 t/s | 267 t/s | 267 t/s | +5% |
| 512 | 368 t/s | 293 t/s | 293 t/s | +26% |
| 1024 | 462 t/s | 351 t/s | 351 t/s | +32% |
| 2048 | 510 t/s | 430 t/s | 430 t/s | +19% |
| 4096 | 514 t/s | 468 t/s | 468 t/s | +10% |
| 8192 | 477 t/s | 436 t/s | 436 t/s | +9% |
| 16384 | 396 t/s | 267 t/s | 267 t/s | +48% |

Note: turbo4v2 adds zero prefill overhead: Swift turbo4v2 and Swift vanilla prefill are identical.

Decode

| Context | Bridge + turbo4v2 | Swift + turbo4v2 |
|---|---|---|
| 128 | 59.2 t/s | 61.1 t/s |
| 256 | 58.7 t/s | 60.5 t/s |
| 512 | 56.6 t/s | 58.5 t/s |
| 1024 | 54.7 t/s | 57.4 t/s |
| 2048 | 53.4 t/s | 50.0 t/s |
| 4096 | 50.0 t/s | 52.1 t/s |
| 8192 | 44.4 t/s | 45.4 t/s |
| 16384 | 37.3 t/s | 36.9 t/s |

Decode is comparable between Bridge and Swift: both paths reach roughly 60 tok/s at short context and degrade gracefully to ~37 tok/s at 16K.

TurboQuant KV Cache Compression

With Config-I, the model weights are only 87 GB, leaving ~36 GB free on a 128 GB Mac. At that point the KV cache, not the model, is the bottleneck. An 8K-token conversation in bf16 eats 7.9 GB of that headroom; turbo4v2 compresses it to 1.5 GB (5.3x, 81% saved), turning the remaining memory into usable context instead of wasted KV overhead. This is where stacking Config-I with turbo4v2 matters most: the smaller the model, the more context you can reclaim.

| Context | bf16 KV | turbo4v2 KV | Saved |
|---|---|---|---|
| 8K | 7.9 GB | 1.5 GB | 6.4 GB |
| 16K | 15.8 GB | 3.0 GB | 12.8 GB |
| 32K | 31.6 GB | 6.0 GB | 25.6 GB |
| 64K | 63.2 GB | 11.9 GB | 51.3 GB |
| 128K | 126.4 GB | 23.9 GB | 102.5 GB |

Max context on 128 GB M5 Max (87 GB model, ~36 GB free):

  • bf16 KV: 149K tokens
  • turbo4v2 KV: 595K tokens (4x more context)
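The KV table scales linearly with context length, so it can be sanity-checked in a few lines of Python. This sketch assumes only the two numbers stated above: 7.9 GB of bf16 KV per 8K tokens and a ~5.3x turbo4v2 compression ratio:

```python
# Back-of-envelope check of the KV-cache memory table above.
# Assumptions: bf16 KV grows linearly at 7.9 GB per 8K tokens (the 8K row),
# and turbo4v2 compresses KV by ~5.3x, as stated in the text.

BF16_GB_PER_8K = 7.9
TURBO_RATIO = 5.3

def bf16_kv_gb(tokens_k: float) -> float:
    """bf16 KV cache size in GB for a context of `tokens_k` thousand tokens."""
    return BF16_GB_PER_8K * tokens_k / 8.0

def turbo_kv_gb(tokens_k: float) -> float:
    """turbo4v2 KV cache size in GB at the same context."""
    return bf16_kv_gb(tokens_k) / TURBO_RATIO

for k in (8, 16, 32, 64, 128):
    bf16, turbo = bf16_kv_gb(k), turbo_kv_gb(k)
    print(f"{k:>4}K  bf16 {bf16:6.1f} GB  turbo4v2 {turbo:5.1f} GB  saved {bf16 - turbo:6.1f} GB")
```

The 81% figure falls out of the ratio directly: 1 - 1/5.3 ≈ 0.81.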

The full package: Config-I weights (62% smaller) + turbo4v2 KV (81% smaller) + Bridge prefill (5-48% faster). The PPL of 4.604 was measured with everything stacked, so there is no additional quality penalty.

Config-I Policy (MiniMax M2.7 Adaptation)

| Component | Bits | Layers | Rationale |
|---|---|---|---|
| Expert MLP gate/up | 2-bit | middle 58 | 98%+ of params, MoE-tolerant |
| Expert MLP down | 3-bit | middle 58 | Write-back sensitivity (Config-I finding) |
| Attention Q/K/V/O | 4-bit | middle 58 | Uniform per layer |
| Boundary (all tensors) | 8-bit | first 2 + last 2 | Boundary layer protection |
| MoE router | f16 | all | Routing precision critical |
| Embeddings + lm_head | 8-bit | - | Protected |

Uniform MLX quantization produces broken output (~25% MMLU, random guessing) on MiniMax at all bit levels because it compresses attention and routing to the same bits as expert MLPs. Config-I solves this by protecting the components that control coherence while compressing the 98% of parameters that tolerate it.

Compatibility

| Field | Value |
|---|---|
| Format | MLX safetensors (standard) |
| Avg bits | 3.249 bpw |
| Runtime | mlx_lm (Python), mlx-swift-lm (Swift) |
| Platform | Apple Silicon (recommended M-series Pro/Max/Ultra with 96GB+) |
| Quantized on | 2026-04-12 |

No custom loader needed. This is standard MLX per-layer quantization. Any tool that reads MLX safetensors with config.json quantization metadata will work.
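For reference, the quantization metadata in config.json looks roughly like the fragment below. This is an illustrative shape only (default bits/group_size at the top level, per-module overrides keyed by tensor path); the exact keys and module names depend on the mlx_lm version used to convert:

```json
{
  "quantization": {
    "group_size": 64,
    "bits": 4,
    "model.layers.5.mlp.experts.0.gate_proj": { "group_size": 64, "bits": 2 },
    "model.layers.5.mlp.experts.0.down_proj": { "group_size": 64, "bits": 3 },
    "model.layers.0.self_attn.q_proj": { "group_size": 64, "bits": 8 }
  }
}
```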

How to Run

Python (mlx_lm)

```shell
pip install mlx-lm
python -m mlx_lm.generate --model thetom-ai/MiniMax-M2.7-ConfigI-MLX --prompt "Hello"
```

```python
from mlx_lm import load, generate

model, tokenizer = load("thetom-ai/MiniMax-M2.7-ConfigI-MLX")
print(generate(model, tokenizer, prompt="Hello", max_tokens=256, temp=1.0, top_p=0.95))
```

Swift (mlx-swift-lm) with TurboQuant KV compression

Note: Agent connectors (Hermes, opencode, Droid) are still in progress. The Swift runtime, server, and TurboQuant KV compression all work.

For the speed and KV compression results above, use ekryski/mlx-swift-lm.

In code:

```swift
import MLXLLM

let container = try await LLMModelFactory.shared.loadContainer(
    configuration: ModelConfiguration(id: "thetom-ai/MiniMax-M2.7-ConfigI-MLX"))

let result = try await container.generate(
    input: .init(text: .init(tokens: tokenArray)),
    parameters: GenerateParameters(temperature: 1.0))
```

As an OpenAI-compatible server:

```shell
git clone https://github.com/ekryski/mlx-swift-lm.git
cd mlx-swift-lm
git checkout ek/tom-eric-moe-tuning
swift build -c release

# Download the model
hf download thetom-ai/MiniMax-M2.7-ConfigI-MLX --local-dir ~/models/MiniMax-M2.7-ConfigI-MLX

# Run server
.build/release/MLXServer --model ~/models/MiniMax-M2.7-ConfigI-MLX --port 8080

# Test
curl http://localhost:8080/v1/chat/completions -X POST -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"Hello"}],"max_tokens":256,"temperature":1.0}'
```

Important: MiniMax M2.7 is an always-reasoning model. Use temperature=1.0; greedy decoding (temp=0) causes infinite thinking loops.

Hermes AI Agent

With the MLXServer running on port 8080, add this to ~/.hermes/config.yaml:

```yaml
model:
  default: local
  provider: custom
  base_url: http://localhost:8080/v1
  context_length: 196608
```

Then just run `hermes`. It will use whatever model is loaded on the server.

What is Config-I?

Config-I is a tensor-role-aware weight compression policy from TurboQuant+. Systematic A/B isolation showed that attention tensors, FFN read projections (gate/up), FFN write-back projections (down), and boundary layers have dramatically different compression sensitivity. The key insight: compression policy matters more than compression math. What decides quality is which tensors to compress, which to protect, and how aggressively.

Config-I achieves 27-38% size reduction at +1.0-3.9% PPL across Qwen and Phi model families (1.5B to 72B), validated by independent third-party implementations.

For MoE models like MiniMax M2.7, expert MLPs dominate parameter count but tolerate aggressive compression because only 8 of 256 experts are active per token. Config-I exploits this by compressing expert MLPs to 2-3 bit while protecting attention and routing at higher precision.


Quantized by @thetom-ai
