# MiniMax-M2.7 TurboQuant+ Config-I (MLX)
93.5% MMLU at 87 GB. 61 tok/s decode. PPL 4.604. 228B-parameter MoE compressed 62% with Config-I mixed-precision quantization. Standard MLX format; works with stock mlx_lm and mlx-swift-lm. No custom loaders required.
Config-I quantization of MiniMaxAI/MiniMax-M2.7 (228.7B total, ~1.4B active per token). The policy applies aggressive 2-bit compression to expert MLPs (where MoE is most tolerant), protects attention at 4-bit, and shields boundary layers and routing at full precision. See the Config-I paper for the policy derivation.
## Compression
| | Size |
|---|---|
| FP8 source | 230 GB |
| Config-I (3.25 bpw) | 87 GB |
| Reduction | 62% |
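The headline size can be sanity-checked from the average bit-width reported in the compatibility table below (228.7B total parameters at 3.249 bpw), reading the 87 GB figure as GiB:

```python
# Back-of-envelope size check: total parameters x average bits per weight.
total_params = 228.7e9  # MiniMax-M2.7 total parameter count
avg_bpw = 3.249         # average bits per weight under the Config-I mix

size_bytes = total_params * avg_bpw / 8
size_gib = size_bytes / 2**30
reduction = 1 - size_gib / 230  # vs the 230 GB FP8 source

print(f"{size_gib:.1f} GiB, {reduction:.0%} reduction")
```

This lands at roughly 86.5 GiB and a 62% reduction, consistent with the table.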
## Quality

Perplexity: 4.604 ± 0.042 (wikitext, 50 samples, 2048-token sequences, with turbo4v2 KV compression)

MMLU (200 questions, single pass, reasoning ON):
| Subject | Score |
|---|---|
| Abstract Algebra | 18/20 |
| Anatomy | 19/20 |
| Astronomy | 19/20 |
| College CS | 18/20 |
| College Physics | 19/20 |
| HS Biology | 20/20 |
| HS Chemistry | 17/20 |
| HS Math | 20/20 |
| Logical Fallacies | 19/20 |
| World Religions | 18/20 |
| TOTAL | 187/200 (93.5%) |
Methodology: single-pass, 200 questions (10 MMLU subjects x 20), reasoning enabled, no retries, no few-shot, evaluated with mlx_lm on Apple M5 Max 128 GB.
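The grading step in a single-pass run like this is mechanical. A hypothetical sketch of the scoring logic — the "Answer: X" extraction convention is an assumption for illustration, not necessarily how this run parsed outputs:

```python
import re

def extract_choice(completion: str):
    """Pull the final A/B/C/D answer from a reasoning-style completion.

    Assumes the model ends with something like 'Answer: C'; this
    parsing convention is illustrative only.
    """
    matches = re.findall(r"[Aa]nswer\s*[:\-]?\s*\(?([ABCD])\)?", completion)
    return matches[-1] if matches else None

def score(completions, gold):
    # Single pass, no retries: an unparseable answer simply counts as wrong.
    correct = sum(extract_choice(c) == g for c, g in zip(completions, gold))
    return correct, len(gold)
```

Under this scheme 187 correct out of 200 yields the reported 93.5%.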
NIAH (Needle in a Haystack): 12/12 (100%)
| Context | 10% depth | 50% depth | 90% depth |
|---|---|---|---|
| 1.4K | ✓ | ✓ | ✓ |
| 2.4K | ✓ | ✓ | ✓ |
| 4.4K | ✓ | ✓ | ✓ |
| 8.3K | ✓ | ✓ | ✓ |
## Speed (Apple M5 Max 128 GB)
All benchmarks with turbo4v2 KV compression enabled. Measured with ekryski/mlx-swift-lm (ek/tom-eric-moe-tuning branch).
### Prefill

The "Bridge" column uses a native C++ prefill path that bypasses Swift overhead for 5-48% faster prompt processing; gains peak at +48% for 16K-token prompts, with +26-32% in the 512-1024 range.
| Context | Bridge + turbo4v2 | Swift + turbo4v2 | Swift vanilla | Bridge vs Swift (turbo4v2) |
|---|---|---|---|---|
| 128 | 199 t/s | 185 t/s | 185 t/s | +8% |
| 256 | 281 t/s | 267 t/s | 267 t/s | +5% |
| 512 | 368 t/s | 293 t/s | 293 t/s | +26% |
| 1024 | 462 t/s | 351 t/s | 351 t/s | +32% |
| 2048 | 510 t/s | 430 t/s | 430 t/s | +19% |
| 4096 | 514 t/s | 468 t/s | 468 t/s | +10% |
| 8192 | 477 t/s | 436 t/s | 436 t/s | +9% |
| 16384 | 396 t/s | 267 t/s | 267 t/s | +48% |
Note: turbo4v2 adds zero prefill overhead; Swift turbo4v2 and Swift vanilla prefill are identical.
### Decode
| Context | Bridge + turbo4v2 | Swift + turbo4v2 |
|---|---|---|
| 128 | 59.2 t/s | 61.1 t/s |
| 256 | 58.7 t/s | 60.5 t/s |
| 512 | 56.6 t/s | 58.5 t/s |
| 1024 | 54.7 t/s | 57.4 t/s |
| 2048 | 53.4 t/s | 50.0 t/s |
| 4096 | 50.0 t/s | 52.1 t/s |
| 8192 | 44.4 t/s | 45.4 t/s |
| 16384 | 37.3 t/s | 36.9 t/s |
Decode is comparable between Bridge and Swift; both paths reach ~60 tok/s at short context and degrade gracefully to ~37 tok/s at 16K.
## TurboQuant KV Cache Compression
With Config-I, the model weights are only 87 GB, leaving ~36 GB free on a 128 GB Mac. At that point the KV cache, not the model, is the bottleneck. An 8K conversation in bf16 eats 7.9 GB of that headroom; turbo4v2 compresses it to 1.5 GB (5.3x, 81% saved), turning the remaining memory into usable context instead of wasted KV overhead. This is where stacking Config-I with turbo4v2 matters most: the smaller the model, the more context you can reclaim.
| Context | bf16 KV | turbo4v2 KV | Saved |
|---|---|---|---|
| 8K | 7.9 GB | 1.5 GB | 6.4 GB |
| 16K | 15.8 GB | 3.0 GB | 12.8 GB |
| 32K | 31.6 GB | 6.0 GB | 25.6 GB |
| 64K | 63.2 GB | 11.9 GB | 51.3 GB |
| 128K | 126.4 GB | 23.9 GB | 102.5 GB |
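The per-token KV rates implied by the table make the trade-off concrete; a quick derivation from the 8K row:

```python
# Per-token KV footprint, derived from the table's 8K row.
bf16_gb_8k, turbo_gb_8k = 7.9, 1.5
tokens = 8 * 1024

bf16_mb_per_tok = bf16_gb_8k * 1024 / tokens    # ~0.99 MB per token in bf16
turbo_mb_per_tok = turbo_gb_8k * 1024 / tokens  # ~0.19 MB per token compressed
ratio = bf16_gb_8k / turbo_gb_8k                # ~5.3x compression
```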
Max context on a 128 GB M5 Max (87 GB model, ~36 GB free), using the per-token rates from the table above:
- bf16 KV: ~37K tokens
- turbo4v2 KV: ~196K tokens (5.3x more context)
The full package: Config-I weights (62% smaller) + turbo4v2 KV (81% smaller) + Bridge prefill (5-48% faster). The PPL of 4.604 was measured with everything stacked; no additional quality penalty.
## Config-I Policy (MiniMax M2.7 Adaptation)
| Component | Bits | Layers | Rationale |
|---|---|---|---|
| Expert MLP gate/up | 2-bit | middle 58 | 98%+ of params, MoE-tolerant |
| Expert MLP down | 3-bit | middle 58 | Write-back sensitivity (Config-I finding) |
| Attention Q/K/V/O | 4-bit | middle 58 | Uniform per layer |
| Boundary (all tensors) | 8-bit | first 2 + last 2 | Boundary layer protection |
| MoE router | f16 | all | Routing precision critical |
| Embeddings + lm_head | 8-bit | - | Protected |
Uniform MLX quantization produces broken output on MiniMax (~25% MMLU, i.e. random guessing) at every bit level, because it compresses attention and routing to the same width as the expert MLPs. Config-I solves this by protecting the components that control coherence while compressing the ~98% of parameters that tolerate it.
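The policy table above amounts to a name-based routing of tensors to bit-widths. A sketch of what such a mapping could look like — the tensor-path substrings are assumptions based on common MoE naming conventions, not the exact MiniMax parameter names:

```python
def config_i_bits(name: str, layer: int, num_layers: int = 62):
    """Map a tensor to its Config-I bit-width ('f16' = left unquantized).

    num_layers=62 follows from the policy table: 2 + 58 + 2.
    Path substrings here are illustrative; real MiniMax parameter
    names may differ.
    """
    if "router" in name:
        return "f16"                  # routing precision is critical
    if "embed" in name or "lm_head" in name:
        return 8                      # embeddings and head protected
    if layer < 2 or layer >= num_layers - 2:
        return 8                      # boundary layer protection
    if "experts" in name:
        # Write-back (down) projections are more sensitive: 3-bit.
        return 3 if "down_proj" in name else 2
    if any(p in name for p in ("q_proj", "k_proj", "v_proj", "o_proj")):
        return 4                      # attention uniform 4-bit
    return 4                          # conservative default
```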
## Compatibility
| Field | Value |
|---|---|
| Format | MLX safetensors (standard) |
| Avg bits | 3.249 bpw |
| Runtime | mlx_lm (Python), mlx-swift-lm (Swift) |
| Platform | Apple Silicon (M-series Pro/Max/Ultra with 96 GB+ recommended) |
| Quantized on | 2026-04-12 |
No custom loader needed. This is standard MLX per-layer quantization. Any tool that reads MLX safetensors with config.json quantization metadata will work.
## How to Run

### Python (mlx_lm)

```bash
pip install mlx-lm
python -m mlx_lm.generate --model thetom-ai/MiniMax-M2.7-ConfigI-MLX --prompt "Hello"
```

Or from Python:

```python
from mlx_lm import load, generate

model, tokenizer = load("thetom-ai/MiniMax-M2.7-ConfigI-MLX")
print(generate(model, tokenizer, prompt="Hello", max_tokens=256, temp=1.0, top_p=0.95))
```
### Swift (mlx-swift-lm) with TurboQuant KV compression

Note: Agent connectors (Hermes, opencode, Droid) are still in progress. The Swift runtime, server, and TurboQuant KV compression all work.

For the speed and KV compression results above, use ekryski/mlx-swift-lm.

In code:

```swift
import MLXLLM

let container = try await LLMModelFactory.shared.loadContainer(
    configuration: ModelConfiguration(id: "thetom-ai/MiniMax-M2.7-ConfigI-MLX"))
let result = try await container.generate(
    input: .init(text: .init(tokens: tokenArray)),
    parameters: GenerateParameters(temperature: 1.0))
```
As an OpenAI-compatible server:

```bash
git clone https://github.com/ekryski/mlx-swift-lm.git
cd mlx-swift-lm
git checkout ek/tom-eric-moe-tuning
swift build -c release

# Download the model
hf download thetom-ai/MiniMax-M2.7-ConfigI-MLX --local-dir ~/models/MiniMax-M2.7-ConfigI-MLX

# Run the server
.build/release/MLXServer --model ~/models/MiniMax-M2.7-ConfigI-MLX --port 8080

# Test
curl http://localhost:8080/v1/chat/completions -X POST -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"Hello"}],"max_tokens":256,"temperature":1.0}'
```
Important: MiniMax M2.7 is an always-reasoning model. Use temperature=1.0; greedy decoding (temperature=0) causes infinite thinking loops.
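With the server up, any OpenAI-compatible client works. A minimal stdlib-only client sketch, mirroring the curl example's endpoint and payload (assumes the server above is listening on port 8080):

```python
import json
import urllib.request

def build_payload(prompt, max_tokens=256, temperature=1.0):
    # temperature=1.0 matters here: M2.7 is always-reasoning, and
    # greedy decoding (temperature=0) can loop in its thinking phase.
    return {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat(prompt, url="http://localhost:8080/v1/chat/completions"):
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (requires the MLXServer running locally):
# reply = chat("Hello")
```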
### Hermes AI Agent

With the MLXServer running on port 8080, add this to ~/.hermes/config.yaml:

```yaml
model:
  default: local
  provider: custom
  base_url: http://localhost:8080/v1
  context_length: 196608
```

Then run hermes. It will use whatever model is loaded on the server.
## What is Config-I?
Config-I is a tensor-role-aware weight compression policy from TurboQuant+. Through systematic A/B isolation, it was discovered that attention tensors, FFN read projections (gate/up), FFN write-back projections (down), and boundary layers have dramatically different compression sensitivity. The key insight: compression policy matters more than compression math; what counts is which tensors to compress, which to protect, and how aggressively.
Config-I achieves 27-38% size reduction at +1.0-3.9% PPL across Qwen and Phi model families (1.5B to 72B), validated by independent third-party implementations.
For MoE models like MiniMax M2.7, expert MLPs dominate parameter count but tolerate aggressive compression because only 8 of 256 experts are active per token. Config-I exploits this by compressing expert MLPs to 2-3 bit while protecting attention and routing at higher precision.
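The sparsity argument is easy to quantify; a back-of-envelope look at how little of the expert pool any one token touches:

```python
# With 8 of 256 experts active, each token routes through only a
# small fraction of expert weights, diluting the impact of 2-bit
# quantization error in any single expert.
experts_total, experts_active = 256, 8
active_fraction = experts_active / experts_total
print(f"{active_fraction:.1%} of experts touched per token")
```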
Quantized by @thetom-ai | GitHub | X | Sponsor
Base model: MiniMaxAI/MiniMax-M2.7