Instructions to use mlx-community/gemma-4-e2b-4bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use mlx-community/gemma-4-e2b-4bit with MLX:
# Download the model from the Hub
pip install huggingface_hub[hf_xet]
huggingface-cli download --local-dir gemma-4-e2b-4bit mlx-community/gemma-4-e2b-4bit
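If you prefer to do the download from Python, the same thing can be done with huggingface_hub's snapshot_download (a minimal sketch; the local_dir value just mirrors the CLI example above):

# Download mlx-community/gemma-4-e2b-4bit into ./gemma-4-e2b-4bit
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="mlx-community/gemma-4-e2b-4bit",
    local_dir="gemma-4-e2b-4bit",
)
print(f"Model files downloaded to {local_dir}")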
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
⚠️ Existing MLX quantized Gemma 4 models (mlx-community, unsloth) produce garbage output due to quantizing PLE (Per-Layer Embedding) layers.
Fixed 2 days ago during the release window: https://github.com/Blaizzy/mlx-vlm/pull/893
This works perfectly fine for me:
pip install --upgrade mlx-vlm
mlx_vlm.generate --model mlx-community/gemma-4-e2b-it-bf16 --prompt "Who are you?"
Works for bf16 but not for the quantized models. See http://github.com/jundot/omlx/issues/534 ... never mind, it seems to work after all!
Be careful, though: it used about 50 GB of RAM on my machine.
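For reference, here is a rough Python equivalent of the mlx_vlm.generate CLI call above, following the load/generate API shown in the mlx-vlm README. Treat it as a sketch: the generate() argument order has changed between versions, and cat.png is just a placeholder image path.

# Sketch of the mlx-vlm Python API (names taken from the mlx-vlm README;
# verify against your installed version before relying on exact signatures).
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/gemma-4-e2b-it-bf16"
model, processor = load(model_path)
config = load_config(model_path)

images = ["cat.png"]  # placeholder image path
prompt = apply_chat_template(processor, config, "Describe this image.", num_images=len(images))
output = generate(model, processor, prompt, images, verbose=True)
print(output)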
I also converted the REAP-pruned variants (21B and 19B) to PLE-safe MLX 4-bit; both are validated, with vision and multilingual chat working correctly.
REAP-21B (13.9 GB) actually outscores the full 26B 4-bit on several benchmarks despite being smaller.
You can also convert any Gemma 4 variant yourself using the scripts in the repo: just point convert_gemma4.py at the source model. It builds on FakeRocket543's PLE-safe quantization work.
Models: https://huggingface.co/ukint-vs/gemma-4-21b-a4b-it-REAP-MLX-4bit
Conversion scripts + benchmark results: https://github.com/ukint-vs/mlx-gemma4-reap
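To illustrate what "PLE-safe" means in practice: the conversion must leave the Per-Layer Embedding weights unquantized and only quantize the ordinary layers. Below is a minimal sketch using mlx.nn.quantize's class_predicate hook; the "per_layer" name check and the source checkpoint path are assumptions for illustration, and the actual selection logic lives in convert_gemma4.py in the repo above.

import mlx.nn as nn
from mlx_vlm import load

# Load an unquantized source checkpoint (placeholder path -- point this at the
# bf16 model you want to convert).
model, processor = load("mlx-community/gemma-4-e2b-it-bf16")

def ple_safe_predicate(path, module):
    # Skip anything whose path looks like a Per-Layer Embedding module (the
    # name check is an assumption); quantize regular Linear/Embedding layers.
    if "per_layer" in path:
        return False
    return isinstance(module, (nn.Linear, nn.Embedding))

# 4-bit quantization that leaves the PLE weights in full precision.
nn.quantize(model, group_size=64, bits=4, class_predicate=ple_safe_predicate)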