Phi-4-mini-reasoning (BitsAndBytes 4-bit NF4 Quantized)
This repository contains a 4-bit quantized version of microsoft/Phi-4-mini-reasoning, produced with BitsAndBytes via Hugging Face Transformers.
Quantization reduces VRAM usage while preserving most of the model's reasoning capabilities.
Model Details
Model Description
- Developed by (base model): Microsoft
- Shared by (quantized version): KavinduHansaka
- Model type: Causal Language Model (decoder-only transformer)
- Context length: 128K
- Language(s): English
- License: MIT (inherited from base model)
- Finetuned from: microsoft/Phi-4-mini-reasoning
Model Sources
- Repository (quantized): KavinduHansaka/phi4-mini-bnb-4bit
- Repository (base model): microsoft/Phi-4-mini-reasoning
Uses
Direct Use
- Text and reasoning generation
- Educational and research experiments
- Running inference on lower-VRAM GPUs
Downstream Use
- Can be fine-tuned further for domain-specific reasoning tasks (see the QLoRA sketch after this list)
- Integrated into chatbots, assistants, and research pipelines
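Because the weights are already loaded in 4-bit, parameter-efficient fine-tuning (QLoRA-style) with the peft library is a natural fit. The following is a minimal sketch, not a tested recipe: the LoRA rank, alpha, dropout, and target module names (qkv_proj, o_proj) are assumptions you should verify against model.named_modules() for this architecture.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "KavinduHansaka/phi4-mini-bnb-4bit"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=bnb_config
)

# Prepare the 4-bit model for training (gradient checkpointing, layer-norm casting, etc.)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                   # LoRA rank -- illustrative, tune for your task
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["qkv_proj", "o_proj"],  # assumed module names; check model.named_modules()
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA adapters are trainable

The base 4-bit weights stay frozen; only the small LoRA adapter matrices are updated, which keeps fine-tuning within the same lower-VRAM budget as inference.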
Out-of-Scope Use
- Do not use for generating harmful, biased, or unsafe content
- Not recommended for high-accuracy production systems without further testing
Bias, Risks, and Limitations
- As with the base model, it may produce biased or incorrect content.
- Quantization may reduce numerical precision, which can slightly affect reasoning quality.
- Long-context reasoning (128k tokens) may still be resource-intensive.
Recommendations
- Apply appropriate safety filters before deploying in production.
- Be aware that outputs are not guaranteed to be factually correct.
How to Get Started with the Model
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "KavinduHansaka/phi4-mini-bnb-4bit"

# 4-bit NF4 quantization with double quantization and bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype=torch.bfloat16,
    quantization_config=bnb_config,
)

# Simple single-prompt generation
inputs = tokenizer("Explain why the sky is blue in simple terms.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
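Since the base model is tuned for step-by-step reasoning in a chat format, prompts formatted with the tokenizer's chat template typically work better than raw strings. A minimal sketch, reusing the model and tokenizer loaded above; the sampling settings are illustrative, not official recommendations.

# Chat-style prompting via the tokenizer's built-in chat template
messages = [
    {"role": "user", "content": "If 3x + 5 = 20, what is x? Show your steps."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.6)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))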
Training Details
This model inherits training data from microsoft/Phi-4-mini-reasoning. No additional fine-tuning was done.
- Quantization method: BitsAndBytes 4-bit (NF4, double quantization)
- Precision: bfloat16 compute
- Original precision: fp16
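To see how much memory the 4-bit weights actually occupy after loading, Transformers models expose get_memory_footprint(). A quick sketch, reusing the model from the quick-start code above:

# Report the in-memory size of the loaded (quantized) model
footprint_gb = model.get_memory_footprint() / 1024**3
print(f"Model memory footprint: {footprint_gb:.2f} GB")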
Technical Specifications
- Architecture: Decoder-only transformer
- Parameters: Same as Phi-4-mini-reasoning
- Quantization: 4-bit NF4
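To confirm that the linear layers were actually replaced by 4-bit modules, you can inspect the loaded model for bitsandbytes Linear4bit layers. A sketch, assuming bitsandbytes is installed and the model was loaded as shown above:

import bitsandbytes as bnb

# Count the linear layers that BitsAndBytes replaced with 4-bit NF4 modules
quantized = [name for name, module in model.named_modules()
             if isinstance(module, bnb.nn.Linear4bit)]
print(f"{len(quantized)} Linear4bit modules, e.g. {quantized[:3]}")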
Citation
If you use this quantized model, please also cite the original Microsoft release:
@misc{microsoft2025phi4mini,
  title={Phi-4-mini-reasoning},
  author={Microsoft},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/microsoft/Phi-4-mini-reasoning}}
}
Model Card Authors
- Quantized version shared by KavinduHansaka
- Base model by Microsoft
Model Card Contact
- For issues/questions with this quantized release: open a discussion on KavinduHansaka/phi4-mini-bnb-4bit.
- For base model details: see microsoft/Phi-4-mini-reasoning.