Phi-4-mini-reasoning (BitsAndBytes 4-bit NF4 Quantized)

This repository contains a 4-bit quantized version of microsoft/Phi-4-mini-reasoning, produced with BitsAndBytes via Hugging Face Transformers.
Quantization reduces VRAM usage while preserving most of the model's reasoning capabilities.


Model Details

Model Description

  • Developed by (base model): Microsoft
  • Shared by (quantized version): KavinduHansaka
  • Model type: Causal Language Model (decoder-only transformer)
  • Context length: 128K
  • Language(s): English
  • License: MIT (inherited from base model)
  • Finetuned from: microsoft/Phi-4-mini-reasoning

Model Sources

  • Base model: microsoft/Phi-4-mini-reasoning (https://huggingface.co/microsoft/Phi-4-mini-reasoning)

Uses

Direct Use

  • Text and reasoning generation
  • Educational and research experiments
  • Running inference on lower-VRAM GPUs

Downstream Use

  • Can be fine-tuned further for domain-specific reasoning tasks (see the sketch after this list)
  • Integrated into chatbots, assistants, and research pipelines
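
As a rough illustration, the quantized checkpoint can serve as the base for QLoRA-style adapter fine-tuning with the peft library. This is a minimal sketch under assumptions: the adapter hyperparameters, target modules, and training setup are illustrative, not settings used or validated for this repository.

# Illustrative QLoRA-style fine-tuning sketch (hyperparameters are assumptions).
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "KavinduHansaka/phi4-mini-bnb-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The checkpoint ships pre-quantized, so its stored quantization config is picked up automatically.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Prepare the k-bit model for training (casts norms, enables gradient checkpointing).
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters to all linear layers; "all-linear" avoids guessing module names.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# From here, train with your preferred trainer (e.g. transformers.Trainer or trl's SFTTrainer).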

Out-of-Scope Use

  • Do not use for generating harmful, biased, or unsafe content
  • Not recommended for high-accuracy production systems without further testing

Bias, Risks, and Limitations

  • As with the base model, it may produce biased or incorrect content.
  • Quantization reduces numerical precision, which can slightly degrade reasoning quality.
  • Long-context reasoning (128k tokens) may still be resource-intensive.

Recommendations

  • Apply appropriate safety filters before deploying in production.
  • Be aware that outputs are not guaranteed to be factually correct.

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "KavinduHansaka/phi4-mini-bnb-4bit"

# 4-bit NF4 weights with double quantization; matmuls are computed in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",        # place layers automatically across available GPUs/CPU
    dtype=torch.bfloat16,     # on older transformers versions, use torch_dtype instead
    quantization_config=bnb_config
)

inputs = tokenizer("Explain why the sky is blue in simple terms.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
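
The prompt above is a raw string. The base model is instruction-tuned for reasoning, so if the tokenizer ships a chat template (inherited from microsoft/Phi-4-mini-reasoning), formatting requests with it typically works better. A small sketch reusing the tokenizer and model from the example above:

# Optional: format the request with the tokenizer's chat template, if one is available.
messages = [{"role": "user", "content": "Explain why the sky is blue in simple terms."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))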

Training Details

This model inherits its training from microsoft/Phi-4-mini-reasoning; no additional fine-tuning was performed, only post-training 4-bit quantization (sketched below).

  • Quantization method: BitsAndBytes 4-bit (NF4, double quantization)
  • Precision: bfloat16 compute
  • Original precision: fp16
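
For reference, a checkpoint like this one can be produced by loading the base model with the same BitsAndBytesConfig and serializing the quantized weights. The exact script used for this repository is not published; the output directory and version notes below are assumptions.

# Illustrative sketch of the quantization step (not the exact script used for this repo).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

base_id = "microsoft/Phi-4-mini-reasoning"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    base_id,
    device_map="auto",
    quantization_config=bnb_config,
)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Saving 4-bit bitsandbytes weights requires reasonably recent transformers/bitsandbytes releases.
model.save_pretrained("phi4-mini-bnb-4bit")    # hypothetical output directory
tokenizer.save_pretrained("phi4-mini-bnb-4bit")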

Technical Specifications

  • Architecture: Decoder-only transformer
  • Parameters: Same as Phi-4-mini-reasoning
  • Quantization: 4-bit NF4

Citation

If you use this quantized model, please also cite the original Microsoft release:

@misc{microsoft2025phi4mini,
  title={Phi-4-mini-reasoning},
  author={Microsoft},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/microsoft/Phi-4-mini-reasoning}}
}

Model Card Authors

  • Quantized version shared by KavinduHansaka
  • Base model by Microsoft

Model Card Contact

For questions about the quantized release, open a discussion on the KavinduHansaka/phi4-mini-bnb-4bit repository on Hugging Face.