NVIDIA Nemotron-3-Nano-30B-A3B - TevunahAi Ultra-Hybrid GPTQ
Model Details
| Property | Value |
|---|---|
| Base Model | NVIDIA Nemotron-3-Nano-30B-A3B-BF16 |
| Architecture | HYBRID (Mamba-2 + MoE + GQA) |
| Parameters | 30B total, ~3.5B active |
| Context Length | 128K (up to 1M supported) |
| Quantization | TevunahAi Ultra-Hybrid GPTQ |
| Original Size | 60.23 GB (BF16) |
| Quantized Size | 18.39 GB |
| Compression | 68.73% reduction |
Architecture Breakdown
NVIDIA Nemotron-3-Nano uses a hybrid architecture that interleaves Mamba-2 state-space layers, sparse Mixture-of-Experts layers, and grouped-query attention:
Layer Composition (52 total layers)
23 Mamba-2 Layers: State Space Models with selective state spaces
- Linear time complexity O(n) vs O(n²) for attention
- Excellent for long-range dependencies
- Large intermediate tensors (memory intensive)
23 MoE Layers: Mixture-of-Experts
- 128 routed experts per layer (2,944 total expert modules!)
- 1 shared expert per layer (always active)
- Top-6 routing: 6 experts activated per token (see the routing sketch after this list)
- Massive capacity with sparse activation
6 GQA Attention Layers: Grouped Query Attention
- 32 attention heads, 2 key-value heads
- Strategic placement for quality
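To make the sparse-activation idea concrete, here is a minimal top-k routing sketch in PyTorch. It is illustrative only, not the model's actual routing code; the softmax-after-top-k ordering and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only; hidden size matches the spec table below.
hidden_size, num_experts, top_k = 2688, 128, 6

x = torch.randn(1, hidden_size)                     # hidden state of a single token
router = torch.nn.Linear(hidden_size, num_experts)  # router kept in full precision in this quantization

logits = router(x)                                  # score all 128 routed experts
weights, expert_ids = torch.topk(logits, top_k, dim=-1)
weights = F.softmax(weights, dim=-1)                # normalize over the 6 selected experts

# Only the selected experts (plus the always-active shared expert) run for this
# token, which is why roughly 3.5B of the 30B parameters are active per token.
print("selected experts:", expert_ids.tolist())
print("routing weights:", weights.tolist())
```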
Why This Matters
- 30B total parameters but only ~3.5B active per token
- Exceptional efficiency: near-3B inference cost, 30B capacity
- State-of-the-art reasoning: 99.2% on AIME25 (with tools)
Quantization Strategy
TevunahAi Ultra-Hybrid Mixed-Precision:
| Component | Precision | Rationale |
|---|---|---|
| Router/Gate | FP32 | Maximum routing precision - critical for MoE |
| GQA Attention (q/k/v/o_proj) | INT8 | Quality preservation for attention |
| Mamba-2 Projections (in/out_proj) | INT4 | SSM layers |
| MoE Experts (128/layer) | INT4 | Aggressive compression |
| Shared Experts | INT4 | Standard compression |
| Embeddings & LM Head | BF16 | Preserved for accuracy |
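As a rough sketch of how such a mixed-precision layout can be expressed, the snippet below uses GPTQModel's per-module `dynamic` override mechanism. The module-name regex patterns are assumptions about the NemotronH module tree, and this is not the exact configuration used for this release.

```python
from gptqmodel import QuantizeConfig

# Sketch only: base INT4 GPTQ settings from the tables in this card, with
# per-module overrides via GPTQModel's `dynamic` regex mechanism.
quantize_config = QuantizeConfig(
    bits=4,            # default: INT4 (Mamba-2 projections, MoE and shared experts)
    group_size=128,
    desc_act=True,
    sym=True,
    dynamic={
        # GQA attention projections -> INT8 for quality preservation
        r"+:.*self_attn\.(q|k|v|o)_proj.*": {"bits": 8},
        # Router/gate modules -> skipped, kept in full precision
        r"-:.*router.*": {},
    },
)
```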
Calibration
- 2048 samples (8x industry standard of 256)
- 4096 sequence length
- Premium calibration for superior quality retention
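A minimal sketch of how a calibration set of this shape could be assembled follows. The dataset choice and filtering are illustrative assumptions; the actual calibration corpus for this release is not specified here.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

NUM_SAMPLES, SEQ_LEN = 2048, 4096  # matches the calibration settings listed above

tokenizer = AutoTokenizer.from_pretrained(
    "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16", trust_remote_code=True
)

# Illustrative corpus; any broad-coverage text source can serve for GPTQ calibration.
raw = load_dataset("allenai/c4", "en", split="train", streaming=True)

calibration = []
for sample in raw:
    # Keep only documents long enough to fill a full 4096-token window.
    if len(tokenizer(sample["text"])["input_ids"]) >= SEQ_LEN:
        calibration.append(sample["text"])
    if len(calibration) == NUM_SAMPLES:
        break
```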
Performance Benchmarks
Original Model (NVIDIA benchmarks):
| Benchmark | Score |
|---|---|
| AIME25 (with tools) | 99.2% |
| AIME25 (no tools) | 89.1% |
| LiveCodeBench | 68.3% |
| SWE-Bench | 38.8% |
| RULER-100@256k | 92.9% |
| RULER-100@1M | 86.3% |
Expected Quantized Performance:
- Reasoning tasks: 97-99% of baseline
- Code generation: 96-98% of baseline
- Long context: 95-98% of baseline (Mamba advantage)
- General chat: 98-99% of baseline
Formal benchmarks are pending; inference quality has been verified manually.
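Until formal benchmarks land, a quick manual spot check is to generate greedy completions for a fixed set of prompts and compare them against the BF16 base model. A minimal sketch, assuming `model` and `tokenizer` are loaded as in the Usage section below:

```python
# Assumes `model` and `tokenizer` come from the loading code in the Usage section.
prompts = [
    "Solve step by step: what is 23 * 47?",
    "Write a Python function that reverses a linked list.",
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    # Greedy decoding keeps the comparison against the BF16 model deterministic.
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```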
Usage
GPTQModel (Recommended):
```python
import gptqmodel.models.loader as gptq_loader

# Bypass the version check for NemotronH (required for transformers > 4.48)
_original_check_versions = gptq_loader.check_versions

def patched_check_versions(model_class, require_pkgs_version):
    if 'NemotronH' in str(model_class):
        return
    return _original_check_versions(model_class, require_pkgs_version)

gptq_loader.check_versions = patched_check_versions

from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model = GPTQModel.load(
    "TevunahAi/Nemotron-3-Nano-30B-A3B-GPTQ",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "TevunahAi/Nemotron-3-Nano-30B-A3B-GPTQ",
    trust_remote_code=True
)

# Generate
prompt = "The capital of France is"
output = model.generate(
    **tokenizer(prompt, return_tensors='pt').to('cuda'),
    max_new_tokens=100
)
print(tokenizer.decode(output[0]))
```
With Chat Template:
```python
messages = [{"role": "user", "content": "Solve: What is 23 * 47?"}]

tokenized = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to('cuda')

outputs = model.generate(
    tokenized,
    max_new_tokens=1024,
    do_sample=True,   # Sampling must be enabled for temperature/top_p to take effect
    temperature=1.0,  # Recommended for reasoning
    top_p=1.0
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Disable Reasoning (faster, less accurate for hard problems):
```python
tokenized = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    enable_thinking=False,  # Disable reasoning traces
    add_generation_prompt=True,
    return_tensors="pt"
).to('cuda')

outputs = model.generate(
    tokenized,
    max_new_tokens=32,
    do_sample=False,  # Greedy for non-reasoning
    num_beams=1
)
```
Installation:
```bash
pip install gptqmodel "transformers>=4.48"
```
vLLM (Experimental):
```bash
pip install -U "vllm>=0.12.0"

vllm serve TevunahAi/Nemotron-3-Nano-30B-A3B-GPTQ \
    --max-num-seqs 8 \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --trust-remote-code
```
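Once the server is up, it exposes vLLM's OpenAI-compatible API (port 8000 by default). A minimal request sketch using the `openai` client; the port and prompt are illustrative:

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; the api_key can be any placeholder string.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="TevunahAi/Nemotron-3-Nano-30B-A3B-GPTQ",
    messages=[{"role": "user", "content": "Solve: What is 23 * 47?"}],
    max_tokens=512,
    temperature=1.0,
)
print(response.choices[0].message.content)
```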
Known Issues
- Transformers native loading: Due to the novel hybrid architecture (Mamba-2 + MoE + Attention), loading via `AutoModelForCausalLM.from_pretrained()` may encounter compatibility issues with the optimum/transformers GPTQ integration. Use GPTQModel as shown above.
- Version check bypass: The version check patch is required when using transformers > 4.48.3.
- Tokenizer regex warning: Can be safely ignored or fixed with `fix_mistral_regex=True`.
Memory Requirements
Inference (quantized model):
- Minimum: 20GB VRAM (short context)
- Recommended: 24-32GB VRAM
- For 128K context: 48GB+ VRAM
Quantization (reproduction):
- Minimum: 32GB VRAM + 128GB RAM
- Used: RTX 5000 Ada (32GB) + Dual Xeon Max 9480 (128GB HBM2e + 256GB DDR5)
Quantization Details
| Specification | Value |
|---|---|
| Method | GPTQ with Ultra-Hybrid configuration |
| Quantizer | GPTQModel 5.6.12 |
| Time | 268.2 minutes (4.5 hours) |
| Calibration Samples | 2048 (8x industry standard) |
| Sequence Length | 4096 tokens |
| Bits Per Weight | 4.29 BPW estimated |
| desc_act | True (activation ordering) |
| sym | True (symmetric quantization) |
| group_size | 128 |
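For reference, a reproduction run with GPTQModel roughly follows the load → quantize → save pattern below. This is a sketch that assumes the illustrative `quantize_config` and `calibration` objects from the earlier sketches are in scope; the exact script used for this release is not published.

```python
from gptqmodel import GPTQModel

# `quantize_config` and `calibration` are the illustrative objects built in the
# Quantization Strategy and Calibration sketches above.
model = GPTQModel.load(
    "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    quantize_config,
    trust_remote_code=True,
)
model.quantize(calibration, batch_size=1)   # ~4.5 hours on the hardware listed above
model.save("Nemotron-3-Nano-30B-A3B-GPTQ")
```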
Use Cases
Ideal for:
- Mathematical reasoning (AIME-level problems)
- Code generation & debugging (SWE-Bench capable)
- Long-context analysis (up to 1M tokens)
- Agentic applications (tool use, multi-step reasoning)
- Resource-constrained deployment (60GB → 18GB)
Technical Specifications
| Specification | Value |
|---|---|
| Model Family | NVIDIA Nemotron-3 |
| Variant | Nano-30B-A3B (Hybrid) |
| Total Parameters | 30B |
| Active Parameters | ~3.5B |
| Total Layers | 52 |
| Mamba-2 Layers | 23 |
| MoE Layers | 23 |
| Attention Layers | 6 |
| Experts per MoE Layer | 128 (+1 shared) |
| Top-K Routing | 6 |
| Hidden Size | 2688 |
| Intermediate Size | 1856 |
| Context Length | 128K (default), 1M (max) |
| Vocab Size | 131072 |
| Supported Languages | EN, DE, ES, FR, IT, JA |
License
NVIDIA Open Model License
https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
Citation
```bibtex
@software{nemotron_nano_gptq_2024,
  title  = {NVIDIA Nemotron-3-Nano-30B-A3B - TevunahAi Ultra-Hybrid GPTQ},
  author = {TevunahAi},
  year   = {2024},
  note   = {First GPTQ quantization of hybrid Mamba-2 + MoE + GQA architecture},
  url    = {https://huggingface.co/TevunahAi/Nemotron-3-Nano-30B-A3B-GPTQ}
}

@misc{nvidia_nemotron_nano_v3_2025,
  title  = {Nemotron 3 Nano: Open, Efficient MoE Hybrid Mamba-Transformer},
  author = {NVIDIA},
  year   = {2025},
  url    = {https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf}
}
```
Acknowledgments
This quantization required patching GPTQModel's NemotronH model definition to properly handle:
- Nested MoE expert module tree structure
- Dynamic expert indexing (128 experts × 23 layers)
- Mixed Mamba-2/MoE/Attention layer patterns
Special thanks to the GPTQModel team for their excellent quantization framework.
Quantized by TevunahAi
Professional AI Model Quantization Service
Specialized in hybrid architectures (Mamba, MoE, SSM)