NVIDIA Nemotron-3-Nano-30B-A3B - TevunahAi Ultra-Hybrid GPTQ
Model Details
| Property | Value |
|---|---|
| Base Model | NVIDIA Nemotron-3-Nano-30B-A3B-BF16 |
| Architecture | HYBRID (Mamba-2 + MoE + GQA) |
| Parameters | 30B total, ~3.5B active |
| Context Length | 128K (up to 1M supported) |
| Quantization | TevunahAi Ultra-Hybrid GPTQ |
| Original Size | 60.23 GB (BF16) |
| Quantized Size | 18.39 GB |
| Compression | 68.73% reduction |
Architecture Breakdown
NVIDIA Nemotron-3-Nano uses a hybrid architecture that interleaves Mamba-2 state-space layers, sparse Mixture-of-Experts layers, and grouped-query attention:
Layer Composition (52 total layers)
23 Mamba-2 Layers: State Space Models with selective state spaces
- Linear time complexity O(n) vs O(n²) for attention
- Excellent for long-range dependencies
- Large intermediate tensors (memory intensive)
23 MoE Layers: Mixture-of-Experts
- 128 routed experts per layer (2,944 total expert modules!)
- 1 shared expert per layer (always active)
- Top-6 routing: 6 experts activated per token (see the routing sketch after this list)
- Massive capacity with sparse activation
6 GQA Attention Layers: Grouped Query Attention
- 32 attention heads, 2 key-value heads
- Strategic placement for quality
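To make the sparse-activation idea concrete, here is a minimal top-k routing sketch in PyTorch. It is illustrative only, not the model's actual routing code; the softmax-after-top-k ordering and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only; hidden size matches the spec table below.
hidden_size, num_experts, top_k = 2688, 128, 6

x = torch.randn(1, hidden_size)                     # hidden state of a single token
router = torch.nn.Linear(hidden_size, num_experts)  # router kept in full precision in this quantization

logits = router(x)                                  # score all 128 routed experts
weights, expert_ids = torch.topk(logits, top_k, dim=-1)
weights = F.softmax(weights, dim=-1)                # normalize over the 6 selected experts

# Only the selected experts (plus the always-active shared expert) run for this
# token, which is why roughly 3.5B of the 30B parameters are active per token.
print("selected experts:", expert_ids.tolist())
print("routing weights:", weights.tolist())
```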
Why This Matters
- 30B total parameters but only ~3.5B active per token
- Exceptional efficiency: near-3B inference cost, 30B capacity
- State-of-the-art reasoning: 99.2% on AIME25 (with tools)
Quantization Strategy
TevunahAi Ultra-Hybrid Mixed-Precision:
| Component | Precision | Rationale |
|---|---|---|
| Router/Gate | FP32 | Maximum routing precision - critical for MoE |
| GQA Attention (q/k/v/o_proj) | INT8 | Quality preservation for attention |
| Mamba-2 Projections (in/out_proj) | INT4 | SSM layers |
| MoE Experts (128/layer) | INT4 | Aggressive compression |
| Shared Experts | INT4 | Standard compression |
| Embeddings & LM Head | BF16 | Preserved for accuracy |
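As a rough sketch of how such a mixed-precision layout can be expressed, the snippet below uses GPTQModel's per-module `dynamic` override mechanism. The module-name regex patterns are assumptions about the NemotronH module tree, and this is not the exact configuration used for this release.

```python
from gptqmodel import QuantizeConfig

# Sketch only: base INT4 GPTQ settings from the tables in this card, with
# per-module overrides via GPTQModel's `dynamic` regex mechanism.
quantize_config = QuantizeConfig(
    bits=4,            # default: INT4 (Mamba-2 projections, MoE and shared experts)
    group_size=128,
    desc_act=True,
    sym=True,
    dynamic={
        # GQA attention projections -> INT8 for quality preservation
        r"+:.*self_attn\.(q|k|v|o)_proj.*": {"bits": 8},
        # Router/gate modules -> skipped, kept in full precision
        r"-:.*router.*": {},
    },
)
```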
Calibration
- 2048 samples (8x industry standard of 256)
- 4096 sequence length
- Premium calibration for superior quality retention
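A minimal sketch of how a calibration set of this shape could be assembled follows. The dataset choice and filtering are illustrative assumptions; the actual calibration corpus for this release is not specified here.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

NUM_SAMPLES, SEQ_LEN = 2048, 4096  # matches the calibration settings listed above

tokenizer = AutoTokenizer.from_pretrained(
    "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16", trust_remote_code=True
)

# Illustrative corpus; any broad-coverage text source can serve for GPTQ calibration.
raw = load_dataset("allenai/c4", "en", split="train", streaming=True)

calibration = []
for sample in raw:
    # Keep only documents long enough to fill a full 4096-token window.
    if len(tokenizer(sample["text"])["input_ids"]) >= SEQ_LEN:
        calibration.append(sample["text"])
    if len(calibration) == NUM_SAMPLES:
        break
```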
Performance Benchmarks
Original Model (NVIDIA benchmarks):
| Benchmark | Score |
|---|---|
| AIME25 (with tools) | 99.2% |
| AIME25 (no tools) | 89.1% |
| LiveCodeBench | 68.3% |
| SWE-Bench | 38.8% |
| RULER-100@256k | 92.9% |
| RULER-100@1M | 86.3% |
Expected Quantized Performance:
- Reasoning tasks: 97-99% of baseline
- Code generation: 96-98% of baseline
- Long context: 95-98% of baseline (Mamba advantage)
- General chat: 98-99% of baseline
Formal benchmarks are pending; inference quality has been verified manually.
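Until formal benchmarks land, a quick manual spot check is to generate greedy completions for a fixed set of prompts and compare them against the BF16 base model. A minimal sketch, assuming `model` and `tokenizer` are loaded as in the Usage section below:

```python
# Assumes `model` and `tokenizer` come from the loading code in the Usage section.
prompts = [
    "Solve step by step: what is 23 * 47?",
    "Write a Python function that reverses a linked list.",
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    # Greedy decoding keeps the comparison against the BF16 model deterministic.
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```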
Usage
GPTQModel (Recommended):
```python
import gptqmodel.models.loader as gptq_loader

# Bypass the version check for NemotronH (required for transformers > 4.48)
_original_check_versions = gptq_loader.check_versions

def patched_check_versions(model_class, require_pkgs_version):
    if 'NemotronH' in str(model_class):
        return
    return _original_check_versions(model_class, require_pkgs_version)

gptq_loader.check_versions = patched_check_versions

from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model = GPTQModel.load(
    "TevunahAi/Nemotron-3-Nano-30B-A3B-GPTQ",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "TevunahAi/Nemotron-3-Nano-30B-A3B-GPTQ",
    trust_remote_code=True
)

# Generate
prompt = "The capital of France is"
output = model.generate(
    **tokenizer(prompt, return_tensors='pt').to('cuda'),
    max_new_tokens=100
)
print(tokenizer.decode(output[0]))
```
With Chat Template:
```python
messages = [{"role": "user", "content": "Solve: What is 23 * 47?"}]

tokenized = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to('cuda')

outputs = model.generate(
    tokenized,
    max_new_tokens=1024,
    do_sample=True,   # Sampling must be enabled for temperature/top_p to take effect
    temperature=1.0,  # Recommended for reasoning
    top_p=1.0
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Disable Reasoning (faster, less accurate for hard problems):
```python
tokenized = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    enable_thinking=False,  # Disable reasoning traces
    add_generation_prompt=True,
    return_tensors="pt"
).to('cuda')

outputs = model.generate(
    tokenized,
    max_new_tokens=32,
    do_sample=False,  # Greedy for non-reasoning
    num_beams=1
)
```
Installation:
```bash
pip install gptqmodel "transformers>=4.48"
```
vLLM (Experimental):
```bash
pip install -U "vllm>=0.12.0"

vllm serve TevunahAi/Nemotron-3-Nano-30B-A3B-GPTQ \
    --max-num-seqs 8 \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --trust-remote-code
```
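Once the server is up, it exposes vLLM's OpenAI-compatible API (port 8000 by default). A minimal request sketch using the `openai` client; the port and prompt are illustrative:

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; the api_key can be any placeholder string.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="TevunahAi/Nemotron-3-Nano-30B-A3B-GPTQ",
    messages=[{"role": "user", "content": "Solve: What is 23 * 47?"}],
    max_tokens=512,
    temperature=1.0,
)
print(response.choices[0].message.content)
```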
Known Issues
- Transformers native loading: Due to the novel hybrid architecture (Mamba-2 + MoE + Attention), loading via `AutoModelForCausalLM.from_pretrained()` may encounter compatibility issues with the optimum/transformers GPTQ integration. Use GPTQModel as shown above.
- Version check bypass: The version check patch is required when using transformers > 4.48.3.
- Tokenizer regex warning: Can be safely ignored or fixed with `fix_mistral_regex=True`.
Memory Requirements
Inference (quantized model):
- Minimum: 20GB VRAM (short context)
- Recommended: 24-32GB VRAM
- For 128K context: 48GB+ VRAM
Quantization (reproduction):
- Minimum: 32GB VRAM + 128GB RAM
- Used: RTX 5000 Ada (32GB) + Dual Xeon Max 9480 (128GB HBM2e + 256GB DDR5)
Quantization Details
| Specification | Value |
|---|---|
| Method | GPTQ with Ultra-Hybrid configuration |
| Quantizer | GPTQModel 5.6.12 |
| Time | 268.2 minutes (4.5 hours) |
| Calibration Samples | 2048 (8x industry standard) |
| Sequence Length | 4096 tokens |
| Bits Per Weight | 4.29 BPW estimated |
| desc_act | True (activation ordering) |
| sym | True (symmetric quantization) |
| group_size | 128 |
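For reference, a reproduction run with GPTQModel roughly follows the load → quantize → save pattern below. This is a sketch that assumes the illustrative `quantize_config` and `calibration` objects from the earlier sketches are in scope; the exact script used for this release is not published.

```python
from gptqmodel import GPTQModel

# `quantize_config` and `calibration` are the illustrative objects built in the
# Quantization Strategy and Calibration sketches above.
model = GPTQModel.load(
    "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    quantize_config,
    trust_remote_code=True,
)
model.quantize(calibration, batch_size=1)   # ~4.5 hours on the hardware listed above
model.save("Nemotron-3-Nano-30B-A3B-GPTQ")
```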
Use Cases
Ideal for:
- Mathematical reasoning (AIME-level problems)
- Code generation & debugging (SWE-Bench capable)
- Long-context analysis (up to 1M tokens)
- Agentic applications (tool use, multi-step reasoning)
- Resource-constrained deployment (60GB → 18GB)
Technical Specifications
| Specification | Value |
|---|---|
| Model Family | NVIDIA Nemotron-3 |
| Variant | Nano-30B-A3B (Hybrid) |
| Total Parameters | 30B |
| Active Parameters | ~3.5B |
| Total Layers | 52 |
| Mamba-2 Layers | 23 |
| MoE Layers | 23 |
| Attention Layers | 6 |
| Experts per MoE Layer | 128 (+1 shared) |
| Top-K Routing | 6 |
| Hidden Size | 2688 |
| Intermediate Size | 1856 |
| Context Length | 128K (default), 1M (max) |
| Vocab Size | 131072 |
| Supported Languages | EN, DE, ES, FR, IT, JA |
License
NVIDIA Open Model License
https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
Citation
```bibtex
@software{nemotron_nano_gptq_2024,
  title  = {NVIDIA Nemotron-3-Nano-30B-A3B - TevunahAi Ultra-Hybrid GPTQ},
  author = {TevunahAi},
  year   = {2024},
  note   = {First GPTQ quantization of hybrid Mamba-2 + MoE + GQA architecture},
  url    = {https://huggingface.co/TevunahAi/Nemotron-3-Nano-30B-A3B-GPTQ}
}

@misc{nvidia_nemotron_nano_v3_2025,
  title  = {Nemotron 3 Nano: Open, Efficient MoE Hybrid Mamba-Transformer},
  author = {NVIDIA},
  year   = {2025},
  url    = {https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf}
}
```
Acknowledgments
This quantization required patching GPTQModel's NemotronH model definition to properly handle:
- Nested MoE expert module tree structure
- Dynamic expert indexing (128 experts × 23 layers)
- Mixed Mamba-2/MoE/Attention layer patterns
Special thanks to the GPTQModel team for their excellent quantization framework.
Quantized by TevunahAi
Professional AI Model Quantization Service
Specialized in hybrid architectures (Mamba, MoE, SSM)