NVIDIA Nemotron-3-Nano-30B-A3B - TevunahAi Ultra-Hybrid GPTQ

Model Details

  • Base Model: NVIDIA Nemotron-3-Nano-30B-A3B-BF16
  • Architecture: HYBRID (Mamba-2 + MoE + GQA)
  • Parameters: 30B total, ~3.5B active
  • Context Length: 128K (up to 1M supported)
  • Quantization: TevunahAi Ultra-Hybrid GPTQ
  • Original Size: 60.23 GB (BF16)
  • Quantized Size: 18.39 GB
  • Compression: 68.73% size reduction

Architecture Breakdown

NVIDIA Nemotron-3-Nano uses a hybrid architecture that interleaves Mamba-2 state-space layers, sparse Mixture-of-Experts layers, and grouped-query attention:

Layer Composition (52 total layers)

  • 23 Mamba-2 Layers: State Space Models with selective state spaces

    • Linear time complexity O(n) vs O(n²) for attention
    • Excellent for long-range dependencies
    • Large intermediate tensors (memory intensive)
  • 23 MoE Layers: Mixture-of-Experts

    • 128 routed experts per layer (2,944 total expert modules!)
    • 1 shared expert per layer (always active)
    • Top-6 routing: 6 experts activated per token (sketched after this list)
    • Massive capacity with sparse activation
  • 6 GQA Attention Layers: Grouped Query Attention

    • 32 attention heads, 2 key-value heads
    • Strategic placement for quality
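
The top-6 routing mentioned above can be illustrated in a few lines of PyTorch. This is a generic softmax-then-top-k MoE router sketch using the sizes from this card (hidden size 2688, 128 experts), not NVIDIA's exact router implementation:

import torch

# Illustrative top-6 routing over 128 experts; shapes come from this card,
# the routing scheme itself is a generic MoE sketch.
hidden_size, num_experts, top_k = 2688, 128, 6

router = torch.nn.Linear(hidden_size, num_experts, bias=False)  # router stays in FP32
token = torch.randn(1, hidden_size)                             # one token's hidden state

logits = router(token)                                          # (1, 128) routing scores
weights, expert_ids = torch.topk(logits.softmax(dim=-1), k=top_k, dim=-1)
weights = weights / weights.sum(dim=-1, keepdim=True)           # renormalize over the top-6

print(expert_ids.tolist())  # the 6 routed experts that fire for this token
# The shared expert always runs as well, so 7 of 129 expert MLPs execute per token.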

Why This Matters

  • 30B total parameters but only ~3.5B active per token
  • Exceptional efficiency: per-token compute close to that of a ~3.5B dense model, with 30B-parameter capacity (see the back-of-envelope arithmetic below)
  • State-of-the-art reasoning: 99.2% on AIME25 (with tools)
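
The efficiency claim is easy to sanity-check with the numbers already listed in this card. A back-of-envelope sketch (illustrative arithmetic only, not the model's exact parameter accounting):

moe_layers = 23
routed_experts_per_layer = 128
active_routed_experts = 6      # top-6 routing
shared_experts_per_layer = 1   # always active

total_routed_expert_modules = moe_layers * routed_experts_per_layer            # 2,944
active_expert_modules = moe_layers * (active_routed_experts + shared_experts_per_layer)

active_fraction = (active_routed_experts + shared_experts_per_layer) / (
    routed_experts_per_layer + shared_experts_per_layer
)

print(f"Routed expert modules:           {total_routed_expert_modules}")
print(f"Expert modules active per token: {active_expert_modules}")
print(f"Fraction of expert weights used: {active_fraction:.1%}")   # ~5.4%
# Only ~5% of the expert weights (plus all Mamba-2, attention, embedding and
# shared-expert weights) are touched per token, which is how a 30B-parameter
# model ends up with only ~3.5B active parameters.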

Quantization Strategy

TevunahAi Ultra-Hybrid Mixed-Precision:

  • Router/Gate: FP32 (maximum routing precision, critical for MoE)
  • GQA Attention (q/k/v/o_proj): INT8 (quality preservation for attention)
  • Mamba-2 Projections (in/out_proj): INT4 (SSM layers)
  • MoE Experts (128/layer): INT4 (aggressive compression)
  • Shared Experts: INT4 (standard compression)
  • Embeddings & LM Head: BF16 (preserved for accuracy)
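
Expressed as a GPTQModel quantization config, the recipe above would look roughly like the sketch below. It assumes GPTQModel's dynamic per-module override syntax, and the regex patterns are illustrative module names, not the exact NemotronH module paths:

from gptqmodel import QuantizeConfig

quant_config = QuantizeConfig(
    bits=4,          # default: INT4 (Mamba-2 projections, MoE experts)
    group_size=128,
    desc_act=True,   # activation ordering
    sym=True,        # symmetric quantization
    dynamic={
        # GQA attention projections: quantize at INT8 for quality preservation
        r"+:.*self_attn\.(q|k|v|o)_proj.*": {"bits": 8},
        # Router / gate modules: skip quantization so they stay in full precision
        r"-:.*(router|gate).*": {},
    },
)
# Embeddings and the LM head are not quantized by GPTQ and remain in BF16.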

Calibration

  • 2048 samples (8x industry standard of 256)
  • 4096 sequence length
  • Premium calibration for superior quality retention
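
A minimal sketch of how a calibration run like this could be reproduced with GPTQModel. The dataset choice, base-model repo id and batch size are illustrative assumptions, not the exact recipe used for this release:

from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# 2048 calibration samples; sequence-length handling (4096 tokens) is left to
# the quantizer/tokenizer defaults and omitted here for brevity.
calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(2048))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128, desc_act=True, sym=True)
# (or the mixed-precision config sketched in the Quantization Strategy section)

model = GPTQModel.load(
    "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",  # hypothetical base-model repo id
    quant_config,
    trust_remote_code=True,
)
model.quantize(calibration_dataset, batch_size=1)
model.save("Nemotron-3-Nano-30B-A3B-GPTQ")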

Performance Benchmarks

Original Model (NVIDIA benchmarks):

  • AIME25 (with tools): 99.2%
  • AIME25 (no tools): 89.1%
  • LiveCodeBench: 68.3%
  • SWE-Bench: 38.8%
  • RULER-100@256k: 92.9%
  • RULER-100@1M: 86.3%

Expected Quantized Performance:

  • Reasoning tasks: 97-99% of baseline
  • Code generation: 96-98% of baseline
  • Long context: 95-98% of baseline (Mamba advantage)
  • General chat: 98-99% of baseline

Formal benchmarks for the quantized model are pending; inference quality has been verified manually.

Usage

GPTQModel (Recommended):

import gptqmodel.models.loader as gptq_loader

# Bypass version check for NemotronH (required for transformers > 4.48)
_original_check_versions = gptq_loader.check_versions
def patched_check_versions(model_class, require_pkgs_version):
    if 'NemotronH' in str(model_class):
        return
    return _original_check_versions(model_class, require_pkgs_version)
gptq_loader.check_versions = patched_check_versions

from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model = GPTQModel.load(
    "TevunahAi/Nemotron-3-Nano-30B-A3B-GPTQ",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "TevunahAi/Nemotron-3-Nano-30B-A3B-GPTQ",
    trust_remote_code=True
)

# Generate
prompt = "The capital of France is"
output = model.generate(
    **tokenizer(prompt, return_tensors='pt').to('cuda'),
    max_new_tokens=100
)
print(tokenizer.decode(output[0]))

With Chat Template:

messages = [{"role": "user", "content": "Solve: What is 23 * 47?"}]

tokenized = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to('cuda')

outputs = model.generate(
    tokenized,
    max_new_tokens=1024,
    do_sample=True,   # sampling must be enabled for temperature/top_p to take effect
    temperature=1.0,  # Recommended for reasoning
    top_p=1.0
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Disable Reasoning (faster, less accurate for hard problems):

tokenized = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    enable_thinking=False,  # Disable reasoning traces
    add_generation_prompt=True,
    return_tensors="pt"
).to('cuda')

outputs = model.generate(
    tokenized,
    max_new_tokens=32,
    do_sample=False,  # Greedy for non-reasoning
    num_beams=1
)

Installation:

pip install gptqmodel "transformers>=4.48"

vLLM (Experimental):

pip install -U "vllm>=0.12.0"

vllm serve TevunahAi/Nemotron-3-Nano-30B-A3B-GPTQ \
    --max-num-seqs 8 \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --trust-remote-code
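
Once the server is up, it can be queried through vLLM's OpenAI-compatible endpoint. A minimal sketch; the port is vLLM's default and the sampling settings are illustrative:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="TevunahAi/Nemotron-3-Nano-30B-A3B-GPTQ",
    messages=[{"role": "user", "content": "Solve: What is 23 * 47?"}],
    max_tokens=1024,
    temperature=1.0,
)
print(response.choices[0].message.content)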

Known Issues

  • Transformers native loading: Due to the novel hybrid architecture (Mamba-2 + MoE + Attention), loading via AutoModelForCausalLM.from_pretrained() may encounter compatibility issues with optimum/transformers GPTQ integration. Use GPTQModel as shown above.
  • Version check bypass: The version check patch is required when using transformers > 4.48.3
  • Tokenizer regex warning: Can be safely ignored or fixed with fix_mistral_regex=True

Memory Requirements

Inference (quantized model):

  • Minimum: 20GB VRAM (short context)
  • Recommended: 24-32GB VRAM
  • For 128K context: 48GB+ VRAM

Quantization (reproduction):

  • Minimum: 32GB VRAM + 128GB RAM
  • Used: RTX 5000 Ada (32GB) + Dual Xeon Max 9480 (128GB HBM2e + 256GB DDR5)

Quantization Details

  • Method: GPTQ with Ultra-Hybrid configuration
  • Quantizer: GPTQModel 5.6.12
  • Time: 268.2 minutes (~4.5 hours)
  • Calibration Samples: 2048 (8x industry standard)
  • Sequence Length: 4096 tokens
  • Bits Per Weight: ~4.29 BPW (estimated)
  • desc_act: True (activation ordering)
  • sym: True (symmetric quantization)
  • group_size: 128

Use Cases

Ideal for:

  • Mathematical reasoning (AIME-level problems)
  • Code generation & debugging (SWE-Bench capable)
  • Long-context analysis (up to 1M tokens)
  • Agentic applications (tool use, multi-step reasoning)
  • Resource-constrained deployment (60GB → 18GB)

Technical Specifications

  • Model Family: NVIDIA Nemotron-3
  • Variant: Nano-30B-A3B (Hybrid)
  • Total Parameters: 30B
  • Active Parameters: ~3.5B
  • Total Layers: 52
  • Mamba-2 Layers: 23
  • MoE Layers: 23
  • Attention Layers: 6
  • Experts per MoE Layer: 128 (+1 shared)
  • Top-K Routing: 6
  • Hidden Size: 2688
  • Intermediate Size: 1856
  • Context Length: 128K (default), 1M (max)
  • Vocab Size: 131072
  • Supported Languages: EN, DE, ES, FR, IT, JA
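
These values can be cross-checked against the shipped config.json. A small sketch; the attribute names follow common transformers conventions and may differ in the NemotronH remote-code config, so getattr() is used defensively:

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "TevunahAi/Nemotron-3-Nano-30B-A3B-GPTQ",
    trust_remote_code=True,
)
for name in ("hidden_size", "num_hidden_layers", "vocab_size", "intermediate_size"):
    print(name, getattr(cfg, name, "not present under this name"))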

License

NVIDIA Open Model License
https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/

Citation

@software{nemotron_nano_gptq_2024,
  title = {NVIDIA Nemotron-3-Nano-30B-A3B - TevunahAi Ultra-Hybrid GPTQ},
  author = {TevunahAi},
  year = {2024},
  note = {First GPTQ quantization of hybrid Mamba-2 + MoE + GQA architecture},
  url = {https://huggingface.co/TevunahAi/Nemotron-3-Nano-30B-A3B-GPTQ}
}

@misc{nvidia_nemotron_nano_v3_2025,
  title  = {Nemotron 3 Nano: Open, Efficient MoE Hybrid Mamba-Transformer},
  author = {NVIDIA},
  year   = {2025},
  url    = {https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf}
}

Acknowledgments

This quantization required patching GPTQModel's NemotronH model definition to properly handle:

  • Nested MoE expert module tree structure
  • Dynamic expert indexing (128 experts × 23 layers), illustrated in the sketch after this list
  • Mixed Mamba-2/MoE/Attention layer patterns
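
To give a sense of what the dynamic expert indexing involves, the sketch below enumerates a hypothetical per-expert module list. The module path pattern is an assumed name for illustration only, not the real NemotronH module layout, and in the real model the MoE layers are interleaved with Mamba-2 and attention layers rather than contiguously indexed:

NUM_MOE_LAYERS = 23
NUM_ROUTED_EXPERTS = 128

# Hypothetical module paths; the quantizer needs every expert projection listed
# explicitly so GPTQ can visit each one during calibration.
expert_modules = [
    f"backbone.layers.{layer}.mixer.experts.{expert}.{proj}"
    for layer in range(NUM_MOE_LAYERS)
    for expert in range(NUM_ROUTED_EXPERTS)
    for proj in ("up_proj", "down_proj")
]
print(len(expert_modules))  # 5,888 projection modules across 2,944 expert MLPs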

Special thanks to the GPTQModel team for their excellent quantization framework.


Quantized by TevunahAi
Professional AI Model Quantization Service
Specialized in hybrid architectures (Mamba, MoE, SSM)

https://huggingface.co/TevunahAi
