# Nemotron Instruct Tokenizer
A drop-in replacement for the Nemotron 3 tokenizer, purpose-built for non-reasoning instruct SFT runs. The encoder, vocabulary, and special-token IDs are byte-identical to upstream NVIDIA Nemotron 3, so existing model weights load and tokenize identically. The only change is the chat template: this tokenizer never injects `<think>` or `</think>` tags anywhere, neither during message rendering nor at generation-prompt time.
## Why this exists
The upstream Nemotron 3 chat template is designed for reasoning-capable models. By default it:
- Auto-prepends `<think></think>` to assistant messages that don't already contain think tags. So if your training data is `{"role": "assistant", "content": "The answer is 42."}`, the rendered string becomes `<|im_start|>assistant\n<think></think>The answer is 42.<|im_end|>`.
- Wraps `reasoning_content` message fields in `<think>...</think>`.
- Truncates older assistant turns in multi-turn history and replaces their content with `<think></think>` stubs (controlled by `truncate_history_thinking`, default `True`).
- Emits `<|im_start|>assistant\n<think>\n` (or `<think></think>`) as the generation prompt, depending on `enable_thinking`.
For an instruct-only SFT pipeline that never trains on reasoning traces, every one of these behaviors causes problems:
- During training: the auto-prepend silently injects `<think></think>` into the loss-bearing region of every assistant turn, so the model learns to emit `<think></think>` literally, even when there is no reasoning to do.
- At inference time: vLLM rollouts on the resulting model leak stray `</think>` tokens mid-response and sometimes repeat their answer twice, because the model was conditioned on think tags it has nothing to put inside.
- The two upstream template revisions (10771-byte and 10505-byte) ship with conflicting `enable_thinking` defaults (`True` vs `False`), making it ambiguous what `tokenizer.apply_chat_template(msgs, add_generation_prompt=True)` returns without explicit kwargs.
This tokenizer removes all four behaviors. Your assistant turns render as `<|im_start|>assistant\n<content><|im_end|>` exactly. Your generation prompts end at `<|im_start|>assistant\n` exactly. No surprises.
## Compatibility guarantees
| Property | Status |
|---|---|
| `tokenizer.json` (vocab, merges, normalizer, pre-tokenizer, 1000 added_tokens) | byte-identical to `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` (sha256 `623c34567aebb18582765289fbe23d901c62704d6518d71866e0e58db892b5b7`); also byte-identical to `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16` |
| `tokenizer_config.json` | byte-identical to upstream Super 120B |
| `special_tokens_map.json` | byte-identical to upstream Super 120B |
| `chat_template.jinja` | rewritten (see below) |
| Special token IDs | unchanged: `<\|im_start\|>`=10, `<\|im_end\|>`=11, `<think>`=12, `</think>`=13, `<s>`=1, `</s>`=2, `<unk>`=0 |
| Encoder behavior | `tok.encode(text)` returns the same IDs as upstream for any input text |
| Existing Nemotron checkpoints | load and decode bit-identically; no resharding, no embedding remapping needed |
| vLLM | compatible: `tokenizer_class: PreTrainedTokenizerFast` is set; no `backend`/`is_local` keys; no `auto_map` to custom Python files |
| transformers | compatible with both 4.57.x (sfm-evals pin) and 5.x |
The `<think>` and `</think>` tokens remain in the vocabulary at their original IDs. This means the tokenizer is fully compatible with reasoning models that emit those tokens; it just doesn't inject them itself. If you need reasoning-capable rendering, use the upstream Nemotron tokenizer instead.
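These guarantees are cheap to spot-check. A minimal sketch (both repos must be downloadable; the upstream repo name is taken from the table above):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("geodesic-research/nemotron-instruct-tokenizer")
upstream = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16")

# Special-token IDs match the table above.
for token, expected_id in [("<|im_start|>", 10), ("<|im_end|>", 11),
                           ("<think>", 12), ("</think>", 13)]:
    assert tok.convert_tokens_to_ids(token) == expected_id

# Encoder parity: identical IDs for arbitrary text, think tags included.
sample = "Mixing <think>tags</think> with plain text: 2+2=4."
assert tok.encode(sample) == upstream.encode(sample)
```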
## What changed in the chat template
The chat template is the only file that differs from upstream. Six things were removed:
### 1. `<think></think>` auto-prepend on assistant content

Upstream (lines 110-119 in upstream Super):

```jinja
{%- set content = message.content | default('', true) %}
{%- if content is string -%}
    {%- if '<think>' not in content and '</think>' not in content -%}
        {%- set content = "<think></think>" ~ content -%}
    {%- endif -%}
{%- endif -%}
```

This template:

```jinja
{%- set content = message.content | default('', true) %}
```
Assistant content passes through verbatim.
### 2. `reasoning_content` → `<think>...</think>` wrapping

Upstream (lines 107-109): if a message has a `reasoning_content` field, the template wraps it in `<think>...</think>` and prepends it to the regular content.

This template: removed entirely. The `reasoning_content` field is ignored.
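A quick way to confirm the field really is dropped (a sketch reusing `tok` from the snippet above; transformers forwards extra message keys to the template untouched):

```python
msgs = [
    {"role": "user", "content": "What is 2+2?"},
    {
        "role": "assistant",
        "content": "2+2 equals 4.",
        "reasoning_content": "basic arithmetic",  # ignored by this template
    },
]
out = tok.apply_chat_template(msgs, tokenize=False)
assert "basic arithmetic" not in out  # the field is never rendered
assert "<think>" not in out           # and no wrapper tags appear either
```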
### 3. `truncate_history_thinking` logic

Upstream (lines 14, 124-140, 161-175): when `truncate_history_thinking=True` (the default), older assistant turns have their think traces stripped and replaced with `<think></think>` stubs, and their content is partially truncated.

This template: removed. Older assistant turns are kept in full, exactly as supplied. The kwarg is no longer consulted.
### 4. `enable_thinking` two-branch generation prompt

Upstream (lines 12, 203-208):

```jinja
{%- if add_generation_prompt %}
    {%- if enable_thinking %}
        {{- '<|im_start|>assistant\n<think>\n' }}
    {%- else %}
        {{- '<|im_start|>assistant\n<think></think>' }}
    {%- endif %}
{%- endif %}
```

This template:

```jinja
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}
```

The generation prompt always ends at the clean `<|im_start|>assistant\n` boundary. The `enable_thinking` kwarg is accepted but ignored.
### 5. `low_effort` reasoning-effort annotation

Upstream Super only (lines 13, 180-184): when `low_effort=True`, appends `\n\n{reasoning effort: low}` to the last user message. This signals the model to produce shorter reasoning traces.

This template: removed. The `low_effort` kwarg is accepted but ignored.
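Items 3-5 all reduce to "the kwarg is accepted but ignored", so a render with all three set should be byte-identical to the default render (a sketch reusing `tok` from above; extra kwargs to `apply_chat_template` are passed through to the template):

```python
msgs = [{"role": "user", "content": "What is 2+2?"}]
base = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
noisy = tok.apply_chat_template(
    msgs,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,            # ignored (item 4)
    truncate_history_thinking=True,  # ignored (item 3)
    low_effort=True,                 # ignored (item 5)
)
assert base == noisy
assert base.endswith("<|im_start|>assistant\n")  # clean generation boundary
```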
### 6. `last_user_idx` namespace tracking

Upstream (lines 16-22, 34-40): two scans over the message list to find the index of the last user message, used by `truncate_history_thinking` and `low_effort`.

This template: both consumers are gone, so the tracking is gone too. Saves 14 lines of dead Jinja.
## What was kept
Everything else is identical to upstream Super 120B:
- System message rendering
- Tool definitions block (`<tools>...`) with all type/parameter/required/enum handling
- Tool-call rendering inside assistant turns (`<tool_call><function=...><parameter=...>`)
- Tool response rendering (`<tool_response>...</tool_response>`)
- The `<IMPORTANT>` reminder block injected when tools are present
- User and system message framing with `<|im_start|>` / `<|im_end|>`
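To exercise the preserved tool path end to end, something like the following should work (a sketch reusing `tok` from above; the `get_weather` schema is purely illustrative, and the substring checks only assert the coarse shape described in this list):

```python
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
msgs = [{"role": "user", "content": "Weather in Oslo?"}]
out = tok.apply_chat_template(msgs, tools=tools, tokenize=False,
                              add_generation_prompt=True)
assert "<tools>" in out      # tool definitions block survives
assert "<think>" not in out  # still no injected think tags
```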
## Behavior reference
For inputs without any think tags, here's what each call produces:
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("geodesic-research/nemotron-instruct-tokenizer")

msgs = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."},
    {"role": "user", "content": "And 3+3?"},
]
```
| Call | Output (last 60 chars) |
|---|---|
| `apply_chat_template(msgs)` | `...<\|im_start\|>user\nAnd 3+3?<\|im_end\|>\n` |
| `apply_chat_template(msgs, add_generation_prompt=True)` | `...<\|im_start\|>user\nAnd 3+3?<\|im_end\|>\n<\|im_start\|>assistant\n` |
| `apply_chat_template(msgs, add_generation_prompt=True, enable_thinking=True)` | (same as above; kwarg ignored) |
| `apply_chat_template(msgs, add_generation_prompt=True, enable_thinking=False)` | (same as above) |
The full training-time render of the above messages contains zero `<think>` or `</think>` tokens. Compare with upstream Super, where the same input produces:

```
...<|im_start|>assistant\n<think></think>2+2 equals 4.<|im_end|>\n...
```

i.e. an injected `<think></think>` per assistant turn, plus a `<think>\n` opening at the generation-prompt boundary.
If your input messages contain explicit `<think>...</think>` content (e.g. you're rendering a dataset that already has reasoning traces from a teacher model), those think tags pass through verbatim. The template only refuses to inject think tags; it doesn't strip them from your input.
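For instance (a sketch reusing `tok` from above; the teacher trace is made up):

```python
msgs = [
    {"role": "user", "content": "What is 2+2?"},
    {
        "role": "assistant",
        # A pre-existing teacher trace: rendered exactly as written.
        "content": "<think>2+2=4, trivially.</think>2+2 equals 4.",
    },
]
out = tok.apply_chat_template(msgs, tokenize=False)
assert "<think>2+2=4, trivially.</think>2+2 equals 4." in out
```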
## When to use a different tokenizer
| Use case | Use this tokenizer? |
|---|---|
| Instruct SFT (no reasoning) | ✅ Yes |
| Continued pretraining (CPT) on raw text | ✅ Yes; the chat template is irrelevant |
| LoRA / DoRA fine-tuning of an instruct model | ✅ Yes |
| Reasoning / thinking SFT (e.g. with `<think>` traces in training data) | ❌ Use `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` so the generation prompt opens a `<think>\n` block |
| Tool-calling agent SFT (no reasoning) | ✅ Yes; tool rendering is preserved |
| Inference on a model that was trained with reasoning | ❌ Mismatch; the model expects to see `<think>\n` opened on the generation prompt |
| Base model evaluation | ⚠️ The chat template works but produces an empty system header for messages with no system role; use the upstream `*-Base-BF16` tokenizer for consistency with base-model conventions |
## Usage with vLLM
```bash
# Standard vLLM serve: the tokenizer is loaded from the model directory by default.
# To override with this tokenizer, pass --tokenizer:
vllm serve <your-model> \
  --tokenizer geodesic-research/nemotron-instruct-tokenizer \
  --chat-template /path/to/chat_template.jinja  # only needed to override further
```
The tokenizer ships `chat_template.jinja` as a standalone file (not embedded in `tokenizer_config.json`), which vLLM picks up automatically.
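Once the server is up, you can spot-check that a model trained with this tokenizer produces clean responses (a sketch using the OpenAI-compatible client; the model name and port are placeholders matching the serve command above):

```python
from openai import OpenAI

# `vllm serve` exposes an OpenAI-compatible endpoint on port 8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="<your-model>",  # placeholder, as in the serve command above
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
text = resp.choices[0].message.content
assert "<think>" not in text and "</think>" not in text  # no leaked think tags
```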
## Usage in training (megatron-bridge / NeMo)
In your training YAML:
```yaml
tokenizer:
  tokenizer_model: geodesic-research/nemotron-instruct-tokenizer
```
Or in the recipe definition:

```python
cfg.tokenizer.tokenizer_model = "geodesic-research/nemotron-instruct-tokenizer"
```
The data pipeline (`pipeline_data_prepare.py`) will use this tokenizer's chat template when rendering `messages` columns from HuggingFace datasets, producing packed parquets with no injected think tags.
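As a pre-flight check before launching a run, it can be worth rendering a few rows and confirming the think-token IDs (12 and 13) never appear (a minimal sketch; the sample conversation stands in for rows from your dataset):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("geodesic-research/nemotron-instruct-tokenizer")

def assert_no_think_tokens(messages):
    """Render one conversation and verify IDs 12 (<think>) / 13 (</think>) are absent."""
    ids = tok.apply_chat_template(messages, tokenize=True)
    assert 12 not in ids and 13 not in ids, "think tokens leaked into the render"

assert_no_think_tokens([
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."},
])
```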
## Provenance
- Base: `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` (revision `49ad1f46ee9df444a0a3b8b63520faa1ca66324a`)
- Encoder source: identical across the NVIDIA Nemotron 3 family (Super 120B, Nano 30B, Base variants of either); same `tokenizer.json` blob (sha256 `623c34567aebb18582765289fbe23d901c62704d6518d71866e0e58db892b5b7`)
- Chat template: derived from upstream Super 120B, with the six removals listed above
- License: NVIDIA Open Model License (inherited from upstream)