Nemotron Instruct Tokenizer

A drop-in replacement for the Nemotron 3 tokenizer, purpose-built for non-reasoning instruct SFT runs. The encoder, vocabulary, and special-token IDs are byte-identical to upstream NVIDIA Nemotron 3, so existing model weights load and tokenize identically. The only change is the chat template: this tokenizer never injects <think> or </think> tags anywhere — neither during message rendering nor at generation-prompt time.

Why this exists

The upstream Nemotron 3 chat template is designed for reasoning-capable models. By default it:

  1. Auto-prepends <think></think> to assistant messages that don't already contain think tags. So if your training data is {"role": "assistant", "content": "The answer is 42."}, the rendered string becomes <|im_start|>assistant\n<think></think>The answer is 42.<|im_end|>.
  2. Wraps reasoning_content message fields in <think>...</think>.
  3. Truncates older assistant turns in multi-turn history and replaces their content with <think></think> stubs (controlled by truncate_history_thinking, default True).
  4. Emits <|im_start|>assistant\n<think>\n (or <think></think>) as the generation prompt depending on enable_thinking.

For an instruct-only SFT pipeline that never trains on reasoning traces, every one of these behaviors causes problems:

  • During training: the auto-prepend silently injects <think></think> into the loss-bearing region of every assistant turn, so the model learns to emit <think></think> literally — even when there's no reasoning to do.
  • At inference time: vLLM rollouts on the resulting model leak stray </think> tokens mid-response and sometimes emit the answer twice, because the model was conditioned on think tags it has nothing to put inside.
  • The two upstream template revisions (10771-byte and 10505-byte) ship with conflicting enable_thinking defaults (True vs False), making it ambiguous what tokenizer.apply_chat_template(msgs, add_generation_prompt=True) returns without explicit kwargs.

This tokenizer removes all four behaviors. Your assistant turns render as <|im_start|>assistant\n<content><|im_end|> exactly. Your generation prompts end at <|im_start|>assistant\n exactly. No surprises.

Compatibility guarantees

  • tokenizer.json (vocab, merges, normalizer, pre-tokenizer, 1000 added_tokens): byte-identical to nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 (sha256 623c34567aebb18582765289fbe23d901c62704d6518d71866e0e58db892b5b7) — also byte-identical to nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16
  • tokenizer_config.json: byte-identical to upstream Super 120B
  • special_tokens_map.json: byte-identical to upstream Super 120B
  • chat_template.jinja: rewritten (see below)
  • Special token IDs: unchanged (<|im_start|>=10, <|im_end|>=11, <think>=12, </think>=13, <s>=1, </s>=2, <unk>=0)
  • Encoder behavior: tok.encode(text) returns the same IDs as upstream for any input text
  • Existing Nemotron checkpoints: load and decode bit-identically — no resharding, no embedding remapping needed
  • vLLM: compatible. tokenizer_class: PreTrainedTokenizerFast is set; no backend/is_local keys; no auto_map to custom Python files
  • transformers: compatible with both 4.57.x (sfm-evals pin) and 5.x

The <think> and </think> tokens remain in the vocabulary at their original IDs. This means the tokenizer is fully compatible with reasoning models that emit those tokens — it just doesn't inject them itself. If you need reasoning-capable rendering, use the upstream Nemotron tokenizer instead.
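
A quick check of the unchanged IDs listed above:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("geodesic-research/nemotron-instruct-tokenizer")

# Special-token IDs are inherited unchanged from upstream Nemotron 3.
assert tok.convert_tokens_to_ids("<|im_start|>") == 10
assert tok.convert_tokens_to_ids("<|im_end|>") == 11
assert tok.convert_tokens_to_ids("<think>") == 12
assert tok.convert_tokens_to_ids("</think>") == 13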

What changed in the chat template

The chat template is the only file that differs from upstream. Six things were removed:

1. <think></think> auto-prepend on assistant content

Upstream (lines 110-119 in upstream Super):

{%- set content = message.content | default('', true) %}
{%- if content is string -%}
    {%- if '<think>' not in content and '</think>' not in content -%}
        {%- set content = "<think></think>" ~ content -%}
    {%- endif -%}
{%- endif -%}

This template:

{%- set content = message.content | default('', true) %}

Assistant content passes through verbatim.

2. reasoning_content → <think>...</think> wrapping

Upstream (lines 107-109): if a message has a reasoning_content field, the template wraps it in <think>...</think> and prepends to the regular content.

This template: removed entirely. The reasoning_content field is ignored.
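
As an illustration (a hedged sketch; the message content here is made up), a turn carrying a reasoning_content field renders with that field dropped:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("geodesic-research/nemotron-instruct-tokenizer")
msgs = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4.", "reasoning_content": "2+2=4"},
]
print(tok.apply_chat_template(msgs, tokenize=False))
# The assistant turn renders as <|im_start|>assistant\n4.<|im_end|>;
# upstream would instead prepend <think>2+2=4</think> to the content.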

3. truncate_history_thinking logic

Upstream (lines 14, 124-140, 161-175): when truncate_history_thinking=True (the default), older assistant turns have their think traces stripped and replaced with <think></think> stubs, and their content is partially truncated.

This template: removed. Older assistant turns are kept in full, exactly as supplied. The kwarg is no longer consulted.

4. enable_thinking two-branch generation prompt

Upstream (lines 12, 203-208):

{%- if add_generation_prompt %}
    {%- if enable_thinking %}
        {{- '<|im_start|>assistant\n<think>\n' }}
    {%- else %}
        {{- '<|im_start|>assistant\n<think></think>' }}
    {%- endif %}
{%- endif %}

This template:

{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}

Generation prompt always ends at the clean <|im_start|>assistant\n boundary. The enable_thinking kwarg is accepted but ignored.

5. low_effort reasoning-effort annotation

Upstream Super only (lines 13, 180-184): when low_effort=True, appends \n\n{reasoning effort: low} to the last user message. This signals the model to produce shorter reasoning traces.

This template: removed. The low_effort kwarg is accepted but ignored.

6. last_user_idx namespace tracking

Upstream (lines 16-22, 34-40): two scans over the message list to find the last user message index. Used by truncate_history_thinking and low_effort.

This template: both consumers removed, so the tracking is gone too. Saves 14 lines of dead Jinja.

What was kept

Everything else is identical to upstream Super 120B:

  • System message rendering
  • Tool definitions block (<tools>...) with all type/parameter/required/enum handling
  • Tool-call rendering inside assistant turns (<tool_call><function=...><parameter=...>)
  • Tool response rendering (<tool_response>...</tool_response>)
  • The <IMPORTANT> reminder block injected when tools are present
  • User and system message framing with <|im_start|> / <|im_end|>
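
To inspect the preserved tool rendering, you can pass a tool schema through the standard tools kwarg of apply_chat_template and print the result (the schema below is a made-up example; the exact <tools> / <tool_call> framing comes from the upstream template and is not reproduced here):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("geodesic-research/nemotron-instruct-tokenizer")

# Made-up tool schema, used only to exercise the kept tool-definition block.
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

print(tok.apply_chat_template(
    [{"role": "user", "content": "Weather in Paris?"}],
    tools=[get_weather],
    add_generation_prompt=True,
    tokenize=False,
))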

Behavior reference

For inputs without any think tags, here's what each call produces:

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("geodesic-research/nemotron-instruct-tokenizer")

msgs = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."},
    {"role": "user", "content": "And 3+3?"},
]

Call → output (last 60 chars):

  • apply_chat_template(msgs) → ...<|im_start|>user\nAnd 3+3?<|im_end|>\n
  • apply_chat_template(msgs, add_generation_prompt=True) → ...<|im_start|>user\nAnd 3+3?<|im_end|>\n<|im_start|>assistant\n
  • apply_chat_template(msgs, add_generation_prompt=True, enable_thinking=True) → same as above (kwarg ignored)
  • apply_chat_template(msgs, add_generation_prompt=True, enable_thinking=False) → same as above
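
Continuing the snippet above, both claims can be checked directly:

rendered = tok.apply_chat_template(msgs, add_generation_prompt=True, tokenize=False)

# Generation prompt ends at the clean assistant boundary, with no think tags anywhere.
assert rendered.endswith("<|im_start|>assistant\n")
assert "<think>" not in rendered and "</think>" not in rendered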

The full training-time render of the above messages contains zero <think> or </think> tokens. Compare with upstream Super, where the same input produces:

...<|im_start|>assistant\n<think></think>2+2 equals 4.<|im_end|>\n...

i.e. an injected <think></think> per assistant turn, plus a <think>\n opening at the generation-prompt boundary.

If your input messages contain explicit <think>...</think> content (e.g. you're rendering a dataset that already has reasoning traces from a teacher model), those think tags pass through verbatim. The template only refuses to inject think tags; it doesn't strip them from your input.
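
For example, continuing with tok from above (the trace content is made up):

traced = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "<think>2 and 2 make 4.</think>The answer is 4."},
]
out = tok.apply_chat_template(traced, tokenize=False)

# Explicit think tags supplied in the content are kept verbatim.
assert "<think>2 and 2 make 4.</think>The answer is 4." in out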

When to use a different tokenizer

  • Instruct SFT (no reasoning): ✅ Yes
  • Continued pretraining (CPT) on raw text: ✅ Yes — chat template is irrelevant
  • LoRA / DoRA fine-tuning of an instruct model: ✅ Yes
  • Reasoning / thinking SFT (e.g. with <think> traces in training data): ❌ Use nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 so the generation prompt opens a <think>\n block
  • Tool-calling agent SFT (no reasoning): ✅ Yes — tool rendering is preserved
  • Inference on a model that was trained with reasoning: ❌ Mismatch — the model expects to see <think>\n opened on the generation prompt
  • Base model evaluation: ⚠️ The chat template will work but produces an empty system header for messages with no system role; use the upstream *-Base-BF16 tokenizer for consistency with base-model conventions

Usage with vLLM

# Standard vLLM serve — tokenizer is loaded from the model directory by default.
# To override with this tokenizer, pass --tokenizer:
vllm serve <your-model> \
    --tokenizer geodesic-research/nemotron-instruct-tokenizer \
    --chat-template /path/to/chat_template.jinja  # only needed if you want to override further

The tokenizer ships chat_template.jinja as a file (not embedded in tokenizer_config.json), which vLLM picks up automatically.
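
Once the server is up, requests go through the usual OpenAI-compatible endpoint and the chat template is applied server-side; a minimal sketch (host, port, and model name are placeholders):

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="<your-model>",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
print(resp.choices[0].message.content)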

Usage in training (megatron-bridge / NeMo)

In your training YAML:

tokenizer:
  tokenizer_model: geodesic-research/nemotron-instruct-tokenizer

Or in the recipe definition:

cfg.tokenizer.tokenizer_model = "geodesic-research/nemotron-instruct-tokenizer"

The data pipeline (pipeline_data_prepare.py) will use this tokenizer's chat template when rendering messages columns from HuggingFace datasets, producing packed parquets with no injected think tags.
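
A rough sketch of what that rendering step amounts to (not the actual pipeline_data_prepare.py code; the dataset name and column layout are assumptions):

from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("geodesic-research/nemotron-instruct-tokenizer")
ds = load_dataset("your-org/your-sft-dataset", split="train")  # placeholder dataset

def render(example):
    # Render the messages column with this tokenizer's chat template;
    # no <think></think> is injected into the assistant turns.
    return {"text": tok.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(render, remove_columns=ds.column_names)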

Provenance

  • Base: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 (revision 49ad1f46ee9df444a0a3b8b63520faa1ca66324a)
  • Encoder source: identical to NVIDIA Nemotron 3 family (Super 120B, Nano 30B, Base variants of either) — same tokenizer.json blob (sha256 623c34567aebb18582765289fbe23d901c62704d6518d71866e0e58db892b5b7)
  • Chat template: derived from upstream Super 120B, with the six removals listed above
  • License: NVIDIA Open Model License (inherited from upstream)