# Nemotron Instruct Tokenizer
A drop-in replacement for the Nemotron 3 tokenizer, purpose-built for non-reasoning instruct SFT runs. The encoder, vocabulary, and special-token IDs are byte-identical to upstream NVIDIA Nemotron 3, so existing model weights load and tokenize identically. The only change is the chat template: this tokenizer never injects `<think>` or `</think>` tags anywhere, neither during message rendering nor at generation-prompt time.
## Why this exists
The upstream Nemotron 3 chat template is designed for reasoning-capable models. By default it:
- Auto-prepends `<think></think>` to assistant messages that don't already contain think tags. So if your training data is `{"role": "assistant", "content": "The answer is 42."}`, the rendered string becomes `<|im_start|>assistant\n<think></think>The answer is 42.<|im_end|>`.
- Wraps `reasoning_content` message fields in `<think>...</think>`.
- Truncates older assistant turns in multi-turn history and replaces their content with `<think></think>` stubs (controlled by `truncate_history_thinking`, default `True`).
- Emits `<|im_start|>assistant\n<think>\n` (or `<think></think>`) as the generation prompt, depending on `enable_thinking`.
For an instruct-only SFT pipeline that never trains on reasoning traces, every one of these behaviors causes problems:
- During training: the auto-prepend silently injects `<think></think>` into the loss-bearing region of every assistant turn, so the model learns to emit `<think></think>` literally, even when there is no reasoning to do.
- At inference time: vLLM rollouts on the resulting model leak stray `</think>` tokens mid-response and sometimes repeat their answer twice, because the model was conditioned on think tags it has nothing to put inside.
- The two upstream template revisions (10771-byte and 10505-byte) ship with conflicting `enable_thinking` defaults (`True` vs `False`), making it ambiguous what `tokenizer.apply_chat_template(msgs, add_generation_prompt=True)` returns without explicit kwargs.
This tokenizer removes all four behaviors. Your assistant turns render as `<|im_start|>assistant\n<content><|im_end|>` exactly. Your generation prompts end at `<|im_start|>assistant\n` exactly. No surprises.
## Compatibility guarantees
| Property | Status |
|---|---|
| `tokenizer.json` (vocab, merges, normalizer, pre-tokenizer, 1000 added_tokens) | byte-identical to `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` (sha256 `623c34567aebb18582765289fbe23d901c62704d6518d71866e0e58db892b5b7`); also byte-identical to `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16` |
| `tokenizer_config.json` | byte-identical to upstream Super 120B |
| `special_tokens_map.json` | byte-identical to upstream Super 120B |
| `chat_template.jinja` | rewritten (see below) |
| Special token IDs | unchanged: `<\|im_start\|>`=10, `<\|im_end\|>`=11, `<think>`=12, `</think>`=13, `<s>`=1, `</s>`=2, `<unk>`=0 |
| Encoder behavior | `tok.encode(text)` returns the same IDs as upstream for any input text |
| Existing Nemotron checkpoints | load and decode bit-identically; no resharding, no embedding remapping needed |
| vLLM | compatible: `tokenizer_class: PreTrainedTokenizerFast` is set; no `backend`/`is_local` keys; no `auto_map` to custom Python files |
| transformers | compatible with both 4.57.x (sfm-evals pin) and 5.x |
The `<think>` and `</think>` tokens remain in the vocabulary at their original IDs. This means the tokenizer is fully compatible with reasoning models that emit those tokens; it just doesn't inject them itself. If you need reasoning-capable rendering, use the upstream Nemotron tokenizer instead.
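These guarantees are cheap to spot-check. A minimal sketch (both repos must be downloadable; the upstream repo name is taken from the table above):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("geodesic-research/nemotron-instruct-tokenizer")
upstream = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16")

# Special-token IDs match the table above.
for token, expected_id in [("<|im_start|>", 10), ("<|im_end|>", 11),
                           ("<think>", 12), ("</think>", 13)]:
    assert tok.convert_tokens_to_ids(token) == expected_id

# Encoder parity: identical IDs for arbitrary text, think tags included.
sample = "Mixing <think>tags</think> with plain text: 2+2=4."
assert tok.encode(sample) == upstream.encode(sample)
```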
## What changed in the chat template
The chat template is the only file that differs from upstream. Six things were removed:
### 1. `<think></think>` auto-prepend on assistant content

Upstream (lines 110-119 in upstream Super):

```jinja
{%- set content = message.content | default('', true) %}
{%- if content is string -%}
    {%- if '<think>' not in content and '</think>' not in content -%}
        {%- set content = "<think></think>" ~ content -%}
    {%- endif -%}
{%- endif -%}
```

This template:

```jinja
{%- set content = message.content | default('', true) %}
```
Assistant content passes through verbatim.
### 2. `reasoning_content` → `<think>...</think>` wrapping

Upstream (lines 107-109): if a message has a `reasoning_content` field, the template wraps it in `<think>...</think>` and prepends it to the regular content.

This template: removed entirely. The `reasoning_content` field is ignored.
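A quick way to confirm the field really is dropped (a sketch reusing `tok` from the snippet above; transformers forwards extra message keys to the template untouched):

```python
msgs = [
    {"role": "user", "content": "What is 2+2?"},
    {
        "role": "assistant",
        "content": "2+2 equals 4.",
        "reasoning_content": "basic arithmetic",  # ignored by this template
    },
]
out = tok.apply_chat_template(msgs, tokenize=False)
assert "basic arithmetic" not in out  # the field is never rendered
assert "<think>" not in out           # and no wrapper tags appear either
```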
### 3. `truncate_history_thinking` logic

Upstream (lines 14, 124-140, 161-175): when `truncate_history_thinking=True` (the default), older assistant turns have their think traces stripped and replaced with `<think></think>` stubs, and their content is partially truncated.

This template: removed. Older assistant turns are kept in full, exactly as supplied. The kwarg is no longer consulted.
### 4. `enable_thinking` two-branch generation prompt

Upstream (lines 12, 203-208):

```jinja
{%- if add_generation_prompt %}
    {%- if enable_thinking %}
        {{- '<|im_start|>assistant\n<think>\n' }}
    {%- else %}
        {{- '<|im_start|>assistant\n<think></think>' }}
    {%- endif %}
{%- endif %}
```

This template:

```jinja
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}
```

The generation prompt always ends at the clean `<|im_start|>assistant\n` boundary. The `enable_thinking` kwarg is accepted but ignored.
### 5. `low_effort` reasoning-effort annotation

Upstream Super only (lines 13, 180-184): when `low_effort=True`, appends `\n\n{reasoning effort: low}` to the last user message. This signals the model to produce shorter reasoning traces.

This template: removed. The `low_effort` kwarg is accepted but ignored.
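Items 3-5 all reduce to "the kwarg is accepted but ignored", so a render with all three set should be byte-identical to the default render (a sketch reusing `tok` from above; extra kwargs to `apply_chat_template` are passed through to the template):

```python
msgs = [{"role": "user", "content": "What is 2+2?"}]
base = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
noisy = tok.apply_chat_template(
    msgs,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,            # ignored (item 4)
    truncate_history_thinking=True,  # ignored (item 3)
    low_effort=True,                 # ignored (item 5)
)
assert base == noisy
assert base.endswith("<|im_start|>assistant\n")  # clean generation boundary
```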
### 6. `last_user_idx` namespace tracking

Upstream (lines 16-22, 34-40): two scans over the message list to find the index of the last user message, used by `truncate_history_thinking` and `low_effort`.

This template: both consumers are gone, so the tracking is gone too. Saves 14 lines of dead Jinja.
## What was kept
Everything else is identical to upstream Super 120B:
- System message rendering
- Tool definitions block (`<tools>...`) with all type/parameter/required/enum handling
- Tool-call rendering inside assistant turns (`<tool_call><function=...><parameter=...>`)
- Tool response rendering (`<tool_response>...</tool_response>`)
- The `<IMPORTANT>` reminder block injected when tools are present
- User and system message framing with `<|im_start|>` / `<|im_end|>`
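To exercise the preserved tool path end to end, something like the following should work (a sketch reusing `tok` from above; the `get_weather` schema is purely illustrative, and the substring checks only assert the coarse shape described in this list):

```python
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
msgs = [{"role": "user", "content": "Weather in Oslo?"}]
out = tok.apply_chat_template(msgs, tools=tools, tokenize=False,
                              add_generation_prompt=True)
assert "<tools>" in out      # tool definitions block survives
assert "<think>" not in out  # still no injected think tags
```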
## Behavior reference
For inputs without any think tags, here's what each call produces:
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("geodesic-research/nemotron-instruct-tokenizer")

msgs = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."},
    {"role": "user", "content": "And 3+3?"},
]
```
| Call | Output (last 60 chars) |
|---|---|
| `apply_chat_template(msgs)` | `...<\|im_start\|>user\nAnd 3+3?<\|im_end\|>\n` |
| `apply_chat_template(msgs, add_generation_prompt=True)` | `...<\|im_start\|>user\nAnd 3+3?<\|im_end\|>\n<\|im_start\|>assistant\n` |
| `apply_chat_template(msgs, add_generation_prompt=True, enable_thinking=True)` | (same as above; kwarg ignored) |
| `apply_chat_template(msgs, add_generation_prompt=True, enable_thinking=False)` | (same as above) |
The full training-time render of the above messages contains zero `<think>` or `</think>` tokens. Compare with upstream Super, where the same input produces:

```
...<|im_start|>assistant\n<think></think>2+2 equals 4.<|im_end|>\n...
```

i.e. an injected `<think></think>` per assistant turn, plus a `<think>\n` opening at the generation-prompt boundary.
If your input messages contain explicit `<think>...</think>` content (e.g. you're rendering a dataset that already has reasoning traces from a teacher model), those think tags pass through verbatim. The template only refuses to inject think tags; it doesn't strip them from your input.
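For instance (a sketch reusing `tok` from above; the teacher trace is made up):

```python
msgs = [
    {"role": "user", "content": "What is 2+2?"},
    {
        "role": "assistant",
        # A pre-existing teacher trace: rendered exactly as written.
        "content": "<think>2+2=4, trivially.</think>2+2 equals 4.",
    },
]
out = tok.apply_chat_template(msgs, tokenize=False)
assert "<think>2+2=4, trivially.</think>2+2 equals 4." in out
```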
## When to use a different tokenizer
| Use case | Use this tokenizer? |
|---|---|
| Instruct SFT (no reasoning) | ✅ Yes |
| Continued pretraining (CPT) on raw text | ✅ Yes; the chat template is irrelevant |
| LoRA / DoRA fine-tuning of an instruct model | ✅ Yes |
| Reasoning / thinking SFT (e.g. with `<think>` traces in training data) | ❌ Use `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` so the generation prompt opens a `<think>\n` block |
| Tool-calling agent SFT (no reasoning) | ✅ Yes; tool rendering is preserved |
| Inference on a model that was trained with reasoning | ❌ Mismatch; the model expects to see `<think>\n` opened on the generation prompt |
| Base model evaluation | ⚠️ The chat template works but produces an empty system header for messages with no system role; use the upstream `*-Base-BF16` tokenizer for consistency with base-model conventions |
## Usage with vLLM
```bash
# Standard vLLM serve: the tokenizer is loaded from the model directory by default.
# To override with this tokenizer, pass --tokenizer:
vllm serve <your-model> \
  --tokenizer geodesic-research/nemotron-instruct-tokenizer \
  --chat-template /path/to/chat_template.jinja  # only needed to override further
```
The tokenizer ships `chat_template.jinja` as a standalone file (not embedded in `tokenizer_config.json`), which vLLM picks up automatically.
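Once the server is up, you can spot-check that a model trained with this tokenizer produces clean responses (a sketch using the OpenAI-compatible client; the model name and port are placeholders matching the serve command above):

```python
from openai import OpenAI

# `vllm serve` exposes an OpenAI-compatible endpoint on port 8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="<your-model>",  # placeholder, as in the serve command above
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
text = resp.choices[0].message.content
assert "<think>" not in text and "</think>" not in text  # no leaked think tags
```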
## Usage in training (megatron-bridge / NeMo)
In your training YAML:
```yaml
tokenizer:
  tokenizer_model: geodesic-research/nemotron-instruct-tokenizer
```
Or in the recipe definition:

```python
cfg.tokenizer.tokenizer_model = "geodesic-research/nemotron-instruct-tokenizer"
```
The data pipeline (`pipeline_data_prepare.py`) will use this tokenizer's chat template when rendering `messages` columns from HuggingFace datasets, producing packed parquets with no injected think tags.
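As a pre-flight check before launching a run, it can be worth rendering a few rows and confirming the think-token IDs (12 and 13) never appear (a minimal sketch; the sample conversation stands in for rows from your dataset):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("geodesic-research/nemotron-instruct-tokenizer")

def assert_no_think_tokens(messages):
    """Render one conversation and verify IDs 12 (<think>) / 13 (</think>) are absent."""
    ids = tok.apply_chat_template(messages, tokenize=True)
    assert 12 not in ids and 13 not in ids, "think tokens leaked into the render"

assert_no_think_tokens([
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."},
])
```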
## Provenance
- Base: `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` (revision `49ad1f46ee9df444a0a3b8b63520faa1ca66324a`)
- Encoder source: identical across the NVIDIA Nemotron 3 family (Super 120B, Nano 30B, Base variants of either); same `tokenizer.json` blob (sha256 `623c34567aebb18582765289fbe23d901c62704d6518d71866e0e58db892b5b7`)
- Chat template: derived from upstream Super 120B, with the six removals listed above
- License: NVIDIA Open Model License (inherited from upstream)