# nemotron-base-tokenizer
A drop-in tokenizer for continued pretraining (CPT) on the NVIDIA Nemotron-3
Base models (Super-120B-A12B-Base-BF16, Nano-30B-A3B-Base-BF16).
It is identical to
`geodesic-research/nemotron-instruct-tokenizer`
in vocabulary and BPE merges, but corrects the special-token bindings so that
`tokenizer.eos_token_id == 2` (`</s>`), the actual document separator the
NVIDIA Nemotron Base models were pretrained with.
## Why this exists
NVIDIA ships their Base model checkpoints (e.g.
`nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-Base-BF16`) with a `tokenizer_config.json`
that declares `eos_token: "<|im_end|>"` (id 11). That assignment is correct
for the chat / instruct variant, but `<|im_end|>` was never trained during
pretraining: its embedding row is exactly zero in the Base checkpoint.
If you tokenize a CPT corpus with `--append-eod` against the upstream tokenizer,
every document ends with token id 11. A Megatron CPT run on the Base checkpoint
then sees frequent occurrences of a token whose embedding is the zero vector,
and the very first backward pass overflows BF16 in the embedding gradient
bucket, surfacing as:

```
RuntimeError: Rank N, ..., iteration 2: Unexpected result inf
(message='found Inf in local grad norm for bucket #0 in backward pass
before data-parallel communication collective')
```
The error is deterministic, reproducible across reruns, and unaffected by LR / PAO / DDP-overlap mitigations because the cause is upstream of the optimizer.
This tokenizer fixes the issue at the source: `eos_token_id` resolves to 2
(`</s>`), which the Base models were trained on. Megatron's `--append-eod`
then writes id 2 at document boundaries, and CPT proceeds normally.
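
At the token level, the effect is easy to picture. The sketch below mimics what `--append-eod` does per document, using only the Hugging Face tokenizer (the document string is just a placeholder):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("geodesic-research/nemotron-base-tokenizer")

doc = "Some pretraining document."  # placeholder text
ids = tok(doc, add_special_tokens=False)["input_ids"]

# --append-eod appends the tokenizer's EOD at each document boundary;
# with this tokenizer that resolves to id 2 (</s>).
ids.append(tok.eos_token_id)
assert ids[-1] == 2
```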
## Empirical evidence
Embedding row norms (`||W_emb[id]||`) for the special tokens below, comparing
the Nemotron-3 Super 120B Base checkpoint to its post-trained chat variant
(...-A12B-BF16):
| id | token | Base (pretraining-only) | Chat (post-trained) |
|---|---|---|---|
| 0 | `<unk>` | 0.57 | 0.57 |
| 1 | `<s>` | 0.00 (untrained) | 0.90 |
| 2 | `</s>` | 0.78 | 0.77 |
| 3 | `[INST]` | 0.00 (untrained) | 0.89 |
| 4 | `[/INST]` | 0.00 (untrained) | 0.90 |
| 10 | `<\|im_start\|>` | 0.00 (untrained) | 0.88 |
| 11 | `<\|im_end\|>` | 0.00 (untrained) | 0.89 |
| 12 | `<think>` | 1.13 | 1.13 |
| 13 | `</think>` | 1.09 | 1.09 |
Tokens 1, 3, 4, 10, and 11 are chat-template scaffolding: only the
post-training stages (SFT / RL) ever wrote into them. For Base CPT, the
only safe document separator is id 2 (`</s>`).
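
The table can be reproduced by reading just the embedding tensor out of the checkpoint shards. A minimal sketch, assuming a local download of the Base checkpoint in the standard Hugging Face sharded-safetensors layout; the directory path and the tensor name `model.embed_tokens.weight` are assumptions to adjust for the actual checkpoint:

```python
import json
from pathlib import Path

from safetensors import safe_open

ckpt_dir = Path("/path/to/NVIDIA-Nemotron-3-Super-120B-A12B-Base-BF16")  # assumed local path
emb_name = "model.embed_tokens.weight"                                    # assumed tensor name

# Locate the shard holding the embedding via the safetensors index file.
index = json.loads((ckpt_dir / "model.safetensors.index.json").read_text())
shard = index["weight_map"][emb_name]

with safe_open(str(ckpt_dir / shard), framework="pt") as f:
    W_emb = f.get_tensor(emb_name).float()

# Print ||W_emb[id]|| for the special tokens in the table above.
for tok_id in [0, 1, 2, 3, 4, 10, 11, 12, 13]:
    print(tok_id, round(W_emb[tok_id].norm().item(), 2))
```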
## What's different from `nemotron-instruct-tokenizer`
| File | Change | Notes |
|---|---|---|
| `special_tokens_map.json` | `eos_token: "</s>"` (was `"<\|im_end\|>"`) | Authoritative for `tokenizer.eos_token_id` |
| `special_tokens_map.json` | `pad_token` removed | Pretraining-format CPT does not pad |
| `tokenizer_config.json` | `eos_token: "</s>"`, `pad_token: null` | Mirrors the change for legacy loaders |
| `chat_template.jinja` | Removed | Base has no chat semantics |
| `tokenizer.json` | Unchanged | Same BPE: 131,072 vocab + 269,443 merges |
Everything else (the BPE model, all 1000 added tokens, normalizer,
pre-tokenizer, decoder, and post-processor) is byte-identical to
`nemotron-instruct-tokenizer`. Token-id mapping for ids 0..131071 is
preserved, so encodings of ordinary text are bit-for-bit identical to the
instruct tokenizer (only the special-token defaults change).
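
A quick way to confirm the bit-for-bit claim locally (a spot check over a few sample strings, not an exhaustive proof):

```python
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("geodesic-research/nemotron-base-tokenizer")
instruct = AutoTokenizer.from_pretrained("geodesic-research/nemotron-instruct-tokenizer")

samples = ["Hello, world!", "def f(x):\n    return x * 2", "Multilingual text: 日本語"]
for s in samples:
    assert base(s, add_special_tokens=False)["input_ids"] == \
           instruct(s, add_special_tokens=False)["input_ids"]

# Only the special-token defaults differ: </s> (2) here, <|im_end|> (11) upstream.
assert base.eos_token_id == 2
assert instruct.eos_token_id == 11
```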
## When to use which Nemotron tokenizer
| Stage | Tokenizer | Why |
|---|---|---|
| Pretraining-format CPT on a `*-Base-BF16` model | `geodesic-research/nemotron-base-tokenizer` | EOD = `</s>` matches Base's pretraining |
| SFT / chat-formatted training on instruct or post-CPT models | `geodesic-research/nemotron-instruct-tokenizer` | EOS = `<\|im_end\|>` matches chat templates |
| Reasoning-trained SFT (think tags) | `geodesic-research/nemotron-think-tokenizer` | think-template defaults |
If you pair a tokenizer with the wrong stage, you will either hit the zero-embedding Inf described above (instruct tokenizer on Base CPT) or train without the chat-template machinery (base tokenizer on SFT).
## Usage
### Megatron `preprocess_data.py`
```bash
python 3rdparty/Megatron-LM/tools/preprocess_data.py \
    --input training.jsonl \
    --json-keys input \
    --output-prefix /path/to/output/tokenized \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model geodesic-research/nemotron-base-tokenizer \
    --append-eod \
    --workers 32
```
Each document in the produced .bin will end with token id 2. Verify after
preprocessing:
```python
from megatron.core.datasets.indexed_dataset import IndexedDataset

# Path prefix written by preprocess_data.py (without the .bin/.idx extension).
ds = IndexedDataset('/path/to/output/tokenized_input_document')

# The last token of the first document should be the </s> separator.
last_token = ds.get(0)[-1]
assert last_token == 2, f'Expected EOD=2, got {last_token}'
```
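
The same check can be extended past the first document; `len(ds)` gives the number of sequences, so a short loop spot-checks more of the corpus:

```python
# Spot-check the first 100 documents (or fewer if the dataset is smaller).
for i in range(min(100, len(ds))):
    assert ds.get(i)[-1] == 2, f'Document {i} does not end with EOD=2'
```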
### Megatron Bridge training YAML
```yaml
tokenizer:
  tokenizer_type: HuggingFaceTokenizer
  tokenizer_model: geodesic-research/nemotron-base-tokenizer
```
The runtime tokenizer must match the one used for preprocessing; otherwise
`tokenizer.eod` and the document-separator id baked into the `.bin` files
will disagree.
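
A cheap pre-flight cross-check, assuming Megatron's HuggingFaceTokenizer wrapper maps `eod` to the Hugging Face `eos_token_id` (and reusing the dataset path from the verification snippet above):

```python
from transformers import AutoTokenizer
from megatron.core.datasets.indexed_dataset import IndexedDataset

tok = AutoTokenizer.from_pretrained('geodesic-research/nemotron-base-tokenizer')
ds = IndexedDataset('/path/to/output/tokenized_input_document')

# The separator baked into the .bin must equal the runtime EOS/EOD id.
assert ds.get(0)[-1] == tok.eos_token_id == 2
```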
## Sanity check
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('geodesic-research/nemotron-base-tokenizer')
assert tok.vocab_size == 131072
assert tok.eos_token_id == 2
assert tok.eos_token == '</s>'
```
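
Because the token-id mapping is unchanged, the chat tokens still exist in the vocabulary; they are just no longer bound as defaults. A small follow-up check:

```python
assert tok.convert_tokens_to_ids('</s>') == 2
assert tok.convert_tokens_to_ids('<|im_end|>') == 11  # still present, just not the EOS
assert tok.eos_token != '<|im_end|>'
```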
## Provenance
Derived from `geodesic-research/nemotron-instruct-tokenizer`, which is in
turn derived from the upstream NVIDIA Nemotron-3 tokenizer. The BPE was not
modified. Only the special-token defaults shipped in
`special_tokens_map.json` and `tokenizer_config.json` were corrected, and
the chat-template Jinja file was removed.