nemotron-base-tokenizer

A drop-in tokenizer for continued pretraining (CPT) on the NVIDIA Nemotron-3 Base models (Super-120B-A12B-Base-BF16, Nano-30B-A3B-Base-BF16).

It is identical to geodesic-research/nemotron-instruct-tokenizer in vocabulary and BPE merges, but corrects the special-token bindings so that tokenizer.eos_token_id == 2 (</s>), the actual document separator the NVIDIA Nemotron Base models were pretrained with.

Why this exists

NVIDIA ships their Base model checkpoints (e.g. nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-Base-BF16) with a tokenizer_config.json that declares eos_token: "<|im_end|>" (id 11). That assignment is correct for the chat / instruct variant, but <|im_end|> was never trained during pretraining: its embedding row is exactly zero in the Base checkpoint.

If you tokenize a CPT corpus with --append-eod against the upstream tokenizer, every document ends with token id 11. A Megatron CPT run on the Base checkpoint then sees frequent occurrences of a token whose embedding is the zero vector, and the very first backward pass overflows BF16 in the embedding gradient bucket, surfacing as:

```
RuntimeError: Rank N, ..., iteration 2: Unexpected result inf
(message='found Inf in local grad norm for bucket #0 in backward pass
before data-parallel communication collective')
```

The error is deterministic, reproducible across reruns, and unaffected by LR / PAO / DDP-overlap mitigations because the cause is upstream of the optimizer.
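The mismatch is visible directly from the upstream tokenizer files. A minimal sketch (assumes hub access to the checkpoint named above):

```python
# Sketch: the upstream Base checkpoint's tokenizer resolves EOS to the
# chat-template token <|im_end|> (id 11), which is untrained in Base.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-Base-BF16")
print(tok.eos_token, tok.eos_token_id)  # expected: <|im_end|> 11
```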

This tokenizer fixes the issue at the source: eos_token_id resolves to 2 (</s>), which the Base models were trained on. Megatron's --append-eod then writes id 2 at document boundaries, and CPT proceeds normally.

Empirical evidence

Embedding row norms (||W_emb[id]||) for the same special tokens, comparing the Nemotron-3 Super 120B Base checkpoint to its post-trained chat variant (...-A12B-BF16):

| id | token | Base (pretraining-only) | Chat (post-trained) |
|----|-------|-------------------------|---------------------|
| 0  | <unk> | 0.57 | 0.57 |
| 1  | <s> | 0.00 (untrained) | 0.90 |
| 2  | </s> | 0.78 | 0.77 |
| 3  | [INST] | 0.00 (untrained) | 0.89 |
| 4  | [/INST] | 0.00 (untrained) | 0.90 |
| 10 | <\|im_start\|> | 0.00 (untrained) | 0.88 |
| 11 | <\|im_end\|> | 0.00 (untrained) | 0.89 |
| 12 | <think> | 1.13 | 1.13 |
| 13 | </think> | 1.09 | 1.09 |

Tokens 1, 3, 4, 10, and 11 are chat-template scaffolding; only the post-training stages (SFT / RL) ever wrote into them. For Base CPT, the only safe document separator is id 2 (</s>).
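The norms above can be reproduced with a short script. A sketch, not the exact measurement code: it assumes the checkpoint loads through transformers with enough host memory for the weights (the approach is the same for the Nano variant), and depending on the model implementation you may additionally need trust_remote_code=True:

```python
# Sketch: compute embedding row norms for the special tokens above.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-Base-BF16",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)
emb = model.get_input_embeddings().weight  # [vocab_size, hidden]

for tok_id in (0, 1, 2, 3, 4, 10, 11, 12, 13):
    norm = emb[tok_id].float().norm().item()
    flag = " (untrained)" if norm == 0.0 else ""
    print(f"id {tok_id:3d}: ||W_emb|| = {norm:.2f}{flag}")
```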

What's different from nemotron-instruct-tokenizer

| File | Change | Notes |
|------|--------|-------|
| special_tokens_map.json | eos_token: "</s>" (was "<\|im_end\|>") | Authoritative for tokenizer.eos_token_id |
| special_tokens_map.json | pad_token removed | Pretraining-format CPT does not pad |
| tokenizer_config.json | eos_token: "</s>", pad_token: null | Mirrors the change for legacy loaders |
| chat_template.jinja | Removed | Base has no chat semantics |
| tokenizer.json | Unchanged | Same BPE: 131,072 vocab + 269,443 merges |

Everything else (the BPE model, all 1000 added tokens, normalizer, pre-tokenizer, decoder, and post-processor) is byte-identical to nemotron-instruct-tokenizer. The token-id mapping for ids 0..131071 is preserved, so encodings of ordinary text are bit-for-bit identical to the instruct tokenizer; only the special-token defaults change.
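A quick way to confirm the equivalence claim yourself (a minimal sketch; the sample string is arbitrary):

```python
# Sketch: ordinary text encodes identically under both tokenizers.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("geodesic-research/nemotron-base-tokenizer")
instruct = AutoTokenizer.from_pretrained("geodesic-research/nemotron-instruct-tokenizer")

sample = "Continued pretraining needs the right document separator."
assert base.encode(sample, add_special_tokens=False) == \
       instruct.encode(sample, add_special_tokens=False)
```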

When to use which Nemotron tokenizer

| Stage | Tokenizer | Why |
|-------|-----------|-----|
| Pretraining-format CPT on a *-Base-BF16 model | geodesic-research/nemotron-base-tokenizer | EOD = </s> matches Base's pretraining |
| SFT / chat-formatted training on instruct or post-CPT models | geodesic-research/nemotron-instruct-tokenizer | EOS = <\|im_end\|> matches chat templates |
| Reasoning-trained SFT (think tags) | geodesic-research/nemotron-think-tokenizer | Think-template defaults |

Pick the wrong tokenizer for a stage and you will either hit the zero-embedding Inf described above (instruct tokenizer on Base CPT) or train without the chat-template machinery (base tokenizer on SFT).

Usage

Megatron preprocess_data.py

```bash
python 3rdparty/Megatron-LM/tools/preprocess_data.py \
  --input training.jsonl \
  --json-keys input \
  --output-prefix /path/to/output/tokenized \
  --tokenizer-type HuggingFaceTokenizer \
  --tokenizer-model geodesic-research/nemotron-base-tokenizer \
  --append-eod \
  --workers 32
```

Each document in the produced .bin will end with token id 2. Verify after preprocessing:

```python
from megatron.core.datasets.indexed_dataset import IndexedDataset

# Path prefix = --output-prefix + "_" + json key + "_document" (no extension)
ds = IndexedDataset('/path/to/output/tokenized_input_document')
last_token = ds.get(0)[-1]
assert last_token == 2, f'Expected EOD=2, got {last_token}'
```

Megatron Bridge training YAML

```yaml
tokenizer:
  tokenizer_type: HuggingFaceTokenizer
  tokenizer_model: geodesic-research/nemotron-base-tokenizer
```

The runtime tokenizer must match the one used for preprocessing; otherwise tokenizer.eod and the document-separator id baked into the .bin files will disagree.

Sanity check

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('geodesic-research/nemotron-base-tokenizer')
assert tok.vocab_size == 131072
assert tok.eos_token_id == 2
assert tok.eos_token == '</s>'
```
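For pipelines that do not go through Megatron's --append-eod, the same separator can be appended by hand (a sketch continuing from the snippet above; the document string is arbitrary):

```python
# Sketch: manual EOD appending, equivalent to what --append-eod writes.
ids = tok.encode("some document text", add_special_tokens=False)
ids.append(tok.eos_token_id)  # appends id 2 (</s>) at the document boundary
```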

Provenance

Derived from geodesic-research/nemotron-instruct-tokenizer, which is in turn derived from the upstream NVIDIA Nemotron-3 tokenizer. The BPE was not modified. Only the special-token defaults shipped in special_tokens_map.json and tokenizer_config.json were corrected, and the chat-template Jinja file was removed.
