# nemotron-base-tokenizer
A drop-in tokenizer for continued pretraining (CPT) on the NVIDIA Nemotron-3
Base models (Super-120B-A12B-Base-BF16, Nano-30B-A3B-Base-BF16).
It is identical to
`geodesic-research/nemotron-instruct-tokenizer`
in vocabulary and BPE merges, but corrects the special-token bindings so that
`tokenizer.eos_token_id == 2` (`</s>`), the actual document separator the
NVIDIA Nemotron Base models were pretrained with.
## Why this exists
NVIDIA ships their Base model checkpoints (e.g.
`nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-Base-BF16`) with a `tokenizer_config.json`
that declares `eos_token: "<|im_end|>"` (id 11). That assignment is correct
for the chat / instruct variant, but `<|im_end|>` was never trained during
pretraining: its embedding row is exactly zero in the Base checkpoint.
If you tokenize a CPT corpus with `--append-eod` against the upstream tokenizer,
every document ends with token id 11. A Megatron CPT run on the Base checkpoint
then sees frequent occurrences of a token whose embedding is the zero vector,
and the very first backward pass overflows BF16 in the embedding gradient
bucket, surfacing as:

```
RuntimeError: Rank N, ..., iteration 2: Unexpected result inf
(message='found Inf in local grad norm for bucket #0 in backward pass
before data-parallel communication collective')
```
The error is deterministic, reproducible across reruns, and unaffected by LR / PAO / DDP-overlap mitigations because the cause is upstream of the optimizer.
This tokenizer fixes the issue at the source: `eos_token_id` resolves to 2
(`</s>`), which the Base models were trained on. Megatron's `--append-eod`
then writes id 2 at document boundaries, and CPT proceeds normally.
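
At the token level, the effect is easy to picture. The sketch below mimics what `--append-eod` does per document, using only the Hugging Face tokenizer (the document string is just a placeholder):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("geodesic-research/nemotron-base-tokenizer")

doc = "Some pretraining document."  # placeholder text
ids = tok(doc, add_special_tokens=False)["input_ids"]

# --append-eod appends the tokenizer's EOD at each document boundary;
# with this tokenizer that resolves to id 2 (</s>).
ids.append(tok.eos_token_id)
assert ids[-1] == 2
```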
## Empirical evidence
Embedding row norms (`||W_emb[id]||`) for the special tokens below, comparing
the Nemotron-3 Super 120B Base checkpoint to its post-trained chat variant
(...-A12B-BF16):
| id | token | Base (pretraining-only) | Chat (post-trained) |
|---|---|---|---|
| 0 | `<unk>` | 0.57 | 0.57 |
| 1 | `<s>` | 0.00 (untrained) | 0.90 |
| 2 | `</s>` | 0.78 | 0.77 |
| 3 | `[INST]` | 0.00 (untrained) | 0.89 |
| 4 | `[/INST]` | 0.00 (untrained) | 0.90 |
| 10 | `<\|im_start\|>` | 0.00 (untrained) | 0.88 |
| 11 | `<\|im_end\|>` | 0.00 (untrained) | 0.89 |
| 12 | `<think>` | 1.13 | 1.13 |
| 13 | `</think>` | 1.09 | 1.09 |
Tokens 1, 3, 4, 10, and 11 are chat-template scaffolding: only the
post-training stages (SFT / RL) ever wrote into them. For Base CPT, the
only safe document separator is id 2 (`</s>`).
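
The table can be reproduced by reading just the embedding tensor out of the checkpoint shards. A minimal sketch, assuming a local download of the Base checkpoint in the standard Hugging Face sharded-safetensors layout; the directory path and the tensor name `model.embed_tokens.weight` are assumptions to adjust for the actual checkpoint:

```python
import json
from pathlib import Path

from safetensors import safe_open

ckpt_dir = Path("/path/to/NVIDIA-Nemotron-3-Super-120B-A12B-Base-BF16")  # assumed local path
emb_name = "model.embed_tokens.weight"                                    # assumed tensor name

# Locate the shard holding the embedding via the safetensors index file.
index = json.loads((ckpt_dir / "model.safetensors.index.json").read_text())
shard = index["weight_map"][emb_name]

with safe_open(str(ckpt_dir / shard), framework="pt") as f:
    W_emb = f.get_tensor(emb_name).float()

# Print ||W_emb[id]|| for the special tokens in the table above.
for tok_id in [0, 1, 2, 3, 4, 10, 11, 12, 13]:
    print(tok_id, round(W_emb[tok_id].norm().item(), 2))
```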
## What's different from `nemotron-instruct-tokenizer`
| File | Change | Notes |
|---|---|---|
| `special_tokens_map.json` | `eos_token: "</s>"` (was `"<\|im_end\|>"`) | Authoritative for `tokenizer.eos_token_id` |
| `special_tokens_map.json` | `pad_token` removed | Pretraining-format CPT does not pad |
| `tokenizer_config.json` | `eos_token: "</s>"`, `pad_token: null` | Mirrors the change for legacy loaders |
| `chat_template.jinja` | Removed | Base has no chat semantics |
| `tokenizer.json` | Unchanged | Same BPE: 131,072 vocab + 269,443 merges |
Everything else (the BPE model, all 1000 added tokens, normalizer,
pre-tokenizer, decoder, and post-processor) is byte-identical to
`nemotron-instruct-tokenizer`. Token-id mapping for ids 0..131071 is
preserved, so encodings of ordinary text are bit-for-bit identical to the
instruct tokenizer (only the special-token defaults change).
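
A quick way to confirm the bit-for-bit claim locally (a spot check over a few sample strings, not an exhaustive proof):

```python
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("geodesic-research/nemotron-base-tokenizer")
instruct = AutoTokenizer.from_pretrained("geodesic-research/nemotron-instruct-tokenizer")

samples = ["Hello, world!", "def f(x):\n    return x * 2", "Multilingual text: 日本語"]
for s in samples:
    assert base(s, add_special_tokens=False)["input_ids"] == \
           instruct(s, add_special_tokens=False)["input_ids"]

# Only the special-token defaults differ: </s> (2) here, <|im_end|> (11) upstream.
assert base.eos_token_id == 2
assert instruct.eos_token_id == 11
```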
## When to use which Nemotron tokenizer
| Stage | Tokenizer | Why |
|---|---|---|
| Pretraining-format CPT on a `*-Base-BF16` model | `geodesic-research/nemotron-base-tokenizer` | EOD = `</s>` matches Base's pretraining |
| SFT / chat-formatted training on instruct or post-CPT models | `geodesic-research/nemotron-instruct-tokenizer` | EOS = `<\|im_end\|>` matches chat templates |
| Reasoning-trained SFT (think tags) | `geodesic-research/nemotron-think-tokenizer` | think-template defaults |
If you pair a tokenizer with the wrong stage, you will either hit the zero-embedding Inf described above (instruct tokenizer on Base CPT) or train without the chat-template machinery (base tokenizer on SFT).
## Usage
### Megatron `preprocess_data.py`
```bash
python 3rdparty/Megatron-LM/tools/preprocess_data.py \
    --input training.jsonl \
    --json-keys input \
    --output-prefix /path/to/output/tokenized \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model geodesic-research/nemotron-base-tokenizer \
    --append-eod \
    --workers 32
```
Each document in the produced .bin will end with token id 2. Verify after
preprocessing:
```python
from megatron.core.datasets.indexed_dataset import IndexedDataset

# Path prefix written by preprocess_data.py (without the .bin/.idx extension).
ds = IndexedDataset('/path/to/output/tokenized_input_document')

# The last token of the first document should be the </s> separator.
last_token = ds.get(0)[-1]
assert last_token == 2, f'Expected EOD=2, got {last_token}'
```
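
The same check can be extended past the first document; `len(ds)` gives the number of sequences, so a short loop spot-checks more of the corpus:

```python
# Spot-check the first 100 documents (or fewer if the dataset is smaller).
for i in range(min(100, len(ds))):
    assert ds.get(i)[-1] == 2, f'Document {i} does not end with EOD=2'
```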
### Megatron Bridge training YAML
```yaml
tokenizer:
  tokenizer_type: HuggingFaceTokenizer
  tokenizer_model: geodesic-research/nemotron-base-tokenizer
```
The runtime tokenizer must match the one used for preprocessing; otherwise
`tokenizer.eod` and the document-separator id baked into the `.bin` files
will disagree.
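
A cheap pre-flight cross-check, assuming Megatron's HuggingFaceTokenizer wrapper maps `eod` to the Hugging Face `eos_token_id` (and reusing the dataset path from the verification snippet above):

```python
from transformers import AutoTokenizer
from megatron.core.datasets.indexed_dataset import IndexedDataset

tok = AutoTokenizer.from_pretrained('geodesic-research/nemotron-base-tokenizer')
ds = IndexedDataset('/path/to/output/tokenized_input_document')

# The separator baked into the .bin must equal the runtime EOS/EOD id.
assert ds.get(0)[-1] == tok.eos_token_id == 2
```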
## Sanity check
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('geodesic-research/nemotron-base-tokenizer')
assert tok.vocab_size == 131072
assert tok.eos_token_id == 2
assert tok.eos_token == '</s>'
```
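
Because the token-id mapping is unchanged, the chat tokens still exist in the vocabulary; they are just no longer bound as defaults. A small follow-up check:

```python
assert tok.convert_tokens_to_ids('</s>') == 2
assert tok.convert_tokens_to_ids('<|im_end|>') == 11  # still present, just not the EOS
assert tok.eos_token != '<|im_end|>'
```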
## Provenance
Derived from `geodesic-research/nemotron-instruct-tokenizer`, which is in
turn derived from the upstream NVIDIA Nemotron-3 tokenizer. The BPE was not
modified. Only the special-token defaults shipped in
`special_tokens_map.json` and `tokenizer_config.json` were corrected, and
the chat-template Jinja file was removed.