LilTii: A 0.6B Bengali Language Model that Outperforms Qwen

Community Article Published March 5, 2026


TL;DR

Big multilingual foundation models pretty much run the show in NLP right now — but that dominance has also made language inequality worse. Low-resource languages often get the short end of the stick. In this work, we introduce LilTii, a 0.6B-parameter Bengali language model trained completely from scratch to help close that gap. Unlike earlier Bengali models that simply continue training from large, opaque multilingual models, LilTii is built through a fully transparent, reproducible pipeline. It's specifically designed to work well even in limited-compute environments. To make this happen, we compiled a high-quality Bengali corpus using both heuristic and learned filtering (LLM-as-a-judge). We supplemented it with carefully curated English data for bilingual augmentation. Using this dataset, we experiment with different training recipes for small-scale Bengali models. Across a wide range of Bengali benchmarks, LilTii consistently outperforms similarly sized multilingual models such as Qwen2.5-0.5B and Qwen3-0.6B. The takeaway? Maybe there is still room for pre-training in the small-scale/low-resource language scene.

Introduction

Deep learning has completely reshaped the NLP landscape. Thanks to Transformer architectures, we can now train models in parallel on massive datasets—and they perform incredibly well on everything from translation to text generation. But there's a catch: this whole paradigm runs on compute. Scaling laws have shown again and again that performance improves when you scale up data, model size, and compute (Hoffmann et al. 2022). Bigger models + more data + more FLOPs = better results.

Take Qwen2.5 (Qwen, 2024), for example. They released dense decoder models ranging from 0.5B to 72B parameters, all trained on roughly 18 trillion tokens. If you plug the numbers into Kaplan et al.'s formula (2020), you're looking at compute budgets reaching up to 6e24 FLOPs. Qwen3 models seem to operate at a similar compute scale, but with an even larger dataset—around 36 trillion tokens (Qwen, 2025).

The problem is that this “just scale it” mindset has quietly widened the gap in NLP. There are about 7,000 languages spoken around the world, and most of them are low-resource and at risk of disappearing. Yet high-quality datasets exist for only a small fraction of them (Cohere Labs, 2024).

And being “low-resource” isn't just about having less web data. It's also about infrastructure, funding, access to compute, and technical expertise. These structural issues make it much harder for many communities to participate in foundation model development. Even though more open-weight models are being released—like the Qwen and LLaMA families—they only partially solve the problem. Most don't fully disclose how they were built. Truly open, reproducible research at the cutting edge of foundation models is still concentrated in a small number of institutions, like AllenAI and Hugging Face.

To address this growing language gap, the NLP community has increasingly leaned into multilingual foundation models. The idea sounds great: train one big model that works across dozens of languages, so the benefits of large-scale language modeling aren't limited to English and a handful of other high-resource languages. But in practice, things aren't that simple.

Multilingual models still suffer from major data imbalances. High-resource languages dominate both the quantity and quality of training data, while low-resource languages are often included in name more than in substance. As a result, monolingual models frequently outperform multilingual ones on tasks in their own language. We've already seen this in Finnish (Virtanen et al. 2019), French (Martin et al. 2019), Catalan (Armengol-Estapé et al. 2021), and Portuguese (Corrêa et al. 2024).

So while multilingual models offer breadth, they often lack depth for individual languages—especially when enough monolingual data is available. Similarly, continual pretraining (CPT) can be a practical shortcut for building capable models under tight resource constraints. But it has clear limitations, particularly when applied to black-box foundation models where the original pretraining data, training recipes, and hyperparameters aren't fully disclosed.

This is where our work comes in.

We focus on Bengali, one of the most widely spoken languages in the world. Despite its rich literary and cultural history, it remains underrepresented in NLP. To help change that, we introduce LilTii, a 0.6B-parameter Bengali language model trained entirely from scratch on a carefully curated Bengali dataset. Instead of diluted multilingual coverage, we prioritize building a strong native foundation for one language.

Beyond the model itself, we also contribute a fully open and reproducible pipeline for training language models in low-resource settings. Building on existing open research, we adapt and tweak established techniques to work within our constraints — limited compute and limited high-quality data.

The outcome is a collection of resources aimed at advancing Bengali NLP:

  • A high-quality Bengali dataset (knowledge cut-off: December 2025) with over 23 billion tokens from diverse domains, filtered using both heuristic and learning-based methods.
  • Auxiliary annotated datasets for training Bengali-specific quality and toxicity filters, plus two lightweight models for these tasks—enabling scalable data cleaning without relying on external APIs.
  • A comprehensive, reproducible evaluation suite for benchmarking NLP models on Bengali tasks.
  • LilTii itself: a native Bengali model that outperforms similarly sized multilingual Qwen models across benchmarks in our custom evaluation suite.

All source code used, datasets, and models are released under permissive licenses, making it easier for others to build on this work and continue pushing Bengali NLP forward.

State-of-the-Art

Work on Bengali language models has definitely picked up in recent years, especially with the broader push toward multilingual and open-source NLP. That said, progress has been scattered. Most efforts rely on continual pretraining of existing multilingual giants instead of building fully independent pipelines from scratch. On top of that, issues like data quality, reproducibility, and thorough evaluation keep coming up.

Let's walk through some of the key contributions.

One of the earlier Bengali-focused efforts was SahajBERT (Diskin et al. 2021). Built using the DeDLOC framework, it trained an ALBERT-large-style model on Bengali Wikipedia and OSCAR—but here's the twist: it was trained across 91 volunteer devices, including CPU-only machines. Pretty impressive. It achieved strong results (like 95.45% F1 on WikiANN and 91.97% accuracy on news classification), beating IndicBERT and bnRoBERTa and even competing with XLM-R Large. The model was publicly released on Hugging Face, demonstrating that distributed training is possible. Still, it relied on relatively noisy data and was small by modern LLM standards.

Then came BanglaBERT and BanglishBERT (Bhattacharjee et al. 2022). These models were trained on 27.5 GB (2.18B tokens) of curated Bengali text from the Bangla2B+ corpus. Notably, they avoided OSCAR due to concerns about offensive content. With a 32k WordPiece vocabulary, they handled code-switching and romanized Bengali, and BanglishBERT even reserved half its vocabulary for English to enable zero-shot transfer. They performed well on NLU tasks, but the dataset remains proprietary due to ethical concerns about incomplete cleaning.

As generative models took over, the focus shifted to decoder architectures like LLaMA and continual pretraining. BongLLaMA (Zehady et al. 2024) fine-tuned LLaMA-2/3 models (1B–8B) on ~6.8B Bengali tokens from CulturaX. They expanded the tokenizer and instruction-tuned using Bangla-Alpaca-Orca (a translated Alpaca dataset plus filtered OpenOrca data). Training was efficient thanks to LoRA (around 40 GPU-hours per epoch on A100s). Evaluation used GPT-4o as an LLM judge on 120 queries, with some manual review. The models were released under Llama's community license. Still, translation artifacts and opaque evaluation protocols — especially when relying on proprietary APIs — make it harder to fully reproduce or extend the work.

A different direction came from Goldfish (Chang et al. 2024). Instead of continual pretraining, Goldfish trained small monolingual autoregressive Transformers (up to 125M parameters) independently for 350 languages, including Bengali. Interestingly, they showed that large multilingual models like XGLM (4.5B) and BLOOM (7.1B) sometimes underperform even bigram baselines on FLORES perplexity for low-resource languages. Despite being 10× smaller, Goldfish beat BLOOM, XGLM, and MaLA-500 on 98 of 204 FLORES languages. That's a strong signal: language-specific pretraining can outperform massive multilingual scaling. That said, Goldfish models are still too small and lightly trained to compete with today's models.

Returning to continual pretraining, TituLLMs (Kabir et al. 2025) adapted LLaMA-3.2 (1B and 3B) using a 37B-token Bengali corpus from web data, books, and synthetic sources. Trained for 1750 H100 hours, they did well on commonsense reasoning (e.g., 0.60 on PIQA for the 3B model) but struggled on knowledge-heavy tasks like Bangla MMLU. The team released a clean corpus and an EleutherAI-style evaluation harness on GitHub (which we built on in our work), contributing valuable baselines and tools.

More recently, TigerLLM (Raihan and Zampieri, 2025) argued that earlier efforts relied too heavily on low-quality translated or OSCAR-derived data. Inspired by Gunasekar et al. (2023), they emphasized data quality over scale. Their models (1B based on LLaMA-3.2 and 9B on Gemma-2) were continually pretrained on Bangla-TextBook (10M tokens from 163 textbooks) and then fully fine-tuned (no LoRA) on Bangla-Instruct—a 100k example dataset generated via self-instruct using GPT-4o and Claude-3.5-Sonnet. The models and datasets are on Hugging Face.

Looking across all these efforts, a few patterns stand out:

  • Continual pretraining dominates, largely because full training-from-scratch pipelines are expensive.
  • Data filtering often relies on heuristics rather than learned filters, which means noisy (and sometimes toxic) web data still slips through.
  • While many projects contribute valuable assets — datasets, baselines, evaluation harnesses — fully reproducible end-to-end LLM stacks (with code and documentation) are still rare.

On the bright side, open research projects like OLMo (2024), SmolLM (2025), LLM360 (2024), DeepSeek (2024), and Apertus (Hernández-Cano et al. 2025) provide detailed training recipes and transparency. The challenge? These approaches often assume access to large compute clusters and huge datasets — not exactly realistic for low-resource language communities. So the real question is how to adapt those ideas under tighter constraints.

That brings us to the core question behind our work:

  • "How can we effectively leverage limited compute and high-quality data to train competitive Bengali language models? Can strategic data curation and modern multistage training recipes allow us to compete with much larger multilingual models?"

In essence, two main ideas can guide our approach:

  • Focusing on Bengali-specific pretraining could let us match the performance of much larger multilingual models—without needing huge compute resources.
  • Using carefully chosen high-quality data from high-resource languages like English might give an extra boost, especially when there's a natural cross-lingual connection.

In short, with smart data curation and training strategies, it's possible to build strong Bengali LLMs on a smaller scale—similar to what Goldfish (Chang et al. 2024) showed, where smaller monolingual models outperformed much bigger multilingual ones.

Methods

Here, we'll walk through how we built our pretraining dataset, trained the tokenizer, designed the model architecture, and picked hyperparameters. We'll also cover how we optimized performance and set up the experiments to train and evaluate our models.

Building a High-Quality Bengali Corpus

Working with low-resource languages comes with a big challenge: there just isn't enough data. While large-scale web crawls like CommonCrawl (CC), OSCAR (Abadji et al. 2022), and other CC-derived datasets provide a starting point, their coverage of low-resource languages such as Bengali is often sparse. Plus, they tend to include a lot of noise—low-quality text, duplicates, or offensive content.

To tackle this, we took a multi-pronged approach to build a high-quality Bengali corpus. First, we pulled data from existing datasets on Hugging Face. These datasets are curated or semi-curated, giving us a solid foundation (see Table 2). Unlike raw web crawls, they already include some quality control, so we could skip basic cleaning like stripping HTML tags.

But relying on these datasets alone wouldn't give us enough diversity, so we also added web-crawled data from more recent CommonCrawl snapshots. To clean and filter this data, we implemented the FineWeb2 pipeline (Penedo et al. 2025). We supplemented it with a learned-filter approach using an LLM-as-a-Judge, inspired by works like FineWeb-Edu (Penedo et al. 2024) and the Phi paper (Gunasekar et al. 2023).

Text Extraction → LID → Heuristic Filters → Deduplication

We wanted to build a solid Bengali text corpus, so we set up a data pipeline inspired by the FineWeb2 approach and used the Datatrove library. The pipeline handles everything: extracting text, identifying language, filtering for quality, and removing duplicates, so we end up with clean, usable data. In short:

  • We use the Trafilatura library (Barbaresi, 2021) to extract text.
  • We filter documents tied to "bad" URLs, using blocklists like Maravento's blackweb to skip low-quality or inappropriate sites.
  • We perform two rounds of LID using FastText (Joulin et al. 2016) (threshold 0.65) and GlotLID (Kargaran et al. 2023) (threshold 0.87).
  • For (heuristic) quality filtering, we use GopherRepetitionFilter, FineWebQualityFilter, and GopherQualityFilter.
  • We implement a deduplication pipeline using the MinHash algorithm to address redundancy.
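For intuition, here's a minimal, self-contained sketch of the kind of checks Gopher-style quality filters perform. This is an illustrative toy, not Datatrove's actual implementation, and the thresholds (`min_words`, `max_dup_line_frac`, `min_alpha_frac`) are made-up defaults:

```python
def passes_heuristics(text: str,
                      min_words: int = 50,
                      max_dup_line_frac: float = 0.3,
                      min_alpha_frac: float = 0.7) -> bool:
    """Toy Gopher-style quality check: length, duplicate lines, symbol ratio."""
    words = text.split()
    # Rule 1: drop documents that are too short to be useful
    if len(words) < min_words:
        return False
    # Rule 2: drop documents dominated by repeated lines (boilerplate, spam)
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    if lines:
        dup_frac = 1 - len(set(lines)) / len(lines)
        if dup_frac > max_dup_line_frac:
            return False
    # Rule 3: require most "words" to contain at least one alphabetic character
    alpha_frac = sum(any(c.isalpha() for c in w) for w in words) / len(words)
    return alpha_frac >= min_alpha_frac
```

The real filters also look at symbol-to-word ratios, bullet-line density, stop-word coverage, and more; the Datatrove implementations we used are the authoritative reference.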

Distilling Quality Annotations via LLM-as-a-Judge

Learned filters can add an extra layer of quality control beyond the usual rule-based ones. These filters, usually powered by large language models, look at things like context and coherence—stuff that simple heuristics can miss.

We used Qwen/Qwen2.5-32B-Instruct to review documents that passed the initial filters. We chose the 32B-Instruct model for its strong performance and Apache-2.0 license, which allows us to share the annotated data openly.

Two different prompts were used to assess the quality of documents:

  1. Educational Quality: This prompt is inspired by the one used in the FineWeb-Edu dataset. It prompts the judge model to evaluate whether the document is suitable for educational purposes and to rank it on a 5-point scale. See the prompt here.
  2. Toxicity: This prompt is designed to assess the presence of toxic or offensive content in the document. It asks the judge model to classify the document based on its toxicity level, ranking it on a 5-point scale (1 = non-toxic, 5 = highly toxic). See the prompt here.

We started by annotating a random sample of 320,000 documents from our filtered corpus, ensuring equal representation across all subsets.

Using these annotations, we trained two lightweight classification models. Since Bengali doesn't have many BERT-style models, we compared two native options—BanglaBERT (Bhattacharjee et al. 2022) and SahajBERT (Diskin et al. 2021)—and picked the best one.

Training ran for 20 epochs with a batch size of 256 (max length 512), using AdamW with betas=(0.9, 0.999), epsilon=1e-8, weight_decay=0, and a learning rate starting at 3e-4 and linearly decayed to zero. This gave us the following results:

Performance of Educational and Toxicity Classifiers on Held-Out Test Set.

| Model Name | Task | Precision | Recall | F1 Macro | Accuracy |
|---|---|---|---|---|---|
| BanglaBERT | Educational Classification | 0.57 | 0.54 | 0.56 | 0.73 |
| SahajBERT | Educational Classification | 0.53 | 0.43 | 0.45 | 0.67 |
| BanglaBERT | Toxicity Classification | 0.64 | 0.59 | 0.61 | 0.77 |
| SahajBERT | Toxicity Classification | 0.58 | 0.38 | 0.39 | 0.63 |

Our results show that BanglaBERT outperforms SahajBERT on both tasks, achieving higher F1 Macro and accuracy scores. When the 5-point scales are collapsed into a binary decision, results improve further: educational classification reaches an accuracy of 82.02% with an F1 score of 0.68, and toxicity classification is stronger still, with 90.60% accuracy and an F1 score of 0.82. These results gave us confidence in using these models as learned filters in our corpus curation process, specifically to filter out low-quality and toxic content.

Final Dataset Composition

Once the full filtering pipeline was done, we did a couple more clean-up steps: (1) removed docs with fewer than 50 tokens, and (2) removed docs with a toxicity score above 3. Bengali web data can be tricky when it comes to toxicity, so we kept the filtered-out docs as a separate subset—useful for anyone working on toxicity detection or mitigation in Bengali NLP. Both subsets are on Hugging Face, and you can find more info in the dataset card.

Tokenization

To properly train our Bengali language model, we needed a tokenizer that really captures the quirks of the Bengali script. So, we built one from scratch using a carefully chosen set of documents—2 million high-quality ones (Edu score ≥ 3) from our main corpus. Since we also plan to work with Bengali-English mixed data, we added 2 million more documents from the top part of the FineWeb-Edu dataset (Edu score ≥ 3) and 975,000 samples from bigcode/starcoderdata. This way, our tokenizer can handle code-mixed data, which is handy for future Bengali code-mixed LLMs.

To see how well our tokenizer works, we looked at two metrics from Rust et al. 2020:

  • Subword Fertility (SF): The average number of subword tokens per word. Lower means fewer splits; 1.0 is ideal, meaning every word maps to a single token in the vocabulary.
  • Proportion of Continued Words (PCW): The fraction of words split into two or more tokens. 0 means words are never split; 1 means they always are.
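Both metrics are easy to compute for any tokenizer. A small sketch, where the `tokenize` callable stands in for a real tokenizer (e.g. `lambda w: hf_tokenizer.tokenize(w)`):

```python
from typing import Callable, List

def tokenizer_stats(words: List[str],
                    tokenize: Callable[[str], List[str]]) -> dict:
    """Compute Subword Fertility (SF) and Proportion of Continued Words (PCW).

    SF  = average number of subword tokens per word (1.0 is ideal).
    PCW = fraction of words split into two or more tokens.
    """
    token_counts = [len(tokenize(w)) for w in words]
    fertility = sum(token_counts) / len(words)
    pcw = sum(c >= 2 for c in token_counts) / len(words)
    return {"fertility": fertility, "pcw": pcw}
```

In practice you would run this over a held-out word list (we used 31,500 words), the same way for every tokenizer being compared.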

Here's how our custom tokenizer stacks up against some common baselines.

| Tokenizer | Number of tokens (for 31,500 words) | Vocabulary size | Fertility | PCW | UNK count |
|---|---|---|---|---|---|
| LilTii | 52,299 | 49,152 | 1.66 | 0.57 | 0 |
| SahajBERT | 68,562 | 32,000 | 2.17 | 0.49 | 1 |
| BanglaBERT | 60,380 | 32,000 | 1.91 | 0.39 | 935 |
| Qwen2.5 | 129,273 | 151,665 | 4.10 | 0.63 | 0 |
| Falcon-H1 | 286,166 | 32,768 | 9.08 | 0.68 | 0 |
| Llama-3.2 | 141,980 | 128,256 | 4.50 | 0.64 | 0 |
| Gemma-2-2b | 80,569 | 256,000 | 2.55 | 0.58 | 0 |

Evaluation Suite

Before diving into model training, we had two main goals: (1) put together a solid evaluation setup to test our models across different tasks in a reproducible way, and (2) set up baselines using existing multilingual models to give our results some context.

For evaluations, we surveyed prior work on Bengali benchmarking and focused on setups that could be easily automated, such as those built on EleutherAI's Language Model Evaluation Harness (Gao et al. 2024). We found two studies that fit the bill: Nahin et al. (2025) and Lai et al. (2023).

Nahin et al. created five new Bengali benchmarking datasets—Bangla MMLU, BoolQ BN, CommonsenseQA BN, OpenBookQA BN, and PIQA BN. Lai et al. translated four English datasets—ARC-Challenge, HellaSwag, MMLU, and TruthfulQA—into Bengali (and other languages). Both teams provided working forks of the EleutherAI evaluation harness, which we used as a starting point for our own tests. All these evaluations are unified in the following fork/branch of the EleutherAI evaluation harness: Polygl0t/lm-evaluation-harness.

Summary of Bengali Evaluation Benchmarks

| Benchmark | n-shot | Baseline | Metric |
|---|---|---|---|
| BanglaMMLU | 5-shot | 0.25 | acc norm |
| BoolQ-BN | 5-shot | 0.50 | acc norm |
| CommonSenseQA-BN | 5-shot | 0.20 | acc norm |
| OpenBookQA-BN | 5-shot | 0.25 | acc norm |
| PIQA-BN | 5-shot | 0.50 | acc norm |
| ARC-Challenge | 5-shot | 0.25 | acc norm |
| MMLU | 5-shot | 0.25 | acc norm |
| HellaSwag | 0-shot | 0.25 | acc norm |
| TruthfulQA | 0-shot | 0.225 | bleurt |

For a baseline, we picked models from the Qwen series (Qwen2.5, 2024, Qwen3, 2025) because they (1) do really well on multilingual benchmarks and (2) have a similar size to our models (~0.6B parameters).

To get a single score summarizing performance across all benchmarks, we used the Normalized Preferred Metric (NPM) from Pires et al. (2023). NPM basically normalizes each task's score so that tasks with easier baselines (like 50% for PIQA) don't skew the average compared to harder tasks (like 25% for OpenBookQA). The formula is:

$$\text{NPM} = \frac{1}{N} \sum_{i=1}^{N} 100 \times \frac{\text{Preferred Metric}_i - \text{Random Score}_i}{\text{Max Score}_i - \text{Random Score}_i}$$
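In code, the NPM aggregation looks like this (assuming a max score of 1.0 per task when none is given, which matches accuracy-style benchmarks):

```python
def npm(scores, random_scores, max_scores=None):
    """Normalized Preferred Metric (Pires et al. 2023): rescale each task so
    that random guessing maps to 0 and a perfect score maps to 100, then
    average across tasks."""
    if max_scores is None:
        max_scores = [1.0] * len(scores)  # accuracy-style tasks max out at 1.0
    normalized = [
        100 * (s - r) / (m - r)
        for s, r, m in zip(scores, random_scores, max_scores)
    ]
    return sum(normalized) / len(normalized)
```

For example, a PIQA-BN accuracy of 0.61 against its 0.50 random baseline contributes 22 normalized points, while 0.61 against a 0.25 baseline would contribute 48, so easy-baseline tasks no longer dominate the average.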

Model Architecture

Our model architecture is based on the Llama architecture, which incorporates:

  • RMSnorm for normalization (Zhang and Sennrich, 2019).
  • RoPE positional embeddings (Su et al. 2021).
  • SwiGLU activations (Shazeer, 2020).

The dimensions of our model, a 0.6B (670,127,616) parameter model, are summarized below.

| Parameter | Value |
|---|---|
| Activation function | SwiGLU |
| Hidden layer size | 1,536 |
| Feed-forward (intermediate) size | 3,072 |
| Maximum context length | 4,096 tokens |
| Number of attention heads | 16 |
| Number of layers | 28 |
| Attention head dimension | 96 |
| Number of key/value heads | 8 |
| Tied input/output embeddings | True |
| Vocabulary size | 49,152 |
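As a sanity check, the 670,127,616 figure can be reproduced from these dimensions, assuming a standard Llama-style layout with grouped-query attention, SwiGLU MLPs, per-layer RMSNorms, and tied embeddings (this is our reading of the configuration, not code from the training repo):

```python
def count_params(vocab=49152, d_model=1536, d_ff=3072, n_layers=28,
                 n_heads=16, n_kv_heads=8, head_dim=96):
    """Parameter count for a Llama-style decoder with GQA, SwiGLU MLPs,
    RMSNorm, and tied input/output embeddings."""
    embed = vocab * d_model                       # shared in/out embedding
    q_proj = d_model * n_heads * head_dim         # query projection
    kv_proj = 2 * d_model * n_kv_heads * head_dim # key + value (GQA: 8 heads)
    o_proj = n_heads * head_dim * d_model         # output projection
    attn = q_proj + kv_proj + o_proj
    mlp = 3 * d_model * d_ff                      # gate, up, down (SwiGLU)
    norms = 2 * d_model                           # pre-attn + pre-mlp RMSNorm
    return embed + n_layers * (attn + mlp + norms) + d_model  # + final norm
```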

Three main factors shaped our choice of model architecture and size: GPU efficiency, model expressiveness, and avoiding early saturation.

  • GPU efficiency: We followed NVIDIA's guidelines to keep key dimensions GPU-friendly.
  • Model expressiveness: We went with a “deep and slim” setup. Research shows that deeper models with moderate width tend to generalize and reason better in small-to-mid-sized transformers (e.g., MobileLLM (Liu et al. 2024), SmolLM2 & 3 (Ben Allal et al. 2025, Bakouch et al. 2025), ModernBERT (Warner et al. 2024)).
  • Avoiding saturation: Studies like Pythia (Biderman et al. 2023) and Llama (Touvron et al. 2023) suggest that models above ~0.4B parameters keep benefiting from longer training without hitting early saturation.

Also, keeping our model around ~0.6B parameters lets us run experiments faster with our available compute, while keeping things fair compared to the Qwen baselines of similar size.

Infrastructure and Optimizations

Our training experiments were conducted on the Marvin HPC cluster at the University of Bonn. Marvin is a hybrid HPC system that features GPU partitions with NVIDIA A100-SXM4-80GB and NVIDIA A40-48GB GPUs, high-speed NVLink and InfiniBand NDR interconnects, a Lustre parallel file system for multi-petabyte storage, and a SLURM workload manager for job scheduling. All training experiments ran on the A100 nodes, while evaluations ran on the A40 nodes. Data preprocessing, filtering, and tokenization were performed on the cluster's CPU nodes. For our training runs, we used 2 nodes, each with 4 A100 GPUs, for a total of 8 GPUs/replicas for distributed training.

You can find our implementations and full code stack in this repository: Polygl0t/llm-foundry. All the tricks and optimizations we used to make the most of our training (we achieved an MFU of approximately 70% during our training runs!) are explicitly documented there.

Training Recipes

We set up two training approaches for our experiments:

  • Simple/Single-Stage (v1): This one uses only Bengali text, with a cosine learning rate decay and warmup. Think of it as a straightforward monolingual baseline to see what a vanilla setup can achieve.
  • Multi-Stage (v2): This uses a mix of Bengali and English text, trained in stages. Early stages balance Bengali and English educational content, while later stages upsample the highest-quality data.

Both recipes share some key settings to keep comparisons fair:

  • Total batch size: 2,097,152 tokens; micro batch: 262,144 (16 × 4 gradient accumulation steps).
  • Optimizer: AdamW with the same hyperparameters (max LR 7e-4, β₁=0.9, β₂=0.95, ε=1e-8).
  • Weight decay: 0.1, applied selectively (biases, layer norm weights, and token embeddings are excluded).
  • Gradient clipping: max norm of 1.0.
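The selective weight decay can be sketched as a name-based parameter grouping. The parameter names and matching keys below are illustrative; in a real setup you would collect `(name, tensor)` pairs from the model and hand the resulting groups to `torch.optim.AdamW`:

```python
def split_decay_groups(named_params, weight_decay=0.1):
    """Build AdamW param groups: weight decay on matrix weights only;
    biases, norm weights, and token embeddings are excluded."""
    no_decay_keys = ("bias", "norm", "embed")  # illustrative name patterns
    decay, no_decay = [], []
    for name in named_params:
        target = no_decay if any(k in name for k in no_decay_keys) else decay
        target.append(name)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
```

Excluding 1-D parameters (biases, norm gains) and embeddings from decay is a common convention in modern LLM training stacks; decaying them tends to hurt rather than regularize.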

We evaluate the models every 2,500 steps on the full benchmark suite to track progress across all tasks throughout training.

To pick the shared batch size (2,097,152 tokens) and max learning rate (7e-4), we leaned on heuristics from the DeepSeek LLM paper (2024). They propose scaling laws that link your compute budget ($C$) to optimal hyperparameters—basically, batch size and learning rate—both of which scale predictably with total compute. We estimated our compute budget using an adjusted version of DeepSeek's formula (a tweak of the standard PaLM formula, $C = 6ND$) that accounts for non-embedding FLOPs per token.

$$C = \left(72\, n_{\text{layer}}\, d_{\text{model}}^{2} + 12\, n_{\text{layer}}\, d_{\text{model}}\, \ell_{\text{seq}}\right) D$$

By plugging in our model's dimensions and the estimated dataset sizes for our two training recipes, we obtained the compute budget $C$ for our experiments. Using the DeepSeek scaling heuristics:

$$\text{Max Learning Rate} = 0.3118 \cdot C^{-0.125}$$

$$\text{Batch Size} = 0.2920 \cdot C^{0.3271}$$

We then derived the corresponding optimal hyperparameters for a common batch size of approximately 2 million tokens and a maximum learning rate of 7e-4. The batch size was rounded to the nearest power of two ($2^{21}$) for hardware efficiency.
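To make the derivation concrete, here's a small script that plugs our model dimensions and the ~230B multi-stage token count into the formulas above. The values are approximate sanity checks, not the exact accounting from our repo:

```python
def flops_per_token(n_layer=28, d_model=1536, seq_len=4096):
    """Non-embedding FLOPs per token (DeepSeek-style adjustment of C = 6ND)."""
    return 72 * n_layer * d_model**2 + 12 * n_layer * d_model * seq_len

def scaling_heuristics(compute_budget):
    """DeepSeek LLM scaling-law fits for peak LR and batch size (in tokens)."""
    max_lr = 0.3118 * compute_budget ** -0.125
    batch_size = 0.2920 * compute_budget ** 0.3271
    return max_lr, batch_size

# ~230B training tokens for the multi-stage recipe
C = flops_per_token() * 230e9   # roughly 1.6e21 FLOPs
lr, bs = scaling_heuristics(C)  # lr lands near 7e-4, bs in the low millions
```

The fitted batch size comes out slightly above 2 million tokens, which we rounded to the nearest power of two ($2^{21} = 2{,}097{,}152$) for hardware efficiency.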

Recipe 1: Simple (Single-Stage) Recipe

We started with a "simple" single-stage training recipe to test our ideas. This one uses only Bengali text to set a monolingual baseline—basically, to see if adding English in a multi-stage approach actually helps.

Since high-quality Bengali text is limited compared to English, this recipe focuses on depth and making the most of a data-constrained setup. Following Muennighoff et al. (2023), we repeated our Bengali dataset over 5 epochs, totaling around ~100 billion tokens. There's no staged curriculum here—the model just trains from scratch on the full shuffled dataset each epoch. We call it "simple" because it's a straightforward setup, similar to that used for the Pythia models (Biderman et al. 2023).

Using the $C = 6ND$ formula from Chowdhery et al. (2022), this training recipe needs about 3.6e20 FLOPs to finish.

Recipe 2: Multi-Stage Recipe

The second training approach we tried is a bit more involved—it's a multi-stage process that mixes Bengali with top-notch English text.

For the English part, we picked open datasets that are heavy on educational stuff, reasoning, and math problem-solving. Recent studies (e.g., Team OLMo, 2024, and Ben Allal et al., 2025) show that gradually introducing high-quality data can significantly boost a model's performance, especially for reasoning and knowledge-intensive tasks. Since our Bengali dataset doesn't have a lot of this type of content, we figured adding English datasets could help the model learn better across languages.

Summary of English and Bengali Datasets Used in Multi-Stage Recipe

| Dataset Name | Subset | Size (Tokens) |
|---|---|---|
| Polygl0t/gigakriya-v1 | All | 20B |
| | Edu Score of 1 | 5.87B |
| | Edu Score of 2 | 8.62B |
| | Edu Score of 3 | 4.25B |
| | Edu Score of 4 | 1.52B |
| | Edu Score of 5 | 5.50M |
| HuggingFaceFW/fineweb-edu | All | 49.29B |
| | Edu Score of 3 | 35.00B |
| | Edu Score of 4 | 14.22B |
| | Edu Score of 5 | 69.61M |
| HuggingFaceTB/finemath | All | 9.66B |
| | Edu Score of 4 | 8.59B |
| | Edu Score of 5 | 1.08B |
| HuggingFaceTB/smollm-corpus (Cosmopedia v2) | All | 25.0B |
| allenai/big-reasoning-traces | All | 2.44B |
| allenai/math-meta-reasoning-filtered | All | 1.24B |
| nvidia/OpenScience | All | 9.87B |

We set up a three-phase training plan using these datasets: (1) Warmup+Stable, (2) Stable, and (3) Stable+LinearDecay. Each phase mixes the data slightly differently, gradually giving more weight to higher-quality data.

We kept the Bengali-to-English ratio around 60:40 for most of the training to see how bilingual exposure affects the model, and in the final phase, we nudged Bengali up to 50%.

Language Proportions Across Stages in Multi-Stage Recipe

| Stage | Bengali (%) | English (%) |
|---|---|---|
| Warmup+Stable | 40 (~40B) | 60 (~60B) |
| Stable | 40 (~40B) | 60 (~60B) |
| Stable+LinearDecay | 50 (~15B) | 50 (~15B) |

We chose how often each portion of the data mixture is repeated during training so that the total training volume reaches ~230 billion tokens without over-sampling any particular subset.

Each stage corresponds to a specific phase of a "trapezoidal" learning rate schedule (Hägele et al. 2024), also referred to as Warmup-Stable-Decay (WSD) (Hu et al. 2024). This schedule comprises three distinct phases: (1) an initial warmup phase where the learning rate increases linearly from zero to the peak learning rate, (2) a stable phase where the learning rate is held constant at the peak value, and (3) a decay phase where the learning rate decreases linearly to a minimum value. Studies by Ben Allal et al. (2025) and Bakouch et al. (2025) have shown that this style of schedule promotes stable convergence and good generalization, especially in multi-stage training setups where there are no predetermined epoch boundaries.
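A WSD/trapezoidal schedule can be sketched in a few lines. The warmup and decay fractions below are illustrative placeholders, not the values we used:

```python
def wsd_lr(step, total_steps, peak_lr=7e-4, warmup_frac=0.01,
           decay_frac=0.15, min_lr=0.0):
    """Trapezoidal / Warmup-Stable-Decay learning-rate schedule:
    linear warmup -> constant plateau -> linear decay."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        # Phase 1: linear warmup from 0 to the peak learning rate
        return peak_lr * step / max(1, warmup_steps)
    if step < decay_start:
        # Phase 2: stable plateau at the peak learning rate
        return peak_lr
    # Phase 3: linear decay from the peak down to min_lr
    progress = (step - decay_start) / max(1, total_steps - decay_start)
    return peak_lr + (min_lr - peak_lr) * progress
```

One practical appeal of WSD is that the plateau can be extended indefinitely as more data becomes available; only the short decay phase needs to be re-run to produce a usable checkpoint.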

A more complete description of the data mixtures, training volumes, and learning rate schedules can be found in the Polygl0t/llm-foundry repository.

Results

Learning Curves and Stability

Let's start with the learning curves. Overall, both training recipes were stable — no divergence, no crashes, no need to roll anything back.

Learning Curves of Simple and Multi-Stage Training Recipes

learning_curves

Gradient Norm (L2) of Simple and Multi-Stage Training Recipes

gradient_statistics_v1

LilTii-v0.1 (Simple/Single-Stage Recipe)

gradient_statistics_v2

LilTii-v0.2 (Multi-Stage Recipe)

Benchmark Results

Below we show how our models performed across all nine benchmarks in our evaluation setup.

The big headline: our multi-stage model beats both Qwen baselines on several tasks — even though it was trained with way less compute (~8.28e20 FLOPs vs. ~1.29e23 FLOPs for Qwen3-0.6B). That's a pretty massive difference.

Here are some benchmarks where we see a clear win for LilTii:

🏆 HellaSwag

hellaswag

🏆 PIQA-BN

piqa-bn

🏆 ARC Challenge

arc_challenge

🏆 CommonsenseQA-BN

commonsenseqa-bn

🏆 OpenBookQA-BN

openbookqa-bn

On the flip side, Qwen still has the upper hand on MMLU, Bangla MMLU, BoolQ-BN, and TruthfulQA (although the gap is smaller on the last two).

BoolQ-BN

boolq-bn

MMLU

mmlu

TruthfulQA

truthfulqa_mc1

Bangla MMLU

bangla_mmlu

Our guess? Those benchmarks are extremely pretraining-hungry. As Gu et al. (2024) point out, some multi-choice benchmarks (especially MCF-style ones) tend to look almost random during early training. Models only start improving on them after a lot of pretraining. So it's not too surprising that with much lower compute, we don't dominate there yet.

Benchmark Results of LilTii Models vs Qwen Baselines (and other models)

| Model | NPM (normalized mean) | Bangla MMLU | BoolQ-BN | CommonsenseQA-BN | OpenBookQA-BN | PIQA-BN | ARC Challenge | MMLU | HellaSwag | TruthfulQA MC1 |
|---|---|---|---|---|---|---|---|---|---|---|
| LilTii v0.2 | 9.63 | 0.26 | 0.61 | 0.32 | 0.32 | 0.61 | 0.26 | 0.27 | 0.32 | 0.25 |
| Qwen2.5-1.5B | 9.36 | 0.35 | 0.67 | 0.23 | 0.3 | 0.53 | 0.23 | 0.31 | 0.29 | 0.29 |
| Qwen2.5-1.5B-Instruct | 8.74 | 0.35 | 0.67 | 0.24 | 0.28 | 0.52 | 0.23 | 0.31 | 0.29 | 0.28 |
| Gemma-3-1b-it | 8.3 | 0.31 | 0.56 | 0.31 | 0.32 | 0.57 | 0.25 | 0.28 | 0.3 | 0.28 |
| Qwen3-0.6B-Base | 8.07 | 0.33 | 0.66 | 0.23 | 0.31 | 0.53 | 0.23 | 0.3 | 0.29 | 0.27 |
| Titulm-llama-3.2-3b-v2.0 | 7.94 | 0.25 | 0.54 | 0.33 | 0.35 | 0.6 | 0.25 | 0.26 | 0.31 | 0.26 |
| Llama-3.2-1B-Instruct | 7.74 | 0.3 | 0.63 | 0.23 | 0.34 | 0.53 | 0.25 | 0.28 | 0.29 | 0.27 |
| Gemma-3-1b-pt | 7.73 | 0.25 | 0.57 | 0.32 | 0.32 | 0.58 | 0.25 | 0.27 | 0.3 | 0.27 |
| Gemma-2-2b | 7.73 | 0.32 | 0.6 | 0.28 | 0.33 | 0.56 | 0.24 | 0.25 | 0.28 | 0.25 |
| Qwen3-0.6B | 6.28 | 0.29 | 0.62 | 0.24 | 0.3 | 0.53 | 0.23 | 0.29 | 0.29 | 0.25 |
| Qwen2.5-0.5B | 6.04 | 0.31 | 0.58 | 0.22 | 0.31 | 0.54 | 0.23 | 0.28 | 0.28 | 0.28 |
| LilTii v0.1 | 5.82 | 0.24 | 0.52 | 0.3 | 0.33 | 0.59 | 0.24 | 0.26 | 0.3 | 0.24 |
| Llama-3.2-1B | 5.71 | 0.28 | 0.57 | 0.23 | 0.32 | 0.53 | 0.24 | 0.28 | 0.29 | 0.28 |
| Qwen2.5-0.5B-Instruct | 5.48 | 0.31 | 0.56 | 0.22 | 0.3 | 0.53 | 0.23 | 0.29 | 0.29 | 0.28 |
| BanglaLLama-3.2-1b-v0.0.1 | 4.28 | 0.26 | 0.53 | 0.24 | 0.31 | 0.53 | 0.24 | 0.27 | 0.29 | 0.28 |
| Titulm-llama-3.2-1b-v2.0 | 4.11 | 0.25 | 0.5 | 0.27 | 0.32 | 0.57 | 0.23 | 0.24 | 0.29 | 0.24 |
| BanglaLLama-3.2-3b-v0.0.3 | 2.86 | 0.33 | 0.54 | 0.2 | 0.29 | 0.5 | 0.24 | 0.25 | 0.26 | 0.24 |
| Goldfish-bengali-1000mb | 2.79 | 0.25 | 0.51 | 0.25 | 0.3 | 0.54 | 0.24 | 0.25 | 0.27 | 0.26 |

To make it easier to see how these benchmarks stack up, we can plot the NPM progress.

aggregate_npm

As one can see, by the time training reached ~230B tokens, LilTii v0.2 hit an NPM score of 9.63, well ahead of both Qwen baselines.
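For readers who want to sanity-check the aggregate number: a common way to compute a normalized mean like this is to rescale each accuracy so that random guessing maps to 0 and a perfect score to 100, then average across benchmarks. The sketch below follows that recipe; the per-benchmark random baselines are assumptions on our part (roughly 1/num_choices), so the result only approximates the reported 9.63:

```python
# Assumed random-guess baselines per benchmark. These are guesses
# (roughly 1/num_choices), not necessarily the exact values used in our harness.
BASELINES = {
    "bangla_mmlu": 0.25, "boolq_bn": 0.50, "commonsenseqa_bn": 0.20,
    "openbookqa_bn": 0.25, "piqa_bn": 0.50, "arc_challenge": 0.25,
    "mmlu": 0.25, "hellaswag": 0.25, "truthfulqa_mc1": 0.25,
}

def npm(scores: dict) -> float:
    """Normalized mean: rescale so random guessing = 0 and perfect = 100, then average."""
    vals = [max(0.0, (scores[k] - b) / (1.0 - b)) * 100 for k, b in BASELINES.items()]
    return sum(vals) / len(vals)

# LilTii v0.2's accuracies from the results table above
liltii_v02 = {
    "bangla_mmlu": 0.26, "boolq_bn": 0.61, "commonsenseqa_bn": 0.32,
    "openbookqa_bn": 0.32, "piqa_bn": 0.61, "arc_challenge": 0.26,
    "mmlu": 0.27, "hellaswag": 0.32, "truthfulqa_mc1": 0.25,
}
print(f"NPM(LilTii v0.2) ~ {npm(liltii_v02):.2f}")  # in the ballpark of the reported 9.63
```

The point of the normalization is that a model that guesses randomly gets an NPM of 0, so small absolute gains above chance (e.g. 0.61 vs. 0.50 on a binary task) are weighted fairly against each other.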

These results back up our main idea: a carefully designed multi-stage training approach that mixes Bengali with high-quality English can really boost performance on Bengali benchmarks—even using way less compute than huge multilingual models. That said, some tasks—especially ones needing lots of factual knowledge—still require that extra scale to shine. So while we can do better on some benchmarks with a smaller, more focused training approach, there are still areas where the best way to go is to build on top of a larger multilingual model, and bootstrap your way up from there.

Correlation Analysis: Do Benchmarks Reflect Early Progress?

Beyond raw scores, these results let us ask a more subtle question:

  • Do these benchmarks reliably reflect model improvement during early pretraining?

One simple way to check this is to look at correlation: as we train on more data, does performance on a benchmark steadily improve?

Ideally, we want a monotonic relationship — more training data → better benchmark performance. If that holds, the benchmark is a good progress signal (see Penedo et al. 2025).

Why does this matter? Because if we're training under tight data constraints, and we rely on benchmarks that behave randomly during early training, we might draw the wrong conclusions about (1) how capable the model actually is and (2) whether our training data is high quality.

To explore this, we computed the Spearman correlation between benchmark performance and the amount of training data used.
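Concretely, this boils down to a rank correlation between tokens seen and benchmark accuracy across checkpoints. Here's a minimal, dependency-free sketch (the checkpoint numbers are illustrative, not our actual runs; `scipy.stats.spearmanr` would give the same values):

```python
# Pure-Python Spearman correlation with average ranks for ties.
def average_ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Spearman rho = Pearson correlation of the rank vectors
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical checkpoint accuracies (illustrative numbers, not our real runs)
tokens_seen = [10e9, 50e9, 100e9, 150e9, 200e9, 230e9]
hellaswag_acc = [0.26, 0.27, 0.29, 0.30, 0.31, 0.32]  # steady improvement
mmlu_acc = [0.26, 0.25, 0.27, 0.25, 0.26, 0.27]       # noisy, near-random

rho_hs = spearman(tokens_seen, hellaswag_acc)
rho_mmlu = spearman(tokens_seen, mmlu_acc)
print(f"HellaSwag rho = {rho_hs:.2f}")  # a clean, monotone progress signal
print(f"MMLU rho = {rho_mmlu:.2f}")     # a much weaker signal
```

A rho near 1 means the benchmark improves almost every time we add data, which is exactly the "progress signal" property we want; values near 0 (or negative) mean the benchmark tells us little about early training.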

Spearman Correlation Between Benchmark Performance and Training Data Volume

| Benchmark | Spearman Correlation, v1 (Simple) | Spearman Correlation, v2 (Multi-Stage) |
|---|---|---|
| HellaSwag | 0.92 | 0.97 |
| PIQA-BN | 0.79 | 0.85 |
| ARC Challenge | -0.32 | 0.77 |
| BoolQ-BN | -0.52 | 0.74 |
| MMLU | -0.35 | 0.61 |
| CommonsenseQA-BN | 0.44 | 0.59 |
| TruthfulQA | -0.15 | 0.59 |
| OpenBookQA-BN | 0.12 | 0.52 |
| Bangla MMLU | -0.34 | 0.21 |

The first big takeaway: in v1 (simple recipe), correlations are sometimes inconsistent or even negative. In v2 (multi-stage), they become consistently positive — and often strong — across almost all benchmarks. This is probably because the multi-stage setup is longer, and thus gives the model more time to learn and show steady improvement on benchmarks that are sensitive to training data quality and quantity.

HellaSwag, PIQA-BN, and ARC Challenge show very high Spearman correlations in the multi-stage setup. Interestingly, these are also the benchmarks where we beat the Qwen baselines. That suggests these tasks are highly data-sensitive — they respond quickly to improvements in training — and they work well as early indicators of model quality. For smaller models or partially trained checkpoints, these benchmarks seem especially useful.

Another reassuring finding: adding high-quality English data didn't hurt Bengali benchmarks. If anything, cross-lingual transfer was neutral or beneficial.

For MMLU, TruthfulQA, and OpenBookQA-BN, correlations turn positive in v2 but remain moderate (0.61, 0.59, and 0.52). This partly supports the idea that these benchmarks need large-scale pretraining before they really move beyond near-random behavior. But the key difference is that in v2, we at least see a consistent upward trend — something that wasn't present in v1.

So even at smaller scale, multi-stage training helps establish real progress on harder tasks. Moreover, folding high-quality English data into a curriculum design can help in situations where language-specific data is limited and scale is constrained.
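To make the curriculum idea concrete, here's a sketch of what a stage-based data-mixing schedule can look like. The stage boundaries and Bengali/English ratios below are illustrative placeholders, not our exact recipe:

```python
# Illustrative multi-stage mixing schedule: (tokens_until, {source: sampling weight}).
# These boundaries and ratios are placeholders, NOT the actual LilTii recipe.
STAGES = [
    (100e9, {"bengali_web": 0.7, "english_hq": 0.3}),       # stage 1: broad coverage
    (200e9, {"bengali_filtered": 0.6, "english_hq": 0.4}),  # stage 2: tighter filtering
    (230e9, {"bengali_filtered": 0.8, "english_hq": 0.2}),  # stage 3: Bengali-heavy anneal
]

def mixture_at(tokens_seen: float) -> dict:
    """Return the sampling weights in effect after `tokens_seen` training tokens."""
    for boundary, mix in STAGES:
        if tokens_seen < boundary:
            return mix
    return STAGES[-1][1]  # past the last boundary, stay on the final mix

print(mixture_at(50e9))   # stage 1 weights
print(mixture_at(225e9))  # stage 3 weights
```

The data loader then samples each batch according to the active weights, so the model sees a gradually shifting distribution instead of one static mix for the whole run.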

Generation Examples

To get a feel for what our LilTii models can actually generate, we've put together some sample outputs in the table below.

Generation Examples from LilTii Models and Qwen Baselines

Model Prompt Generated Completion
Qwen/Qwen2.5-0.5B বাংলাদেশের রাজধানী হলো মর্যাদা শুরু আনুষ্ঠানিক পরিচালক, যা নির্মাণের সাথে দেবী পর্যায়ন করেন। রাজধানী বিশিষ্ট ইতিবাচ্ছিত পরিচালক সময় হলো ফোনে শুরু হতে চেয়ে চেয়ে চেয়ে চেয়ে চেয়েছিল। রাজধানী অবস্থিত খালিকা হলো মর্যাদার রাজধানীর খালিকা। রাজধানী অবস্থ ...
বাংলাদেশের প্রধান নদী হলো সিনামূল্যান্ড গ্রুপ নদী। সিনামূল্যান্ড গ্রুপ নদী ছিলেন যে সিনামূল্যান্ড গ্রুপ নদী ছিলেন ও সিনামূল্যান্ড গ্রুপ নদী ছিলেন নিশ্বনের প্রধান নদী। সিনামূল্যান্ড গ্রুপ নদী ছিলেন যে সিনামূল্যান্ড গ্রুপ নদী ছিলেন ও সিনামূল্যান্ড গ্রুপ নদ ...
Qwen/Qwen3-0.6B বাংলাদেশের রাজধানী হলো কোনটি?\n\nবাংলাদেশের রাজধানী হলো মোহাম্মাদ পূর্বাবৃহক। এই জেলায় বিশ্বের বিভিন্ন প্রাচীন জাতীয় শাসকদ্বয় যথা পুরুষ হামীদ এবং গ্রস্ত জীবী আসমান বা মোহাম্মাদ রয়েছে। আসমান এদেশের প্রথম হামীদীয় শাসক, আসমান এর ছোট হামীদীয় উপজেলা ...
বাংলাদেশের প্রধান নদী হলো কী?\nA. সূর্য নদী\nB. দিল্লি নদী\nC. গুম্বার নদী\nD. মহাদেব নদী\nAnswer Choices: (A) A & B only (B) A & C only (C) B & C only (D) All the above\nThe correct answer is:\n\n**D. All the above**\n\nExplanation:\n- **B. দিল্লি নদী** is a river in Bangladesh.\n- **C. গুম্বার নদী** is also a river in Bangladesh.\n- **D. মহাদেব নদী** is a river in India, not Bangladesh.\nSo, all the rivers mentioned (B, C, and D) are in Bangladesh ...
LilTii (v1) বাংলাদেশের রাজধানী হলো ঢাকা। কিন্তু এই ঢাকার একটি নাম আছে, যেটির সাথে জড়িয়ে রয়েছে অনেক ইতিহাস ও ঐতিহ্য । ঢাকা বাংলাদেশের সবচেয়ে বড় শহর এবং এটি দক্ষিণ এশিয়ার মধ্যে দ্বিতীয় বৃহত্তম নগরী হিসেবে পরিচিত৷ এর আয়তন ১ লক্ষ ৪৭ হাজার ৫ শত বর্গকিলোমিটার যা প্রায় বাংলাদেশর সমান ৷ এখানে বসবাস করে বিশ্বের বিভিন্ন দেশের মানুষ তবে বেশিরভাগই আসে পৃথিবীর অন্যান্য দেশ থেকে যেমন ভারত পাকিস্তান নেপাল শ্রীলঙ্কা ইত্যাদি দেশগুলো হতে আগত মানুষের সংখ্যা বেশি হয়ে থাকে তাই বলা যায় এদেশে প্রচুর পরিমাণে বিদেশী নাগরিকের আগমন ঘটেছে যারা এদেশের শিক্ষা সংস্কৃতিতে অবদান রেখে চলেছে প্রতিনিয়ত যার ফলে আমাদের দেশে গড়ে উঠেছে অসংখ্য বিশ্ববিদ্যালয় যেখানে উচ্চ শিক্ষার জন্য বিদেশীরা এসে পড়াশোনা করছে আর এজন্যেই হয়তোবা একে ‘বিশ্ববিদ্যালয়' নামে ডাকা হয় কেননা এখানকার প্রতিটি শিক্ষার্থীর মাঝে মিশে গিয়েছে নিজ মাতৃভাষার প্রতি ভালোবাসা যেটা তাদের একাডেমিক পড়াশোনার ক্ষেত্রে প্রভাব ফেলে বলে আমি মনে করি কারণ তারা জানে বাংলা ভাষাটা কতটা গুরুত্বপূর্ণ একটা বিষয় সেখানে যদি কোনো ...
বাংলাদেশের প্রধান নদী হলো পদ্মা ও ব্রহ্মপুত্র। নদীর উৎপত্তিস্থল হিমালয় পর্বতমালার কৈলাশ শৃঙ্গের কাছে তিব্বতের মানস সরোবর হ্রদ থেকে । আর বাংলাদেশে প্রবেশ করে ফেনী জেলার মুহুরী নামে কুমিল্লা জেলায় এসে তিতাস নাম ধারণ করেছে এই দুই উপনদীই মিলিত হয়ে মেঘনা তৈরি হয়েছে এবং এর প্রবাহে রয়েছে অসংখ্য ছোট-বড় খাল বিল হাওর বাওড় এমনকি গ্রাম নগর জনপদ, শিল্প কারখানা সবই আছে এ নদীতে।। মানচিত্রের মাধ্যমে বাংলাদেশের নদ -নদীর অবস্থান দেখানো হল: ...
LilTii (v2) বাংলাদেশের রাজধানী হলো ঢাকা। বাংলাদেশের বিভাগীয় শহরগুলোর মধ্যে অন্যতম হলো খুলনা, রাজশাহী ও চট্রগ্রাম ।\n- বাংলাদেশ এর মোট আয়তনের প্রায় ৫৬ শতাংশই সমুদ্র সমতল থেকে মাত্র ১ মিটার (৩ ফুট) উচ্চতায় অবস্থিত এবং উত্তর পূর্ব অংশ জুড়ে রয়েছে ভারতের সাথে সীমান্ত যা দেশের চার ভাগের একভাগ এলাকা দখল করেছে| অন্য তিন দিকে স্থল বেষ্টিত হওয়ায় এই অংশের ভূপ্রকৃতি মূলত পাহাড়ী অঞ্চলের মত উঁচু নিচু ভূমির উপর গড়ে উঠেছে যেখানে গাছপালা খুব কম ফলে দিনের বেশিরভাগ সময় সূর্যের আলো থাকে না বললেই চলে৷ বঙ্গোপসাগর উপকূলে বিস্তৃত উপকূলীয় বনভূমি আছে যার বেশির ভাগ ম্যানগ্রোভ জাতীয়; এগুলো ঝড় প্রতিরোধ করতে পারে বলে ধারণা করা হয় ৷ সুন্দরবনকে ১৯৮৭ সালে ইউনেস্কো বিশ্ব ঐতিহ্যবাহী স্থান হিসেবে ঘোষণা করে ...
বাংলাদেশের প্রধান নদী হলো পদ্মা। নদীর উৎপত্তি হিমালয় পর্বতে এবং এর দৈর্ঘ্য ১,৫০০ কিলোমিটার (৯৩৫ মা)। এটি বাংলাদেশের উপর দিয়ে প্রবাহিত হয়ে বঙ্গোপসাগরে গিয়ে মিশেছে।[২] পদ্মা ও যমুনার মিলিত প্রবাহ পদ্মার নাম পেয়েছে বলে ধারণা করা হয়; যদিও এই মিলনের সঠিক প্রমাণ পাওয়া যায়নি[৩][৪], তবু পণ্ডিতদের অনুমান এটুকু যে পূর্ব-পশ্চিমদিকে গতিশীল যমুনা ছিল একটি একক বৃহৎ স্রোত যা দক্ষিণ দিকে অগ্রসর হতে হতে গঙ্গা নদীতে এসে পড়েছিল বলেই এটির নামকরণ হয়েছিল 'যমুনা' নামে । অন্যদিকে গ্রিক পুরাণ মতে দেবী রেমেফিসের পুত্রের বংশধর হিসাবে আদিগঙ্গার তীরে গড়ে ওঠা এক আর্য জনজাতির উপনিবেশ থেকে জন্ম নিয়েছিল ‘আর্যান' বা পুণ্যতোয়া হিসেবে খ্যাত গঙ্গার অপরূপা ধারাটি - যার শাখা প্রশাখার সমন্বয়ে গঠিত হয়েছে বর্তমান কালিন্দী৷ এই দুই ধারার মিলনস্থলটিই আজকের বাংলাদেশে অবস্থিত| বাংলাদেশ অংশে পদ্মায় পানির গড় গভীরতা ৫.৭৮ মিটার অথবা ১৮ ফুটের সামান্য বেশি হলেও ভারত বিভাগের পর থেকেই পলি জমে ক্রমশ তা হ্রাস পেয়ে আসছে – এখন প্রায় ৩ মিঃ অর্থাৎ ১০ ফুট পর্যন্ত নিচে নেমে গেছে ৷ তাই বর্তমানে বর্ষাকালে পানি থাকে মাত্র ৩০০ সেমি.(১ গজ)এর মতো! ফলে তখনকার বিখ্যাত প্রম ...

Looking at the examples in the table, you can really see how differently the models handle text generation. The smaller Qwen models (0.5B–0.6B) often go off the rails, producing repetitive or nonsensical text that never really completes the prompts. Qwen3-0.6B is an interesting case: it sticks to a structured Q&A format with multiple-choice answers and explanations, which helps explain why it does well on benchmarks like MMLU and Bangla MMLU that use that style.

The LilTii models, on the other hand, show big improvements. They're more factually accurate, stay on topic, and produce coherent narratives. LilTii v1 tends to be a bit wordy but rich in context, while LilTii v2 is more concise and still accurate—like correctly naming “ঢাকা” and “পদ্মা.” Overall, these examples show how far we've come, from messy, low-resource outputs to fluent, content-aware generations.

Compute vs. Performance

It's also worth weighing how many resources a model consumes against how well it performs. This includes things like training time, energy use, and even carbon emissions. To track energy and emissions, we used the CodeCarbon library (Courty et al. 2024) during training. The table below sums up training time, energy consumption, CO₂ emissions, and other compute stats for both versions of our LilTii models.

Training Resource Consumption for LilTii Models

| Model | Total Duration (hours) | GPU-hours | Energy consumed (kWh) | CO₂ emitted (kgCO₂eq) | Total Compute (FLOPs) |
|---|---|---|---|---|---|
| LilTii (v1) | ~93 | ~748 | 383.77 | 146.20 | ~3.6e20 |
| LilTii (v2) | ~214.66 | ~1712 | 874.37 | 333.09 | ~8.28e20 |
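These FLOP figures line up with the standard C ≈ 6ND approximation (Kaplan et al. 2020): roughly 6 FLOPs per parameter per token of training. A quick sanity check, plugging in our ~230B training tokens and Qwen3's reported ~36T:

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the 6ND rule of thumb (Kaplan et al. 2020)."""
    return 6.0 * n_params * n_tokens

# 0.6B parameters each; token counts from this post (~230B) and Qwen3's report (~36T)
liltii_v2 = train_flops(0.6e9, 230e9)
qwen3 = train_flops(0.6e9, 36e12)
ratio = qwen3 / liltii_v2

print(f"LilTii v2:  {liltii_v2:.2e} FLOPs")  # ~8.3e20, matching the table above
print(f"Qwen3-0.6B: {qwen3:.2e} FLOPs")      # ~1.3e23
print(f"compute ratio: ~{ratio:.0f}x")
```

Rounding the published FLOP counts before dividing is where the ~156× figure quoted in this post comes from.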

The figure below shows how training compute (in FLOPs) relates to performance (NPM) for our LilTii models versus the Qwen baselines.

Performance vs Compute for LilTii Models vs Qwen Baselines

performance_vs_compute

In short, our LilTii models match or beat the Qwen baselines while using far less compute. For example, our multi-stage v2 model hits a 9.63 NPM score using about 8.28e20 FLOPs, while Qwen3-0.6B-Base reaches just 8.07 with roughly 1.29e23 FLOPs. To put it in perspective, Qwen3-0.6B uses about 156 times more compute than our multi-stage model and still scores lower. If this approach transfers to other low-resource languages, it suggests that for the compute budget of one big multilingual model like Qwen3-0.6B, you could train ~156 smaller, language-specific models, each with better benchmark results for its language.

Conclusion

LilTii shows that you don't need massive compute to build a competitive language model for a low-resource language (at least at a very small scale!). By combining careful corpus curation with a multi-stage, curriculum-style training recipe that blends Bengali and high-quality English data, we trained a 0.6B-parameter model that outperforms similarly sized multilingual baselines like Qwen2.5-0.5B and Qwen3-0.6B on our Bengali evaluation suite, while using roughly 156× less compute.

That said, there are real limitations worth being upfront about. Our benchmark suite is still limited in scope—several tasks are translated from English, which may not fully capture Bengali-specific language understanding. On knowledge-intensive tasks like Bangla MMLU and MMLU, LilTii still falls short of larger multilingual models, reflecting the fundamental challenge of packing world knowledge into a small, compute-constrained model. Additionally, while our training data was filtered for toxicity, no pipeline is perfect, and some noise likely remains. Finally, LilTii is a base model: it hasn't been instruction-tuned or aligned, which limits its usefulness as a practical assistant out of the box.

Looking ahead, there are several promising directions. The most natural next step is post-training—supervised fine-tuning and preference optimization—to turn LilTii into a useful Bengali assistant. Scaling up the corpus (we're already past the knowledge cut-off of August 2025) and exploring larger model sizes would help address the knowledge gap we see on MMLU-style benchmarks. It would also be worth investigating how much better we can get if we abandon the pretraining approach entirely and directly perform continual pretraining on top of a stronger base model like Qwen3-0.6B, using a similar multi-stage curriculum design. This would, of course, introduce different trade-offs and challenges, but we are happy to keep exploring and sharing our findings.

Resources

All of our models, datasets, and code are publicly available. If you want to build on this work—whether to extend the corpus, fine-tune LilTii, or adapt the pipeline for another language—you can find everything in our Hugging Face collection:

👉 collections/Polygl0t/liltii

You can also find all the source code used for this lil project in:

👉 github.com/Polygl0t

Acknowledgments

LilTii was developed as part of Polyglot (Polygl0t). The methodology and findings presented here extend to additional language-specific studies conducted within the same framework, including Portuguese (e.g., the Tucano 2 series) and Hindi (e.g., LilMoo). For further details on these parallel efforts and associated resources, please refer to the Polyglot project page: huggingface.co/Polygl0t.

Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.

We also gratefully acknowledge the granted access to the Marvin cluster hosted by University of Bonn along with the support provided by its High Performance Computing & Analytics Lab.

Citation

@misc{fatimah2026liltii,
  title={{LilTii: A 0.6B Bengali Language Model that Outperforms Qwen}},
  author={Shiza Fatimah and Aniket Sen and Sophia Falk and Florian Mai and Lucie Flek and Nicholas Kluge Corr{\^e}a},
  year={2026},
  howpublished={\url{https://hf.co/blog/Polygl0t/liltii}}
}
