You're probably training on outdated Wikipedia data right now and don't know it. 💡
In June last year, a friend from the Moroccan Wikipedia community slid into my DMs: "Are you using the current version? The official dataset is severely outdated. We added so many articles nowhere to be found on HuggingFace."
He was right. I was running a 2023 snapshot. In 2025. The official Wikipedia dataset, the one hundreds of labs and researchers grab by default without a second thought, was frozen in time.
• For English, that's 700,000 missing articles.
• For Moroccan Arabic, 30% of the language's entire Wikipedia.
• For 31 other languages, there was literally no text corpus at all until recently.
I could've shrugged and moved on. Instead, I spent the following months building a monthly automated pipeline for 340+ languages, on my personal laptop, nearly killing it several times in the process (100% disk, frozen screen, the works).
Nous Research trained Hermes 4 on it. INRIA cited it. It's now three years ahead of what most people are training on.
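If you want to check what you're actually training on, pulling a fresh snapshot takes a couple of lines. Here's a minimal sketch using the Hugging Face datasets library; the dataset ID, config name, and field names are my best guess, so verify them against the dataset card:

```python
from datasets import load_dataset

# Assumed dataset ID and config ("latest.en" = most recent English snapshot);
# check the dataset card on HuggingFace for the exact names.
wiki = load_dataset("omarkamali/wikipedia-monthly", "latest.en", split="train")

# Assumed fields: each row should carry at least a title and the article text.
print(wiki[0]["title"])
print(len(wiki), "articles in this snapshot")
```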
Here's the full story of how I built Wikipedia Monthly 👇
https://omarkamali.com/blog/wikipedia-monthly-pipeline