Introducing TrGLUE and SentiTurca: A Comprehensive Benchmark for Turkish General Language Understanding and Sentiment Analysis Paper • 2512.22100 • Published 8 days ago • 2
Bolmo: Byteifying the Next Generation of Language Models Paper • 2512.15586 • Published 17 days ago • 13
FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition Paper • 2512.13884 • Published 19 days ago • 14
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence Paper • 2511.18538 • Published Nov 23, 2025 • 281
Article Transformers v5: Simple model definitions powering the AI ecosystem Dec 1, 2025 • 263
Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining Paper • 2511.21613 • Published Nov 26, 2025 • 2
Article Building for an Open Future - our new partnership with Google Cloud Nov 13, 2025 • 47
Sample-Efficient Language Modeling with Linear Attention and Lightweight Enhancements Paper • 2511.05560 • Published Nov 4, 2025 • 1
Pre-training Dataset Samples Collection A collection of pre-training dataset samples in sizes of 10M, 100M, and 1B tokens. Ideal for quick experimentation and ablations. • 19 items • Updated 10 days ago • 18
Article The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix Nov 3, 2025 • 53
BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data Paper • 2510.10159 • Published Oct 11, 2025 • 3
Gaperon: A Peppered English-French Generative Language Model Suite Paper • 2510.25771 • Published Oct 29, 2025 • 15