Running • Featured • 69 • QED-Nano: Teaching a Tiny Model to Prove Hard Theorems • Who needs 1T parameters? Olympiad proofs with a 4B model
Running • 3.75k • The Ultra-Scale Playbook • The ultimate guide to training LLMs on large GPU clusters
Pramodith/bert-hybrid-sparse-sliding-window-attention • Fill-Mask • 0.1B • Updated Dec 21, 2023 • 1