payelb/aligned_tinyllama_ultrafeedback_fixed1k_won

+---
+license: apache-2.0
+tags:
+- trl
+- ppo
+- lora
+- alignment
+- reward-modeling
+- ultrafeedback
+base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
+---
+# Aligned TinyLlama on UltraFeedback (fixed-1k prompt pool)
+This model is `TinyLlama/TinyLlama-1.1B-Chat-v1.0` aligned with **TRL PPO**, with rewards from the following reward model:
+- **payelb/UltraFeedback_openbmb_deberta_1k_fixed_WoN** (tag: `won`)
+
+Key settings:
+- Prompt pool: restricted to the same fixed 1k-prompt subset used for reward model (RM) training, loaded from CSV
+- PPO updates: 200
+- Batch size: 4
+- Learning rate: 1e-5
+- LoRA: r=16, alpha=32, dropout=0.05
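
The settings above imply two derived quantities that may help when comparing runs. The sketch below collects the listed values in a small dataclass and computes them. The class and method names are hypothetical and not part of the training code. The pool size of exactly 1000 and the idea that each PPO update consumes one batch of prompts are assumptions, not statements from this card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentSettings:
    """Values copied from the 'Key settings' list; names are illustrative."""

    prompt_pool_size: int = 1000  # "fixed 1k" subset; exact row count is assumed
    ppo_updates: int = 200
    batch_size: int = 4
    learning_rate: float = 1e-5
    lora_r: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05

    def lora_scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / r.
        return self.lora_alpha / self.lora_r

    def prompts_sampled(self) -> int:
        # Assumption: each PPO update consumes exactly one batch of prompts.
        return self.ppo_updates * self.batch_size

    def lora_kwargs(self) -> dict:
        # Keys follow the keyword names used by peft.LoraConfig.
        return {
            "r": self.lora_r,
            "lora_alpha": self.lora_alpha,
            "lora_dropout": self.lora_dropout,
        }


settings = AlignmentSettings()
print(settings.lora_scaling())     # 2.0
print(settings.prompts_sampled())  # 800
```

Under these assumptions the run samples 800 prompts, which is fewer than one full pass over the 1k prompt pool, and the LoRA update is scaled by a factor of 2.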