I do mostly CPT. Should I convert my dataset to SFT and include longer reasoning as well, so the answers have some integrity?
Example: Is the yolk of an egg more beneficial or the white? Answer in 100 words.
Answer: Yolk is more beneficial because ..........
Example: Is the yolk of an egg more beneficial or the white? Answer in 500 words.
Answer: White is more beneficial because ..........
Edit: These happen at temp = 0.0.
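(A minimal sketch of how this kind of check can be reproduced, assuming a Transformers-style checkpoint; the model path is a placeholder and greedy decoding stands in for temp = 0.0.)

```python
# Rough reproduction of the consistency check; model path is a placeholder.
# do_sample=False means greedy decoding, i.e. the temp = 0.0 behaviour above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/my-cpt-checkpoint"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

for n_words in (100, 500):
    prompt = (
        f"Is the yolk of an egg more beneficial or the white? "
        f"Answer in {n_words} words.\nAnswer:"
    )
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=700, do_sample=False)
    answer = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print(n_words, "->", answer.split(".")[0])  # compare the verdict in the first sentence
```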
I think SFT would help a lot as you suspected.
The way I see it, it's actually succeeding at what CPT is good at (pattern matching). Meaning, somewhere in the dataset there is data that actually favors White over Yolk, and somewhere else in your data Yolk is preferred over White. It doesn't even have to be that obviously defined; it could be indirect.
So what I think happens is this:
Short question (100 words) ===> Matches patterns from Q&A sites and FAQ sections (just as an example) ===> This data says Yolk wins
Long question (500 words) ===> Matches patterns from blog posts and academic articles (also just examples) ===> This data says White wins
So besides cleaning up the data, which is really kind of out of scope because you'd be babysitting your data for every possible length/answer, I think SFT will help.
With SFT it doesn't just learn the patterns but also what humans prefer, which is consistency across lengths. It's basically statistical correlation with CPT vs. behavioral alignment with SFT.
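For example, here's a minimal sketch of what the converted SFT data could look like, with the verdict pinned across requested lengths (the chat-style "messages" layout and the file name are just illustrative, not tied to any particular trainer):

```python
# Illustrative only: turn the yolk/white example into SFT pairs that stay
# consistent across requested answer lengths. The "messages" layout is one
# common chat-style SFT format, not a requirement.
import json

question = "Is the yolk of an egg more beneficial or the white?"
# Same conclusion at every length; only the amount of detail changes.
answers = {
    100: "The yolk is more beneficial because ...",  # ~100-word reasoning
    500: "The yolk is more beneficial because ...",  # ~500-word reasoning
}

with open("sft_consistency.jsonl", "w") as f:
    for n_words, answer in answers.items():
        record = {
            "messages": [
                {"role": "user", "content": f"{question} Answer in {n_words} words."},
                {"role": "assistant", "content": answer},
            ]
        }
        f.write(json.dumps(record) + "\n")
```

The point is just that the preferred verdict is fixed across lengths, so the tuning signal rewards consistency instead of whichever pattern the length cue happens to match.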
There's also a thing called attention drift that you may want to look into; it can be helpful.
Unlike standard llama.cpp quantizations, which rely on fixed type heuristics (e.g., Q4_K_M), the Target BPW approach optimizes per-tensor precision where it matters most and produces high-quality models that meet a precise global file-size target.
Key Advantages:
- VRAM Maximization: Can generate high-quality models sized exactly to fit hardware constraints (e.g., fitting the model into exactly 24 GB of VRAM).
- Data-Driven Precision: The quantization mix is determined by actual weight error sensitivity rather than hardcoded rules, often yielding better PPL/KLD vs. size trade-offs (a toy sketch of the idea is at the end of this post).
Full benchmarks (PPL, KLD, ARC, MMLU, etc.) and methodology are in the model cards:
eaddario/Apriel-1.6-15b-Thinker-GGUF
eaddario/GLM-4.6V-Flash-GGUF
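As a rough mental model (purely illustrative, not the actual Target BPW or llama.cpp implementation; the tensor names, bit-width levels, and error numbers below are made up), the allocation can be pictured as a greedy, knapsack-style loop: start every tensor at the cheapest precision and keep upgrading whichever tensor buys the most error reduction per extra byte until the size budget is spent.

```python
# Toy sketch of target-size quantization allocation, NOT the actual code:
# start every tensor at the cheapest precision, then greedily upgrade the
# tensor whose estimated quantization error falls the most per extra byte,
# until the global file-size budget is exhausted.
from dataclasses import dataclass

@dataclass
class TensorInfo:
    name: str
    n_params: int
    error_at_bpw: dict  # hypothetical error estimate per candidate bit width

def allocate(tensors, budget_bytes, bpw_levels=(2.5, 4.5, 6.5)):
    choice = {t.name: bpw_levels[0] for t in tensors}
    size = sum(t.n_params * bpw_levels[0] / 8 for t in tensors)
    while True:
        best = None
        for t in tensors:
            cur = choice[t.name]
            nxt = next((b for b in bpw_levels if b > cur), None)
            if nxt is None:
                continue  # already at the highest precision
            extra_bytes = t.n_params * (nxt - cur) / 8
            if size + extra_bytes > budget_bytes:
                continue  # this upgrade would blow the size target
            gain = t.error_at_bpw[cur] - t.error_at_bpw[nxt]
            if best is None or gain / extra_bytes > best[0]:
                best = (gain / extra_bytes, t, nxt, extra_bytes)
        if best is None:
            return choice  # no affordable upgrade left
        _, t, nxt, extra_bytes = best
        choice[t.name] = nxt
        size += extra_bytes

# Example: a sensitive tensor ends up at higher precision than a tolerant one,
# while the total stays under the budget (e.g. a fixed VRAM target).
example = [
    TensorInfo("attn_q", 50_000_000, {2.5: 0.09, 4.5: 0.02, 6.5: 0.004}),
    TensorInfo("ffn_up", 200_000_000, {2.5: 0.03, 4.5: 0.01, 6.5: 0.006}),
]
print(allocate(example, budget_bytes=150_000_000))
```

In the real thing the per-tensor error estimates would come from measured weight/activation sensitivity over calibration data, which is the "data-driven" part; the sketch only shows the shape of the size-vs-quality trade-off.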