I do mostly CPT. Should I convert my dataset to SFT and include longer reasoning as well, so the answers have some integrity?
Example: Is the yolk of an egg more beneficial or the white? Answer in 100 words.
Answer: Yolk is more beneficial because ..........
Example: Is the yolk of an egg more beneficial or the white? Answer in 500 words.
Answer: White is more beneficial because ..........
Edit: These happen at temp = 0.0.
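(A minimal sketch of how this kind of check can be reproduced, assuming a Transformers-style checkpoint; the model path is a placeholder and greedy decoding stands in for temp = 0.0.)

```python
# Rough reproduction of the consistency check; model path is a placeholder.
# do_sample=False means greedy decoding, i.e. the temp = 0.0 behaviour above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/my-cpt-checkpoint"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

for n_words in (100, 500):
    prompt = (
        f"Is the yolk of an egg more beneficial or the white? "
        f"Answer in {n_words} words.\nAnswer:"
    )
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=700, do_sample=False)
    answer = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print(n_words, "->", answer.split(".")[0])  # compare the verdict in the first sentence
```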
I think SFT would help a lot as you suspected.
The way I see it, it's actually succeeding at what CPT is good at (pattern matching). Meaning, somewhere in the dataset there is data that actually favors White over Yolk, and somewhere else in your data Yolk is preferred over White. It doesn't even have to be that obviously defined; it could be indirect.
So what I think happens is this:
Short question (100 words) ===> Matches patterns from Q&A sites and FAQ sections (just as an example) ===> This data says Yolk wins
Long question (500 words) ===> Matches patterns from blog posts and academic articles (also just examples) ===> This data says White wins
So besides cleaning up the data, which is really kind of out of scope because you'd be babysitting your data for every possible length/answer, I think SFT will help.
With SFT it doesn't just learn the patterns but also what humans prefer, which is consistency across lengths. It's basically statistical correlation with CPT vs. behavioral alignment with SFT.
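For example, here's a minimal sketch of what the converted SFT data could look like, with the verdict pinned across requested lengths (the chat-style "messages" layout and the file name are just illustrative, not tied to any particular trainer):

```python
# Illustrative only: turn the yolk/white example into SFT pairs that stay
# consistent across requested answer lengths. The "messages" layout is one
# common chat-style SFT format, not a requirement.
import json

question = "Is the yolk of an egg more beneficial or the white?"
# Same conclusion at every length; only the amount of detail changes.
answers = {
    100: "The yolk is more beneficial because ...",  # ~100-word reasoning
    500: "The yolk is more beneficial because ...",  # ~500-word reasoning
}

with open("sft_consistency.jsonl", "w") as f:
    for n_words, answer in answers.items():
        record = {
            "messages": [
                {"role": "user", "content": f"{question} Answer in {n_words} words."},
                {"role": "assistant", "content": answer},
            ]
        }
        f.write(json.dumps(record) + "\n")
```

The point is just that the preferred verdict is fixed across lengths, so the tuning signal rewards consistency instead of whichever pattern the length cue happens to match.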
There's also a thing called attention drift that you may want to look into; it can be helpful.
Unlike standard llama.cpp quantizations, which rely on fixed type heuristics (e.g., Q4_K_M), the Target BPW approach optimizes per-tensor precision where it matters most and produces high-quality models that meet a precise global file-size target.
Key Advantages:
- VRAM Maximization: Can generate high-quality models sized exactly to fit hardware constraints (e.g., fitting the model into exactly 24 GB of VRAM).
- Data-Driven Precision: The quantization mix is determined by actual weight error sensitivity rather than hardcoded rules, often yielding better PPL/KLD vs. size trade-offs (a toy sketch of the idea is at the end of this post).
Full benchmarks (PPL, KLD, ARC, MMLU, etc.) and methodology are in the model cards:
eaddario/Apriel-1.6-15b-Thinker-GGUF
eaddario/GLM-4.6V-Flash-GGUF
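As a rough mental model (purely illustrative, not the actual Target BPW or llama.cpp implementation; the tensor names, bit-width levels, and error numbers below are made up), the allocation can be pictured as a greedy, knapsack-style loop: start every tensor at the cheapest precision and keep upgrading whichever tensor buys the most error reduction per extra byte until the size budget is spent.

```python
# Toy sketch of target-size quantization allocation, NOT the actual code:
# start every tensor at the cheapest precision, then greedily upgrade the
# tensor whose estimated quantization error falls the most per extra byte,
# until the global file-size budget is exhausted.
from dataclasses import dataclass

@dataclass
class TensorInfo:
    name: str
    n_params: int
    error_at_bpw: dict  # hypothetical error estimate per candidate bit width

def allocate(tensors, budget_bytes, bpw_levels=(2.5, 4.5, 6.5)):
    choice = {t.name: bpw_levels[0] for t in tensors}
    size = sum(t.n_params * bpw_levels[0] / 8 for t in tensors)
    while True:
        best = None
        for t in tensors:
            cur = choice[t.name]
            nxt = next((b for b in bpw_levels if b > cur), None)
            if nxt is None:
                continue  # already at the highest precision
            extra_bytes = t.n_params * (nxt - cur) / 8
            if size + extra_bytes > budget_bytes:
                continue  # this upgrade would blow the size target
            gain = t.error_at_bpw[cur] - t.error_at_bpw[nxt]
            if best is None or gain / extra_bytes > best[0]:
                best = (gain / extra_bytes, t, nxt, extra_bytes)
        if best is None:
            return choice  # no affordable upgrade left
        _, t, nxt, extra_bytes = best
        choice[t.name] = nxt
        size += extra_bytes

# Example: a sensitive tensor ends up at higher precision than a tolerant one,
# while the total stays under the budget (e.g. a fixed VRAM target).
example = [
    TensorInfo("attn_q", 50_000_000, {2.5: 0.09, 4.5: 0.02, 6.5: 0.004}),
    TensorInfo("ffn_up", 200_000_000, {2.5: 0.03, 4.5: 0.01, 6.5: 0.006}),
]
print(allocate(example, budget_bytes=150_000_000))
```

In the real thing the per-tensor error estimates would come from measured weight/activation sensitivity over calibration data, which is the "data-driven" part; the sketch only shows the shape of the size-vs-quality trade-off.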