Used quants, but model is not recognized to support tools, though it does
I don't know details. I have used iquants from mrradermacher
Maybe some settings are needed.
Hello JLouisBiz, thank you for contacting us!
It's likely that mrradermacher's quantized versions aren't enabling tool use for this model.
Which platform are you using? Ollama, LM Studio? If you tell us, we can try to help you!
Anyway, if you're going to use quantization, I recommend LM Studio, as it's efficient, supports Hugging Face GGUFs, and already has integrated tools!
GRM2 has been optimized for tool use, being one of the most efficient models for tools up to 3B parameters.
That's because the jinja template in this repo is missing, so I assume that when it got quant'd, the GGUF ended up without a template.
If you're using LMStudio, you can manually set a ChatML template (you can download and copy-paste my template) in the model's settings.
If you're using llama.cpp add "--chat-template ChatML" (or "--chat-template-file [path_to_my_template_file]") to your command-line argument for llama-server.
If you're using ollama, use something else :D
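For reference, a minimal ChatML Jinja template looks roughly like this (a sketch assuming the standard `<|im_start|>`/`<|im_end|>` tokens; the exact tool-calling markup GRM2 was trained on is not confirmed here, so it is omitted):

```jinja
{%- for message in messages -%}
<|im_start|>{{ message.role }}
{{ message.content }}<|im_end|>
{%- endfor -%}
{%- if add_generation_prompt -%}
<|im_start|>assistant
{%- endif -%}
```

Saved to a file, this can be passed to llama-server via `--chat-template-file [path]`. Note that llama.cpp's built-in preset names are lowercase, so the preset form is `--chat-template chatml`.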
I am using llama-server from llama.cpp, and for 100+ models I haven't needed the chat template flag, but let me try.
/usr/local/bin/llama-server --reasoning-format none --reasoning-budget 0 --jinja -fa on -c 131072 -v --log-timestamps --host 192.168.1.68 --threads 2 --threads-batch 2 --threads-http 4 --batch-size 4096 --ubatch-size 1024 --mlock --mmap --no-warmup --cont-batching -m /mnt/nvme0n1/LLM/quantized/GRM2-3b.i1-Q6_K.gguf --chat-template ChatML
With that, it just goes wild; nothing useful happens. Why don't you make GGUF files that work?
Hi JLouisBiz
The GGUFs you are using are not official Orion LLM Labs files, and we do not create GGUFs for our models.
Likely not a GGUF problem. The issue is probably the prompt format in llama.cpp: forcing --chat-template ChatML can break generation if the model wasn’t trained for ChatML, and llama.cpp expects chatml lowercase anyway. If you don’t use chat templates, remove that flag and use raw /v1/completions. If you want chat mode, check /props and use the model’s actual template instead of forcing ChatML.
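The raw /v1/completions route mentioned above means formatting the prompt client-side instead of relying on the server's template. A minimal sketch of ChatML formatting in Python (whether GRM2 expects exactly these tokens is an assumption based on the thread; the helper name is hypothetical):

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {role, content} dicts into a ChatML prompt string."""
    parts = []
    for m in messages:
        # Each turn is wrapped in <|im_start|>{role} ... <|im_end|>
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = format_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

The resulting string can be sent as the `prompt` field of a raw completion request, bypassing any server-side template entirely.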
@DedeProGames What do you mean it's not trained for ChatML?! It has the special tokens for it: <|im_start|>, <|im_end|>, and so on. And you even ship a ChatML Jinja template in your own "tokenizer_config_search.json" file, except it's missing the tool-calling part of the template... And it's in a file no backend will look for a template in, but whatever.
Do you even know how your own model works?
@JLouisBiz Yeah, that's because normally authors don't self-sabotage their own releases. Just use my file, it works. But given the quality of their support and release, I'd just skip this one if I were you.
I thought @SerialKicked would know what to do, so I followed that advice. I was using i-quants from @MrRad; now I downloaded his normal quant, and I am getting a functional GRM model.
Just that the output on long text was repetitive: too many repeated sections, worded differently but still repetitive. I tried Qwen3.5 4B on the same text and got coherent output.
I cannot use this model at this stage; I may try again in the future. Thanks very much.