How to turn off thinking mode

#86
by Gierry - opened

I know there are three thinking modes, but in some scenarios even the low mode is still too slow for my use case.
Is there a way to disable thinking entirely, like Qwen's no_think mode?

How do you turn on the thinking mode?

There is no way to turn off reasoning; however, you can control the amount of effort by specifying the reasoning effort - it can be either low, medium, or high.
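
For a concrete starting point, here is a minimal sketch of steering the effort through the system prompt. It assumes the openai/gpt-oss-20b checkpoint and a recent Transformers release; the exact wording the chat template expects may vary between template versions:

```python
# Minimal sketch: select the reasoning effort via the system prompt.
# Assumes the openai/gpt-oss-20b checkpoint; "Reasoning: low|medium|high"
# follows the phrasing used in the model card.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    # "low" keeps the analysis channel short; "medium"/"high" trade latency
    # for more deliberate reasoning.
    {"role": "system", "content": "Reasoning: low"},
    {"role": "user", "content": "Summarise this ticket in one sentence."},
]

out = pipe(messages, max_new_tokens=256)
print(out[0]["generated_text"][-1]["content"])
```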

A tricky workaround to disable reasoning mode is to edit the chat_template.jinja file, changing the generation prompt that is appended when add_generation_prompt is set, from:
"<|start|>assistant"
to:
"<|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant"

That's a hit!

How does it impact the model's performance?

Yeah, that’s the hacky way to fully kill reasoning mode in GPT-OSS. But just keep in mind, in their paper they mention the model was post-trained with CoT-RL, which means it always uses reasoning (with variable effort: low/med/high). So turning it off like this isn’t really how the model was designed to work.

I haven’t benchmarked it myself, but I think it will decrease performance a lot — not just on reasoning-heavy stuff (math, coding, logic), but even on simpler tasks, since the model was never trained to operate without reasoning.

I have uploaded that to my profile, fully ready to install!

I'd appreciate it if you want to give it a try 😊

How does that workaround impact the model's performance?

I think it will just show the final message/response from the model and hide the reasoning, so I guess it won't affect performance.

Hi. I'm happy to join this thread.
I fine-tuned GPT-OSS 20B with a technique that helps maintain its native reasoning capability. It worked.

In my case, I need to process 570k prompts and, with native thinking enabled, it takes about 147 s (mean) to complete a generation (2× NVIDIA H100, MXFP4 quantization enabled, batched inference with batch size 8).
Timings drop drastically to ~30 s when I apply @anhnmt's workaround. Then I looked at the outputs. I did not set up any quantitative benchmark, but I saw consistent performance degradation.

In the fine-tuning scenario, an idea could be to train the model while also computing the loss on the reasoning tokens (which is done by default).
With consistent training, this would lead to direct generation and removal of the thinking step (a rough sketch follows after the list below). I see a caveat: in this hypothetical scenario, the LoRA adapter (assuming training follows a PEFT approach) would be trained to accomplish two things:

  • Produce the expected output, useful in my scenario.
  • Forget the capability to reason.

BUT! Since the base model weights are frozen, the LoRA adapter would suffer from learning two things together instead of focusing on the main task only.
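
As a rough sketch of what I mean (hypothetical names and hyperparameters, not my actual training config), the default SFT setup keeps the reasoning tokens in the labels, so the adapter is optimised on both the analysis and the final channel:

```python
# Rough PEFT/LoRA sketch; model id, LoRA rank, and target modules are
# assumptions for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "openai/gpt-oss-20b"  # assumed base checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

# Only the adapter is trainable; the frozen base weights are why the adapter
# would have to absorb both "solve the task" and "stop reasoning" at once.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()

def build_example(rendered_conversation: str):
    # Default behaviour: labels == input_ids, so the loss also covers the
    # reasoning tokens. Masking them to -100 instead would be the
    # alternative choice discussed above.
    ids = tok(rendered_conversation, return_tensors="pt")["input_ids"]
    return {"input_ids": ids, "labels": ids.clone()}
```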

TL;DR: I decided to accept longer processing times, prioritising quality over timing.

I would like to hear more on this topic. I'm interested in finding the sweet spot between effectiveness and efficiency.

Hi,

Just a quick note on my latest post here. I set up a proper benchmark to quantify the performance of my fine-tuned GPT-OSS 20B with and without thinking.
Surprisingly, I was wrong: I didn't notice any degradation. Indeed, my model kept the same performance.

This is probably due to the fine-tuning setup. I don't know whether these considerations still hold in a base-model inference scenario.
Anyway, I thought it was a good idea to share this insight.
