Gemma 3 family models have some quirks here, so let's go through them.
What is causing your `ValueError`?
That error is thrown by `transformers` when both of the following are true at generation time:

- `generation_config.cache_implementation` is not `None` (Transformers will try to initialize/manage a cache itself), and
- you pass `past_key_values` as a `Cache` object (e.g., `StaticCache`), meaning you are managing the cache.
Transformers explicitly rejects that combination and raises exactly your message. (GitHub)
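The guard inside `transformers` boils down to something like the following (a simplified sketch of the check, not the library's exact code; the message is a paraphrase):

```python
def validate_cache_kwargs(cache_implementation, past_key_values):
    # Simplified sketch of the transformers guard: a user-supplied Cache
    # object and a cache_implementation string are mutually exclusive.
    if cache_implementation is not None and past_key_values is not None:
        raise ValueError(
            "Passing both `cache_implementation` and `past_key_values` "
            "is unsupported."  # paraphrase of the library's message
        )

# With Gemma 3's default generation config, this is exactly what happens:
try:
    validate_cache_kwargs("hybrid", object())  # user cache + "hybrid" -> raises
except ValueError as e:
    print("raises:", e)

validate_cache_kwargs(None, object())  # fine: the user manages the cache
```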
Why this happens “by default” with Gemma 3
Many Gemma 3 model repos ship a `generation_config.json` with:

```json
{ "cache_implementation": "hybrid" }
```
So even if you never set cache_implementation in code, model.generation_config.cache_implementation starts as "hybrid". (Hugging Face)
That’s why generate(past_key_values=StaticCache(...)) immediately errors: you’re passing a user cache while the model’s generation config says “use hybrid cache”. (GitHub)
Why “I tried cache_implementation=None” often still fails with Unsloth
Unsloth commonly patches `model.generate` with a fast wrapper (`unsloth_base_fast_generate`) that overwrites caching settings. The wrapper computes a cache type (`"static"` / `"hybrid"` / `None`) and then sets:

- `kwargs["generation_config"].cache_implementation = cache_implementation` (if a `generation_config` is passed), or
- `kwargs["cache_implementation"] = cache_implementation` (otherwise),

then calls `self._old_generate(...)`. (GitHub)
So even if you set cache_implementation=None, the wrapper can set it back to "static"/"hybrid", and Transformers will again see both cache_implementation and past_key_values → same ValueError. (GitHub)
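The overwrite behavior can be illustrated with a minimal sketch (illustrative names and logic only, not the actual Unsloth source):

```python
class Cfg:
    # Stand-in for transformers.GenerationConfig (illustrative only)
    cache_implementation = None

def unsloth_fast_generate_sketch(old_generate, **kwargs):
    # Sketch of unsloth_base_fast_generate's cache handling: the wrapper
    # computes its own cache type and injects it before delegating.
    cache_implementation = "hybrid"  # Unsloth picks "static"/"hybrid"/None itself
    if kwargs.get("generation_config") is not None:
        kwargs["generation_config"].cache_implementation = cache_implementation
    else:
        kwargs["cache_implementation"] = cache_implementation
    return old_generate(**kwargs)

cfg = Cfg()
cfg.cache_implementation = None  # the user explicitly disables cache_implementation
unsloth_fast_generate_sketch(lambda **kw: kw, generation_config=cfg)
print(cfg.cache_implementation)  # "hybrid": the user's None was overwritten
```

This is why setting `cache_implementation=None` in your own code is not enough while the wrapper is active.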
Solutions (pick one)
Solution 1 (recommended for prefix-caching): disable Unsloth fast generation
Set this before importing unsloth:

```python
import os
os.environ["UNSLOTH_DISABLE_FAST_GENERATION"] = "1"
```
Unsloth documents this flag. (unsloth.ai)
Then you can safely do:
- set `generation_config.cache_implementation = None`, and
- pass `past_key_values=StaticCache(...)`.
Solution 2: bypass the wrapper and call the original generate
If Unsloth already patched generate, it calls self._old_generate(...) internally. You can call that directly so Unsloth doesn’t re-inject cache_implementation. (GitHub)
Pattern:
```python
gen = model_gemma._old_generate if hasattr(model_gemma, "_old_generate") else model_gemma.generate
gen(**inputs, generation_config=gen_cfg, past_key_values=cache, ...)
```
Solution 3: stop using a user cache (use only cache_implementation)
If you remove past_key_values=StaticCache(...) and rely purely on cache_implementation, Transformers will manage caching internally. This avoids the error, but it does not give you system-prompt prefix reuse across independent requests in the same way as a persisted prefilled cache.
Additional issue in your specific code: the cached prefix does not match your chat prompt
You prefill the cache with:
```python
inputs_sys = tokenizer_gemma(PROMPT_SYSTEM, ...)
```
But later you generate from a prompt produced by apply_chat_template(...) containing a system message + user message. Those tokens are not the same prefix as the raw PROMPT_SYSTEM string.
For prefix caching to be correct/beneficial, the cached prefix must match the exact token sequence at the start of the later prompt. Hugging Face’s prefix caching example works because the “INITIAL_PROMPT” is exactly the prefix of the later prompts. (Hugging Face)
Fix: prefill using the same chat template with only the system message, then generate from system+user.
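One way to sanity-check this before generating is to compare token ids directly. The helper below is a generic, framework-agnostic sketch (the token id values are hypothetical, purely for illustration):

```python
def is_exact_token_prefix(prefix_ids, full_ids):
    """True iff the cached prefix token ids are exactly the start of the full prompt."""
    n = len(prefix_ids)
    return n <= len(full_ids) and list(full_ids[:n]) == list(prefix_ids)

# Illustrative ids only: a raw-string tokenization vs. a chat-templated one.
raw_system_ids = [2, 3000, 3001]            # tokenizer(PROMPT_SYSTEM).input_ids (hypothetical)
templated_ids  = [2, 105, 3000, 3001, 106]  # apply_chat_template(...) output (hypothetical)

# The chat template inserts turn markers, so the raw system string is NOT a
# token-exact prefix of the templated prompt -> cache reuse would be wrong.
print(is_exact_token_prefix(raw_system_ids, templated_ids))  # False
```

Running this check on your real `input_ids` tensors before prefilling is a cheap way to catch the mismatch early.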
Another important Gemma 3 constraint: “unprocessed input_ids” with caches
Gemma 3’s model docs state that when past_key_values are used, the user is expected to pass only the unprocessed input_ids (tokens not already covered by the cache). (Hugging Face)
Transformers generate() usually slices inputs appropriately, but you can still run into edge cases (especially if your cached prefix and the prompt don’t align, or if the “suffix length” becomes zero).
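In list terms, the "unprocessed" suffix is simply everything past the cache length; the zero-length suffix is the edge case to watch for. A generic sketch (plain Python lists standing in for `input_ids`):

```python
def unprocessed_suffix(full_ids, cache_len):
    """Return the token ids not yet covered by a cache of length cache_len."""
    if cache_len >= len(full_ids):
        # Suffix length of zero: generate() would have nothing left to prefill,
        # which is one of the edge cases mentioned above.
        raise ValueError(f"cache already covers all {len(full_ids)} prompt tokens")
    return full_ids[cache_len:]

# Hypothetical ids: a 3-token cached prefix of a 5-token prompt.
print(unprocessed_suffix([2, 105, 3000, 3001, 106], 3))  # [3001, 106]
```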
Minimal “fix shape” for your original snippet
Key changes:
- make `generation_config.cache_implementation = None` for the call
- prefill using the chat template with the system message only
- disable Unsloth fast generation, or call `_old_generate`
```python
import os
os.environ["UNSLOTH_DISABLE_FAST_GENERATION"] = "1"  # must be set before importing unsloth

import copy
import torch
from transformers.cache_utils import StaticCache
from unsloth import FastLanguageModel

model, tok = FastLanguageModel.from_pretrained("gemma_3_lora", max_seq_length=2048, load_in_4bit=False)
model.eval()

# 1) Build the SYSTEM-only prefix using the chat template (token-exact prefix)
sys_inputs = tok.apply_chat_template(
    [{"role": "system", "content": PROMPT_SYSTEM}],
    add_generation_prompt=False,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

# 2) Prefill the cache with that prefix
cache = StaticCache(config=model.config, max_cache_len=2048, device=model.device, dtype=model.dtype)
with torch.no_grad():
    cache = model(**sys_inputs, past_key_values=cache, use_cache=True).past_key_values

# 3) Build the FULL prompt (system + user)
full_inputs = tok.apply_chat_template(
    [{"role": "system", "content": PROMPT_SYSTEM},
     {"role": "user", "content": PROMPT_INPUT.format(context="This is some fake data")}],
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

# 4) Ensure cache_implementation is None when passing a Cache object
gen_cfg = copy.deepcopy(model.generation_config)
gen_cfg.cache_implementation = None

# 5) Generate (unpatched generate, because fast generation is disabled)
out = model.generate(**full_inputs, generation_config=gen_cfg, past_key_values=copy.deepcopy(cache), max_new_tokens=128)
```
Summary of “causes → fixes”
- **Cause A:** Gemma 3 repos often set `"cache_implementation": "hybrid"` in `generation_config.json`. (Hugging Face)
  **Fix:** pass a `GenerationConfig` where `cache_implementation=None` when using `past_key_values`. (GitHub)
- **Cause B:** the Unsloth fast-generation wrapper can forcibly set `cache_implementation` (static/hybrid) before calling the original `generate`. (GitHub)
  **Fix:** set `UNSLOTH_DISABLE_FAST_GENERATION=1` (unsloth.ai) or call `_old_generate`. (GitHub)
- **Cause C** (logic/prefix mismatch): you cached raw `PROMPT_SYSTEM` tokens but generated from chat-template tokens.
  **Fix:** prefill the cache using a system-only `apply_chat_template(...)` so the cached tokens exactly match the start of later prompts. (Hugging Face)