Gemma 3 family models have some quirks here, so let's go through them.
What is causing your `ValueError`?
That error is thrown by `transformers` when both of the following are true at generation time:

- `generation_config.cache_implementation` is not `None` (Transformers will try to initialize/manage a cache itself), and
- you pass `past_key_values` as a `Cache` object (e.g., `StaticCache`), meaning you are managing the cache.
Transformers explicitly rejects that combination and raises exactly your message. (GitHub)
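The guard inside `transformers` boils down to something like the following (a simplified sketch of the check, not the library's exact code; the message is a paraphrase):

```python
def validate_cache_kwargs(cache_implementation, past_key_values):
    # Simplified sketch of the transformers guard: a user-supplied Cache
    # object and a cache_implementation string are mutually exclusive.
    if cache_implementation is not None and past_key_values is not None:
        raise ValueError(
            "Passing both `cache_implementation` and `past_key_values` "
            "is unsupported."  # paraphrase of the library's message
        )

# With Gemma 3's default generation config, this is exactly what happens:
try:
    validate_cache_kwargs("hybrid", object())  # user cache + "hybrid" -> raises
except ValueError as e:
    print("raises:", e)

validate_cache_kwargs(None, object())  # fine: the user manages the cache
```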
Why this happens “by default” with Gemma 3
Many Gemma 3 model repos ship a `generation_config.json` with:

```json
{ "cache_implementation": "hybrid" }
```
So even if you never set cache_implementation in code, model.generation_config.cache_implementation starts as "hybrid". (Hugging Face)
That’s why generate(past_key_values=StaticCache(...)) immediately errors: you’re passing a user cache while the model’s generation config says “use hybrid cache”. (GitHub)
Why “I tried cache_implementation=None” often still fails with Unsloth
Unsloth commonly patches `model.generate` with a fast wrapper (`unsloth_base_fast_generate`) that overwrites caching settings. The wrapper computes a cache type (`"static"` / `"hybrid"` / `None`) and then sets:

- `kwargs["generation_config"].cache_implementation = cache_implementation` (if a `generation_config` is passed), or
- `kwargs["cache_implementation"] = cache_implementation` (otherwise),

then calls `self._old_generate(...)`. (GitHub)
So even if you set cache_implementation=None, the wrapper can set it back to "static"/"hybrid", and Transformers will again see both cache_implementation and past_key_values → same ValueError. (GitHub)
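The overwrite behavior can be illustrated with a minimal sketch (illustrative names and logic only, not the actual Unsloth source):

```python
class Cfg:
    # Stand-in for transformers.GenerationConfig (illustrative only)
    cache_implementation = None

def unsloth_fast_generate_sketch(old_generate, **kwargs):
    # Sketch of unsloth_base_fast_generate's cache handling: the wrapper
    # computes its own cache type and injects it before delegating.
    cache_implementation = "hybrid"  # Unsloth picks "static"/"hybrid"/None itself
    if kwargs.get("generation_config") is not None:
        kwargs["generation_config"].cache_implementation = cache_implementation
    else:
        kwargs["cache_implementation"] = cache_implementation
    return old_generate(**kwargs)

cfg = Cfg()
cfg.cache_implementation = None  # the user explicitly disables cache_implementation
unsloth_fast_generate_sketch(lambda **kw: kw, generation_config=cfg)
print(cfg.cache_implementation)  # "hybrid": the user's None was overwritten
```

This is why setting `cache_implementation=None` in your own code is not enough while the wrapper is active.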
Solutions (pick one)
Solution 1 (recommended for prefix-caching): disable Unsloth fast generation
Set this before importing unsloth:

```python
import os
os.environ["UNSLOTH_DISABLE_FAST_GENERATION"] = "1"
```
Unsloth documents this flag. (unsloth.ai)
Then you can safely do:
- set `generation_config.cache_implementation = None`, and
- pass `past_key_values=StaticCache(...)`.
Solution 2: bypass the wrapper and call the original generate
If Unsloth already patched generate, it calls self._old_generate(...) internally. You can call that directly so Unsloth doesn’t re-inject cache_implementation. (GitHub)
Pattern:
```python
gen = model_gemma._old_generate if hasattr(model_gemma, "_old_generate") else model_gemma.generate
gen(**inputs, generation_config=gen_cfg, past_key_values=cache, ...)
```
Solution 3: stop using a user cache (use only cache_implementation)
If you remove past_key_values=StaticCache(...) and rely purely on cache_implementation, Transformers will manage caching internally. This avoids the error, but it does not give you system-prompt prefix reuse across independent requests in the same way as a persisted prefilled cache.
Additional issue in your specific code: the cached prefix does not match your chat prompt
You prefill the cache with:
```python
inputs_sys = tokenizer_gemma(PROMPT_SYSTEM, ...)
```
But later you generate from a prompt produced by apply_chat_template(...) containing a system message + user message. Those tokens are not the same prefix as the raw PROMPT_SYSTEM string.
For prefix caching to be correct/beneficial, the cached prefix must match the exact token sequence at the start of the later prompt. Hugging Face’s prefix caching example works because the “INITIAL_PROMPT” is exactly the prefix of the later prompts. (Hugging Face)
Fix: prefill using the same chat template with only the system message, then generate from system+user.
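One way to sanity-check this before generating is to compare token ids directly. The helper below is a generic, framework-agnostic sketch (the token id values are hypothetical, purely for illustration):

```python
def is_exact_token_prefix(prefix_ids, full_ids):
    """True iff the cached prefix token ids are exactly the start of the full prompt."""
    n = len(prefix_ids)
    return n <= len(full_ids) and list(full_ids[:n]) == list(prefix_ids)

# Illustrative ids only: a raw-string tokenization vs. a chat-templated one.
raw_system_ids = [2, 3000, 3001]            # tokenizer(PROMPT_SYSTEM).input_ids (hypothetical)
templated_ids  = [2, 105, 3000, 3001, 106]  # apply_chat_template(...) output (hypothetical)

# The chat template inserts turn markers, so the raw system string is NOT a
# token-exact prefix of the templated prompt -> cache reuse would be wrong.
print(is_exact_token_prefix(raw_system_ids, templated_ids))  # False
```

Running this check on your real `input_ids` tensors before prefilling is a cheap way to catch the mismatch early.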
Another important Gemma 3 constraint: “unprocessed input_ids” with caches
Gemma 3’s model docs state that when past_key_values are used, the user is expected to pass only the unprocessed input_ids (tokens not already covered by the cache). (Hugging Face)
Transformers generate() usually slices inputs appropriately, but you can still run into edge cases (especially if your cached prefix and the prompt don’t align, or if the “suffix length” becomes zero).
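In list terms, the "unprocessed" suffix is simply everything past the cache length; the zero-length suffix is the edge case to watch for. A generic sketch (plain Python lists standing in for `input_ids`):

```python
def unprocessed_suffix(full_ids, cache_len):
    """Return the token ids not yet covered by a cache of length cache_len."""
    if cache_len >= len(full_ids):
        # Suffix length of zero: generate() would have nothing left to prefill,
        # which is one of the edge cases mentioned above.
        raise ValueError(f"cache already covers all {len(full_ids)} prompt tokens")
    return full_ids[cache_len:]

# Hypothetical ids: a 3-token cached prefix of a 5-token prompt.
print(unprocessed_suffix([2, 105, 3000, 3001, 106], 3))  # [3001, 106]
```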
Minimal “fix shape” for your original snippet
Key changes:
- make `generation_config.cache_implementation = None` for the call
- prefill using the chat template with the system message only
- disable Unsloth fast generation, or call `_old_generate`
```python
import os
os.environ["UNSLOTH_DISABLE_FAST_GENERATION"] = "1"  # must be set before importing unsloth

import copy
import torch
from transformers.cache_utils import StaticCache
from unsloth import FastLanguageModel

model, tok = FastLanguageModel.from_pretrained("gemma_3_lora", max_seq_length=2048, load_in_4bit=False)
model.eval()

# 1) Build the SYSTEM-only prefix using the chat template (token-exact prefix)
sys_inputs = tok.apply_chat_template(
    [{"role": "system", "content": PROMPT_SYSTEM}],
    add_generation_prompt=False,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

# 2) Prefill the cache with that prefix
cache = StaticCache(config=model.config, max_cache_len=2048, device=model.device, dtype=model.dtype)
with torch.no_grad():
    cache = model(**sys_inputs, past_key_values=cache, use_cache=True).past_key_values

# 3) Build the FULL prompt (system + user)
full_inputs = tok.apply_chat_template(
    [{"role": "system", "content": PROMPT_SYSTEM},
     {"role": "user", "content": PROMPT_INPUT.format(context="This is some fake data")}],
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

# 4) Ensure cache_implementation is None when passing a Cache object
gen_cfg = copy.deepcopy(model.generation_config)
gen_cfg.cache_implementation = None

# 5) Generate (unpatched generate, because fast generation is disabled)
out = model.generate(**full_inputs, generation_config=gen_cfg, past_key_values=copy.deepcopy(cache), max_new_tokens=128)
```
Summary of “causes → fixes”
- **Cause A:** Gemma 3 repos often set `"cache_implementation": "hybrid"` in `generation_config.json`. (Hugging Face)
  **Fix:** pass a `GenerationConfig` where `cache_implementation=None` when using `past_key_values`. (GitHub)
- **Cause B:** the Unsloth fast-generation wrapper can forcibly set `cache_implementation` (static/hybrid) before calling the original `generate`. (GitHub)
  **Fix:** set `UNSLOTH_DISABLE_FAST_GENERATION=1` (unsloth.ai) or call `_old_generate`. (GitHub)
- **Cause C** (logic/prefix mismatch): you cached raw `PROMPT_SYSTEM` tokens but generated from chat-template tokens.
  **Fix:** prefill the cache using a system-only `apply_chat_template(...)` so the cached tokens exactly match the start of later prompts. (Hugging Face)