RAM usage, Model streaming or alternatives

RAM has gotten very expensive and I have some inferencing tasks that are not time sensitive. Unfortunately, some models with a lot more parameters are just a little bit too large for my RAM. I have 32GB RAM + 16GB VRAM, which I suppose a lot of people have. I am looking for some general knowledge and some concrete advice regarding this matter. I am on Windows and LM Studio at the moment but I am open to switching to something else (like Ollama), if the benefits are worth it.

1. Some models are just a little bit bigger than my RAM (like Nemotron 3 Nano 30B A3B Q8_0 at 33.59GB, for example) and of course the operating system will also need a little bit of RAM. LM Studio says this will fit, with “partial GPU offload possible”. This would easily fit into the 32+16 = 48GB of fast memory I have, but as far as I have heard, the model has to fit into the RAM, is this correct? Another possible explanation would be the page file, which (so far) is automatically managed by Windows. At the moment, it says “Currently allocated 32489MB” and Windows Task Manager says 10.2/63.5GB committed memory (no model loaded).

2. Some models are deemed “likely too large” by LM Studio and I seem to be missing the requirements by just a bit: for example, this goes for Hermes 4 70B Q3_K_L with a size of 37.14GB, and even the Q4 variant (42.52GB) should easily fit into my RAM+VRAM or RAM+paging. Of course I could just try, and LM Studio doesn’t claim to know exactly what will happen, but how does this really work, and where should I draw the line for “normal” inferencing?

3. Also, it seems like there are ways to go to the extreme by telling your inferencing tool to behave differently than it normally would and accepting extreme inferencing times in return.
For example, I have heard that normal GPU offloading means the entire model must fit into the RAM, but some layers of the model are also copied to the VRAM. Isn’t it possible to allocate some or all of the VRAM to be used like normal RAM, so that 48GB (in this case) would actually be usable to store larger models?
Also, I have heard of model streaming, where parts of the model are just streamed from the file. I’m using LM Studio and can’t find such an option. Is this possible with LM Studio or with Ollama, for example?

Maybe there are also other methods I haven’t heard of? It seems such a shame not to be able to use an entire model family like Hermes 4 just because the smallest variant is 3GB larger than my RAM.



When using large models, not only do the model weights themselves become heavy, but the RAM consumed by context, KV cache, etc. increases even more significantly.


What “fits in memory” actually means (for GGUF/llama.cpp-style inference)

When you run a local LLM, you’re budgeting for more than just the model file size:

  • Model weights (largest chunk; roughly the GGUF “file size” for a quant).
  • KV cache (grows with context length; can be multiple GB).
  • Runtime overhead (temporary buffers, CUDA/Metal/DirectML workspaces, tokenizer, graphs, fragmentation, etc.).
  • OS + background apps.
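
The budget above can be sketched as a back-of-the-envelope check. This is a rough estimator, not LM Studio’s actual logic; the overhead and OS figures are assumptions you should tune for your machine:

```python
# Rough GGUF inference memory-budget check (all figures in GB).
# overhead_gb and os_and_apps_gb are guesses, not measured values.

def cpu_side_footprint(weights_gb, kv_cache_gb,
                       vram_offload_gb=0.0, overhead_gb=2.0):
    """CPU-side resident footprint: weights not offloaded to VRAM,
    plus KV cache (assumed on CPU here), plus runtime overhead."""
    return weights_gb - vram_offload_gb + kv_cache_gb + overhead_gb

def fits_comfortably(weights_gb, kv_cache_gb, vram_offload_gb=0.0,
                     ram_gb=32.0, os_and_apps_gb=4.0):
    cpu_side = cpu_side_footprint(weights_gb, kv_cache_gb, vram_offload_gb)
    return cpu_side + os_and_apps_gb <= ram_gb

# Nemotron Q8_0 (~33.6 GB) with ~12 GB of weights pushed to VRAM:
cpu_side = cpu_side_footprint(33.6, kv_cache_gb=1.5, vram_offload_gb=12.0)
ok = fits_comfortably(33.6, kv_cache_gb=1.5, vram_offload_gb=12.0)
```

With those (assumed) numbers the CPU side lands around 25 GB, which is why a Q8 that is “bigger than RAM” can still load once a meaningful chunk of weights lives in VRAM.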

On Windows, there are two different “limits” that matter:

  1. Commit limit (RAM + page file). If you exceed this, allocations fail and the load aborts or the app crashes. Microsoft describes the system commit limit as RAM + all page files and shows how Task Manager reports “Committed”. (Microsoft Learn)
  2. Physical RAM working set (what’s actually in RAM right now). If your working set exceeds physical RAM by a lot, Windows will page heavily, causing massive slowdowns even if the model “still runs”.

1) “Can a model be a bit bigger than my 32GB RAM if I have 16GB VRAM?”

The key point

VRAM is not an extension of system RAM you can just add up as “48GB unified”. On a discrete GPU, VRAM (“local”) and system memory (“non-local”) are different pools; Windows can map/evict GPU allocations, but using non-local/system memory for GPU workloads is far slower and can thrash. (Microsoft Learn)

What “partial GPU offload possible” usually means

Backends in the llama.cpp family can place some weight tensors on the GPU and keep the rest in system RAM (plus KV cache either on GPU or CPU depending on settings). You don’t need “the entire model in RAM” in the simplistic sense—you need the total allocations to stay within commit, and you need enough physical RAM to avoid paging cliffs.
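
A crude way to think about this placement, assuming roughly equal-sized layers (an approximation; real layers and the embedding/output tensors differ in size):

```python
def layers_on_gpu(weights_gb, n_layers, vram_budget_gb):
    """How many (roughly equal-sized) layers fit in a VRAM budget."""
    per_layer_gb = weights_gb / n_layers
    return min(n_layers, int(vram_budget_gb // per_layer_gb))

# Hermes-4-70B IQ3_M (~31.9 GB, 80 layers), leaving ~2 GB of the
# 16 GB VRAM as headroom for compute buffers:
n = layers_on_gpu(31.9, 80, 14.0)
```

The remaining layers stay in system RAM, which is why the relevant question is “does the CPU-side remainder fit comfortably”, not “does the whole file fit in RAM”.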

LM Studio specifically talks about GPU offloading by splitting into subgraphs and moving them to the GPU, loading/unloading as needed. (LM Studio)

Your Nemotron example (Q8_0 ≈ 33.6GB)

The file is about 33.6GB for Q8_0 on HF, and smaller quants exist (Q4_K_M ≈ 24.5GB; Q3_K_L ≈ 20.7GB). (Hugging Face)

With 32GB RAM, Q8_0 is right on the edge even before KV cache + overhead. Whether it loads depends on:

  • how much weight can be offloaded into dedicated VRAM,
  • how big your page file / commit limit is,
  • how much overhead and background usage you have,
  • and what context length you use (KV cache).

Why Task Manager shows “10.2 / 63.5 committed”: the 63.5GB is your commit limit (RAM + page file). Windows explicitly defines that relationship. (Microsoft Learn)
So: yes, a model slightly bigger than physical RAM can still load/run if the commit limit allows it—but performance may collapse if it actually needs to page.


2) “Why does LM Studio say Hermes 4 70B Q3/Q4 is ‘likely too large’ even though I have RAM+VRAM (or paging)? Where’s the line?”

Why the warning happens

A 70B model is “weights + KV cache + overhead”. Even if the weights look only slightly above RAM, the KV cache makes it non-slight very quickly.

Hermes-4-70B GGUF sizes (examples):

  • Q3_K_L ~ 37.14GB
  • Q4_K_M ~ 42.52GB
    
and it also has smaller “importance/mixed” quants like IQ3_M ~ 31.94GB, IQ3_XXS ~ 27.47GB, etc. (Hugging Face)

KV cache: the hidden memory eater (concrete numbers)

Hermes-4-70B is based on Llama-3.1-70B. (Hugging Face)
Llama-3.1-70B config is (notably) 80 layers, 64 attention heads, 8 KV heads, hidden size 8192. (Hugging Face)

For fp16 KV cache, memory per token is approximately:

  • head_dim = 8192 / 64 = 128
  • KV bytes/token ≈ 80 layers × 8 KV heads × 128 head_dim × 2 bytes (fp16) × 2 (K+V) = 327,680 bytes/token ≈ 320 KB/token

That implies KV cache alone is roughly:

  • 2k context: ~0.6 GB
  • 4k: ~1.25 GB
  • 8k: ~2.5 GB
  • 32k: ~10 GB
  • 128k: ~40 GB

So even if you could “barely” fit weights, a bigger context pushes you over.
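
The arithmetic above can be checked with a short script (fp16 cache, Llama-3.1-70B geometry as stated):

```python
# fp16 KV-cache size for Llama-3.1-70B-style models (80 layers,
# 8 KV heads, head_dim 128), per the config cited above.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 80, 8, 128
BYTES_FP16, K_AND_V = 2, 2

kv_bytes_per_token = N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_FP16 * K_AND_V

def kv_cache_gib(context_tokens):
    return context_tokens * kv_bytes_per_token / 2**30

for ctx in (2048, 4096, 8192, 32768, 131072):
    print(f"{ctx:>7} tokens: {kv_cache_gib(ctx):6.2f} GiB")
```

This reproduces the table: ~0.625 GiB at 2k up to 40 GiB at 128k context.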

Where to draw the line (practical “normal inference” guidance)

On 32GB RAM + 16GB VRAM (Windows), for “normal” usability (not minutes/token), a good rule is:

  • Keep the CPU-side resident footprint under ~26–28GB (leaves OS/headroom).
  • Prefer weights that are clearly below RAM, not “barely above”.
  • Treat paging as a last resort—it’s for “it runs eventually”, not “usable”.

For Hermes-4-70B specifically, that typically means:

  • If you insist on 70B: try IQ3_M (~31.9GB) or smaller IQ3_XXS (~27.5GB) first (and keep context modest). (Hugging Face)
  • Q3_K_L (~37GB) and Q4_* (~40–44GB) are generally “paging territory” on 32GB unless you offload a large fraction of weights to VRAM and keep KV/offload choices conservative.

3) “Extreme modes: VRAM-as-RAM? model streaming from file? Does LM Studio/Ollama support it?”

3a) “Can I use VRAM like normal RAM to get 48GB usable?”

Not in the way you want on a discrete GPU.

  • On Windows (WDDM), discrete GPUs have local (VRAM) and non-local (system memory) budgets; system memory can be used for GPU resources, but it’s not equivalent to extra VRAM and is much slower. (Microsoft Learn)
  • “Shared GPU memory” shown in Task Manager is basically system RAM that the GPU may borrow, typically as spillover. This is exactly the scenario that causes dramatic slowdowns.

A related real-world pitfall: LM Studio users have reported cases where memory goes into shared GPU memory instead of dedicated VRAM and causes issues. (GitHub)
LM Studio even added an option specifically to limit offload to dedicated GPU memory to avoid spilling into shared memory (performance cliff). (LM Studio)

3b) “Model streaming from file”

In llama.cpp terms, what people often call “streaming” is memory mapping (mmap):

  • The model file is mapped into the process address space and pages are pulled from disk on demand by the OS.

llama.cpp exposes this explicitly: --mmap / --no-mmap. (manpages.debian.org)

What mmap does and does not do:

  • ✅ It can reduce up-front load time and avoid copying the entire file immediately.
  • ✅ It can allow “it loads” even when RAM is tight (because the OS can page).
  • ❌ During generation, if the model touches most weight pages, your working set trends toward “nearly the whole model anyway”.
  • ❌ If you’re over RAM, this often becomes random NVMe reads + page faults → extremely slow.

So: “streaming” is real, but it’s not magic—it often turns into paging.
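
A tiny illustration of the mechanism (Python’s `mmap` here stands in for what llama.cpp does with a multi-gigabyte GGUF; the file and sizes are obviously toy values):

```python
# Memory mapping in miniature: the file is mapped into the address
# space, and the OS pages bytes in only when they are actually touched.
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096 * 16)        # stand-in for a model file

with open(path, "rb") as f, \
        mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    first_page_byte = mm[0]    # faults in the first page only
    last_page_byte = mm[-1]    # faults in the last page
```

If inference touches nearly every “page” of weights per token, the working set converges on the whole file, which is exactly the paging cliff described above.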

3c) Ollama: mmap and memory checks

Ollama has a use_mmap concept, but multiple issues indicate:

  • it may not apply to all engine/model types, and
  • it doesn’t necessarily bypass “needs full memory” behavior for some models. (GitHub)

Also, there are reports that use_mmap=true can have side effects, such as memory not being released as expected in some scenarios. (GitHub)

So if your primary goal is “squeeze borderline models into memory”, direct llama.cpp (or a tool that exposes llama.cpp flags cleanly) is usually the most controllable.


Concrete best practices for your exact hardware (32GB RAM + 16GB VRAM, Windows)

A) Treat 70B as “special handling”, not default

If you want Hermes-4-70B locally:

  1. Pick a smaller 70B quant first

    • Start with IQ3_M (~31.9GB) or even IQ3_XXS (~27.5GB). (Hugging Face)
    • Avoid Q4_* (~40GB+) on 32GB unless you accept heavy paging.
  2. Keep context modest

    • 2k–4k is the difference between “maybe works” and “falls over” (KV cache grows fast; see numbers above).
  3. Keep KV cache off the GPU if VRAM is tight

    • If weights already consume most of 16GB VRAM, pushing KV there can trigger spill into shared GPU memory (bad cliff).
    • llama.cpp exposes --kv-offload/--no-kv-offload and KV cache quantization (--cache-type-k, --cache-type-v). (manpages.debian.org)
  4. Use KV cache quantization when RAM-bound

    • For example, setting KV cache to q8 or q4 can materially reduce KV memory. (Tradeoff: extra compute/latency; depends on backend/hardware.) (manpages.debian.org)
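
Points 1–4 might translate into a llama-server invocation like the sketch below. The flag names follow the manpage cited in this thread, but check `llama-server --help` on your build, since spellings vary by version; the model filename is hypothetical:

```python
# Assemble a llama-server command line for a RAM-constrained 70B run.
# Flag spellings per the Debian llama.cpp manpage; verify locally.
model = "Hermes-4-70B-IQ3_M.gguf"   # hypothetical local filename

argv = [
    "llama-server",
    "-m", model,
    "-c", "4096",               # 2. keep context modest
    "-ngl", "30",               # 1./partial offload: ~30 layers to VRAM
    "--no-kv-offload",          # 3. keep KV cache in system RAM
    "--cache-type-k", "q8_0",   # 4. quantize the K cache
    "--cache-type-v", "q8_0",   # 4. quantize the V cache
]
print(" ".join(argv))
```

Tune `-ngl` downward if VRAM spills into shared GPU memory, and upward if dedicated VRAM still has headroom.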

B) For Nemotron-3-Nano-30B-A3B: use the right quant instead of forcing Q8

You cited Q8_0 (~33.6GB). The same repo lists:

  • Q4_K_M ~24.5GB
  • Q3_K_L ~20.7GB
  • Q6_K ~33.5GB
  • Q8_0 ~33.6GB (Hugging Face)

On 32GB RAM, Q4_K_M is the “runs normally” option; Q8 is “edge/paging” unless you can offload a meaningful chunk of weights to VRAM.
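
Applying the earlier ~26–28GB CPU-side ceiling to that quant table (the ceiling is the rule of thumb from this thread, not a hard limit):

```python
# Which Nemotron quants leave headroom on 32 GB RAM, before any
# VRAM offload? Sizes from the HF repo table quoted above.
quants = {"Q3_K_L": 20.7, "Q4_K_M": 24.5, "Q6_K": 33.5, "Q8_0": 33.6}
CEILING_GB = 27.0   # midpoint of the suggested 26-28 GB range

comfortable = [name for name, gb in quants.items() if gb <= CEILING_GB]
# Q6_K and Q8_0 only work with substantial VRAM offload or paging.
```
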

C) Page file: make it a conscious tool, not an accident

If you decide to run borderline models with paging:

  • Put the page file on a fast SSD/NVMe.
  • Ensure the commit limit is comfortably above your worst-case usage (Windows defines commit limit as RAM + page files). (Microsoft Learn)
    This prevents hard allocation failures, but it does not prevent slowdowns.

D) If you stay with LM Studio

  • Enable the setting to avoid spilling into shared GPU memory (limit to dedicated VRAM) when you’re near VRAM capacity. (LM Studio)
  • Use partial offload deliberately (don’t assume VRAM “adds” to RAM; it’s a placement decision).

E) If you’re open to switching tools

For memory-constrained, “I’ll wait longer” workloads:

  • Direct llama.cpp / llama-server: most knobs (mmap, KV offload, KV quantization). (manpages.debian.org)
  • Ollama: simplest UX, but mmap behavior can be inconsistent across engines/models per issues. (GitHub)
  • LM Studio: good GUI; has explicit work on controlling dedicated vs shared GPU memory and supports GPU offload workflows. (LM Studio)

Similar cases / useful issues & references (worth reading)

  • Windows commit/page file mechanics (so Task Manager numbers make sense): (Microsoft Learn)
  • WDDM GPU memory model (why VRAM doesn’t become “extra RAM”): (Microsoft Learn)
  • LM Studio dedicated-vs-shared GPU memory control: (LM Studio)
  • llama.cpp server flags for mmap + KV controls + KV quantization: (manpages.debian.org)
  • Ollama mmap limitations / behavior discussions: (GitHub)
  • Hermes-4-70B GGUF quant size table (includes IQ quants): (Hugging Face)
  • Nemotron-3-Nano-30B-A3B GGUF file sizes (Q3/Q4/Q6/Q8): (Hugging Face)

Bottom line recommendations for you

  1. If you want “normal” inference: stick to ≤30B in Q4-ish (or 70B only in very small IQ quants).
  2. If you want Hermes-4-70B anyway: start at IQ3_M (or smaller), keep context 2k–4k, and be deliberate about KV placement/quantization. (Hugging Face)
  3. Don’t plan on “48GB unified”—plan on careful tensor placement and avoiding shared GPU memory spill. (Microsoft Learn)