Why does this double my PP and improve TG?

#9 opened by gtkunit

When I use the suggested default --override-tensor or --n-cpu-moe options, I get about 90 PP and 5 TG (everything else identical).
However, when I set --tensor-split 1,0,0, making sure with --override-tensor that I have enough space on CUDA0, I get 218 PP and 6.2 TG.
This gives me better results than using the graph split features. I suppose it is because all regular (non-exp) layers are offloaded in a straight line? Wisdom dispersion would be appreciated!
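
Roughly, the two invocations I'm comparing look like this (a sketch only: the model path, split values, and expert-layer count are placeholders, not my exact commands):

# (a) the suggested hybrid offload: routed experts of the first N layers kept on CPU,
#     rest of the graph split across the GPUs by default
./build/bin/llama-sweep-bench -m model.gguf -ngl 999 --n-cpu-moe 60

# (b) the faster variant: the non-expert graph pinned to one GPU,
#     routed experts pushed to CPU
./build/bin/llama-sweep-bench -m model.gguf -ngl 999 \
  --tensor-split 1,0,0 --override-tensor exps=CPU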

Heya, happy new year!

Hrmm, glad you are experimenting and finding the best command for your specific hardware. It sounds like you have three GPUs then and are using hybrid CPU + 3x GPUs?

This gives me better results than using the graph split features.

Do you mean the -sm graph feature here?

I suppose it is because all regular (non-exp) layers are offloaded in a straight line?

I'd have to know your rig specs better to speculate here, e.g. how much RAM, the NUMA config, which GPUs, PCIe speeds, whether you are compiling with NCCL available and P2P enabled correctly, etc.
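
If you want to sanity-check the PCIe/P2P side quickly, the standard NVIDIA tools will show it (nothing ik_llama.cpp-specific here):

# interconnect topology between GPUs (PIX/PHB/NODE/SYS etc.)
nvidia-smi topo -m
# negotiated PCIe generation and lane width per GPU
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current --format=csv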

It is pretty complex; the best way is to try a lot of things, benchmark with llama-sweep-bench, and see what works best for your specific workload.

In general though, yes, you want all of the attn/shexp/first N dense layers fully offloaded into VRAM, and only the routed expert layers left in RAM for the CPU.
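
As a rough flag sketch (the model path is a placeholder, the tensor names assume a GLM/DeepSeek-style MoE GGUF, and I'm assuming the usual first-match-wins ordering of -ot patterns, as in the full command later in this thread):

# pin the ffn tensors of a few blocks to a GPU first, then send the
# remaining routed experts to CPU; everything else follows -ngl into VRAM
./build/bin/llama-sweep-bench -m model.gguf -ngl 999 \
  -ot "blk.(0|1|2|3).ffn.*=CUDA0" \
  -ot "exps=CPU"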

In my own testing, keeping them "in a straight line" or on the same GPU as the attn etc. doesn't make a huge difference, unless perhaps your PCIe speeds are slow.

Also, in general, you can see how important it is to dial in your rig as best as possible, given it can make a big difference!

Cheers!

Happy new year 😀 and thanks for the reply!
I have a consumer board with 3x3090 and 128GB DDR4.

My PCIe setup is horrid: one card at Gen3 x4, one at Gen3 x1, and one at Gen4 x16.
I tried ReBAR a long time ago and IIRC all boards supported it and it worked, but it didn't improve performance for me. I haven't looked into flashing the BIOS and enabling P2P, as I was planning on installing NVLink but haven't gotten around to it. I usually stick to ExLlama and performance is sufficient there, but I do have a passion for trying out new models (managed to run DeepSeek R1, was it?, at 0.5 TG when it first came out, haha).
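
In case it helps anyone with a similarly jank setup, ReBAR and P2P status can be checked with standard NVIDIA tooling, e.g.:

# BAR1 size per GPU; with ReBAR enabled it is typically much larger than 256 MiB
nvidia-smi -q -d MEMORY | grep -A 3 -i "BAR1"
# P2P read capability matrix between the GPUs
nvidia-smi topo -p2p r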

Yes, I meant -sm graph. I just saw it in your README here 💠.

I haven't tried NCCL yet, I think. I suppose my build options are a bit dated by now:
-DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_CUDA_F16=ON -DGGML_BLAS=OFF -DCMAKE_CUDA_ARCHITECTURES="native" -DGGML_NATIVE=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DLLAMA_SERVER_SQLITE3=ON
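
For completeness, that flag list slots into the usual cmake workflow for llama.cpp forks roughly like this (the build directory name just matches the one in my run command below):

cmake -B build_bf16 -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 -DBUILD_SHARED_LIBS=OFF \
  -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_CUDA_F16=ON -DGGML_BLAS=OFF \
  -DCMAKE_CUDA_ARCHITECTURES="native" -DGGML_NATIVE=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON -DLLAMA_SERVER_SQLITE3=ON
cmake --build build_bf16 --config Release -j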

Oh, and yes, I usually run many dozens of sweep benches. It's very meditative. 😅

The Gen3 x1 card would have been your bottleneck for prompt processing. Is your -ts 1,0,0 putting everything on the Gen3 x4 card or the Gen4 x16 one? If the former, try adjusting it to 0,0,1 and you might see improved PP.
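
Something along these lines; keep in mind the -ts positions follow the order of the devices you expose, so with a CUDA_VISIBLE_DEVICES remap position 0 is the first visible card, not necessarily bus-ID 0:

# hypothetical tweak: keep the -ot overrides as-is and only move the
# non-expert weights onto the third visible GPU
--tensor-split 0,0,1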

It's putting it on the Gen4 x16 card. I've tried it on the Gen3 x1 and PP went down to the 80~90 range.
I've also run some tests with NCCL but haven't seen improvements.

I'm happy with the performance as it is; just always tweaking and benchmarking. It'd make me happy if I could help others with the testing here, though; I can't be the only one with such a jank setup.

If I run -sm graph with a custom -ts and -ot I get
/mnt/xyz/src/ik_llama.cpp/ggml/src/ggml.c:6084: GGML_ASSERT(nhave > 1) failed
but it works fine with --n-cpu-moe. I doubt it's worth looking into, as the performance gains with my hardware are likely small (even with ExLlama, TP gives me only a ±10% increase in TG, IIRC).

One thing that did help with GLM 4.6 (on highly deterministic tasks) was loading a GLM 4.5 draft finetune, but with the latest main of ik_llama it seems like the draft model isn't even loading anymore.
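
For reference, by "loading a draft model" I mean roughly this (a sketch using mainline llama.cpp-style speculative-decoding flags; the paths are placeholders and I haven't re-checked the exact flag names on current ik_llama main):

# main model as usual, plus a small GLM 4.5 draft model for speculative decoding
./build_bf16/bin/llama-server -m main-model.gguf \
  -md glm-4.5-draft.gguf \
  -ngld 999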

Here's how I run it now:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
CUDA_DEVICE_ORDER=PCI_BUS_ID \
CUDA_VISIBLE_DEVICES=2,0,1 \
./build_bf16/bin/llama-sweep-bench \
  --model /mnt/uvw/models/glm-4.7-ubergarm/GLM-4.7-IQ3_KS-00001-of-00005.gguf \
  -ctk q8_0 \
  -ctv q5_1 \
  -c 32768 \
  --batch-size 4096 \
  --ubatch-size 4096 \
  -ngl 999 \
  --tensor-split 1,0,0 \
  -ot "blk.($(seq -s '|' 0 14)).ffn.*=CUDA1" \
  -ot "blk.($(seq -s '|' 15 26)).ffn.*=CUDA2" \
  --override-tensor exps=CPU \
  --threads 6 \
  --threads-batch 12 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 40 \
  --min-p 0 \
  --warmup-batch \
  --no-mmap
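
For anyone copying this: the $(seq -s '|' ...) substitutions just expand to the alternation list of block numbers for the -ot regexes, e.g.:

$ echo "blk.($(seq -s '|' 0 4)).ffn.*"
blk.(0|1|2|3|4).ffn.*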
