Domino on Linux/Unix, Troubleshooting, Best Practices, Tips and more ...


Daniel Nashed

Simple LLM performance test for different GPU and CPU hardware

Daniel Nashed – 26 April 2025 23:13:48

The llama.cpp project is the base for many LLM solutions.
It is also the base for Ollama. Picking the right hardware can be quite tricky.

The project includes a simple performance test. I took a quick look, picking a small model which runs quite well even on modern CPU hardware.
With modern CPUs you can get decent performance for small models with a few parallel requests.

The following is what I tested today on different hardware.
Modern Intel CPUs are good for simple testing.

Apple Silicon is already better. A modern NVIDIA GPU, even in a notebook, has much better performance.

Small enterprise GPUs have better performance even for small LLMs.

Of course, for larger models GPU RAM also matters.
But in this case it is the pure compute performance that this simple test compares.

This simple test already gives a general idea of what type of performance you can expect from modern hardware.
I might redo the test with older GPUs, AMD CPUs and older NVIDIA cards just to get an idea.

But I would also like to hear about your experience, especially if you have access to NVIDIA enterprise hardware like an H100.



I gave up formatting this richtext. The Domino blog template is unbelievably broken. I should move, but I don't want to lose my blog history.

When saving the document, it always gets messed up again. It removes and adds new lines in a very weird way when saved.


-- Anyhow here is the text --



All tests were performed on Linux with this simple command. In the results below, pp512 measures prompt processing throughput (a 512-token prompt) and tg128 measures token generation (128 tokens), both in tokens per second (t/s):



./llama-bench -m qwen2.5-0.5b-instruct-q3_k_m.gguf
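For reference, a minimal sketch of how to get to that point, assuming a Linux box with git, cmake and a compiler; the model file name matches the one used above, and the download URL assumes the official Qwen GGUF repository on Hugging Face:

```shell
# Sketch: build llama.cpp and fetch the small Qwen2.5 0.5B model (assumed URL).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON        # drop -DGGML_CUDA=ON for a CPU-only build
cmake --build build --config Release -j

# Q3_K_M quantization of Qwen2.5 0.5B instruct (assumed Hugging Face path)
wget https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q3_k_m.gguf

./build/bin/llama-bench -m qwen2.5-0.5b-instruct-q3_k_m.gguf
```

llama-bench also accepts flags such as -p and -n to change the prompt and generation lengths, and -ngl to control how many layers are offloaded to the GPU.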



Hosted server: Intel Xeon Processor (Icelake)

| model                  |       size |   params | backend | ngl | test  |            t/s |
| ---------------------- | ---------: | -------: | ------- | --: | ----- | -------------: |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | RPC     |  99 | pp512 |  196.84 ± 0.50 |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | RPC     |  99 | tg128 |   64.26 ± 0.20 |



Proxmox: 12th Gen Intel(R) Core(TM) i9-12900HK

| model                  |       size |   params | backend | ngl | test  |            t/s |
| ---------------------- | ---------: | -------: | ------- | --: | ----- | -------------: |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | RPC     |  99 | pp512 | 352.61 ± 25.50 |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | RPC     |  99 | tg128 |  130.07 ± 2.84 |

Apple M4

| model                  |       size |   params | backend        | threads | test  |             t/s |
| ---------------------- | ---------: | -------: | -------------- | ------: | ----- | --------------: |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | Metal,BLAS,RPC |       4 | pp512 | 2294.79 ± 45.80 |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | Metal,BLAS,RPC |       4 | tg128 |   150.85 ± 2.63 |


NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes

| model                  |       size |   params | backend | ngl | test  |               t/s |
| ---------------------- | ---------: | -------: | ------- | --: | ----- | ----------------: |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | CUDA    |  99 | pp512 | 16827.08 ± 228.85 |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | CUDA    |  99 | tg128 |     288.66 ± 1.67 |

NVIDIA RTX 4000 SFF Ada Generation, compute capability 8.9, VMM: yes

| model                  |       size |   params | backend | ngl | test  |              t/s |
| ---------------------- | ---------: | -------: | ------- | --: | ----- | ---------------: |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | CUDA    |  99 | pp512 | 21249.40 ± 86.94 |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | CUDA    |  99 | tg128 |    363.51 ± 1.70 |
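To put the tg128 generation numbers in relation, here is a small shell helper (using the measured values from the tables above) that expresses each result as a speedup over the Icelake Xeon baseline:

```shell
# Compute tg128 speedups relative to the Icelake Xeon baseline (64.26 t/s).
# The t/s values are the measured results from the tables above.
speedup() { awk -v t="$1" -v b="$2" 'BEGIN { printf "%.1f", t / b }'; }

baseline=64.26
s_i9=$(speedup 130.07 "$baseline")    # Core i9-12900HK
s_m4=$(speedup 150.85 "$baseline")    # Apple M4
s_4060=$(speedup 288.66 "$baseline")  # RTX 4060 Laptop GPU
s_ada=$(speedup 363.51 "$baseline")   # RTX 4000 SFF Ada Generation

printf 'i9-12900HK: %sx  M4: %sx  RTX 4060: %sx  RTX 4000 Ada: %sx\n' \
       "$s_i9" "$s_m4" "$s_4060" "$s_ada"
```

Even the consumer laptop GPU generates tokens roughly four to five times faster than the hosted server CPU, while the small enterprise card is close to six times faster.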




