Simple LLM performance test for different GPU and CPU hardware
Daniel Nashed – 26 April 2025 23:13:48
The llama.cpp project is the base for many LLM solutions.
It is also the base for Ollama. Picking the right hardware can be quite tricky.
The project ships a simple performance test. I took a quick look, picking a small model which runs quite well even on modern CPU hardware.
With modern CPU hardware you can get decent performance for small models with a few parallel requests.
The following is what I tested today on different hardware.
Modern Intel CPUs are good for simple testing.
Apple Silicon is already better. A modern NVIDIA GPU, even in a notebook, has way better performance.
Small enterprise GPUs have better performance even for small LLMs.
Of course for larger models it is also a matter of GPU RAM.
But in this case it is the pure compute performance that is compared in this simple test.
This simple test already gives an indication of the type of performance you can expect in general from modern hardware.
I might redo the test with older GPUs, AMD CPUs and older NVIDIA cards just to get an idea.
But I would also like to hear about your experience, especially if you have access to NVIDIA enterprise hardware like an H100.
I gave up formatting this richtext. The Domino blog template is unbelievably broken. I should move, but I don't want to lose my blog history.
When saving the document it always gets messed up again. It removes and adds new lines in a very odd way on save.
-- Anyhow here is the text --
All the tests have been performed on Linux with this simple command:
./llama-bench -m qwen2.5-0.5b-instruct-q3_k_m.gguf
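In the result tables, pp512 is the prompt processing test (512 tokens) and tg128 is the token generation test (128 tokens), both reported in tokens per second. llama-bench also takes a few switches to control these sizes; the following is a sketch from memory of the common options, so please verify against ./llama-bench --help for your build:

```shell
# -p 512  : prompt processing test size (the pp512 rows)
# -n 128  : token generation test size (the tg128 rows)
# -ngl 99 : offload (up to) all model layers to the GPU, if one is present
# -r 5    : repetitions; the ± column is the variation across runs
./llama-bench -m qwen2.5-0.5b-instruct-q3_k_m.gguf -p 512 -n 128 -ngl 99 -r 5
```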
Hosted server Intel Xeon Processor (Icelake)
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | RPC | 99 | pp512 | 196.84 ± 0.50 |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | RPC | 99 | tg128 | 64.26 ± 0.20 |
Proxmox 12th Gen Intel(R) Core(TM) i9-12900HK
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | RPC | 99 | pp512 | 352.61 ± 25.50 |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | RPC | 99 | tg128 | 130.07 ± 2.84 |
Apple M4
| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | Metal,BLAS,RPC | 4 | pp512 | 2294.79 ± 45.80 |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | Metal,BLAS,RPC | 4 | tg128 | 150.85 ± 2.63 |
NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | CUDA | 99 | pp512 | 16827.08 ± 228.85 |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | CUDA | 99 | tg128 | 288.66 ± 1.67 |
NVIDIA RTX 4000 SFF Ada Generation, compute capability 8.9, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | CUDA | 99 | pp512 | 21249.40 ± 86.94 |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | CUDA | 99 | tg128 | 363.51 ± 1.70 |
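To put the tables in perspective, the relative token generation (tg128) speedups can be computed directly from the numbers above, taking the hosted Xeon server as the baseline (the pp512 gap on the CUDA cards is far larger still):

```shell
# Relative tg128 throughput, Xeon (64.26 t/s) = 1.0x baseline
awk 'BEGIN {
  base = 64.26
  printf "i9-12900HK: %.1fx\n", 130.07 / base
  printf "Apple M4:   %.1fx\n", 150.85 / base
  printf "RTX 4060:   %.1fx\n", 288.66 / base
  printf "RTX 4000:   %.1fx\n", 363.51 / base
}'
# → roughly 2.0x, 2.3x, 4.5x and 5.7x
```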