Domino on Linux/Unix, Troubleshooting, Best Practices, Tips and more ...


Daniel Nashed

Simple LLM performance test for different GPU and CPU hardware

Daniel Nashed – 26 April 2025 23:13:48

The llama.cpp project is the base for many LLM solutions.
It is also the base for Ollama. Picking the right hardware can be quite tricky.

The project includes a simple performance test. I took a quick look, picking a small model which runs quite well even on modern CPU hardware.
With modern CPUs you can get decent performance for small models with a few parallel requests.

The following is what I tested today on different hardware.
Modern Intel CPUs are good for simple testing.

Apple Silicon is already better. A modern NVIDIA GPU, even in a notebook, has much better performance.

Small enterprise GPUs have better performance even for small LLMs.

Of course, for larger models GPU RAM also matters.
But in this case it is the pure compute performance that this simple test compares.

This simple test already gives a general idea of what type of performance you can expect from modern hardware.
I might redo the test with older GPUs, AMD CPUs and older NVIDIA cards just to get an idea.

But I would also like to hear about your experience, especially if you have access to NVIDIA enterprise hardware like an H100.



I gave up formatting this richtext. The Domino blog template is unbelievably broken. I should move, but I don't want to lose my blog history.

When saving the document, it always gets messed up again. It removes and adds new lines in a very weird way when saved.


-- Anyhow here is the text --



All tests were performed on Linux with this simple command. In the results below, pp512 measures prompt processing throughput (a 512-token prompt) and tg128 measures token generation (128 tokens), both in tokens per second (t/s):



./llama-bench -m qwen2.5-0.5b-instruct-q3_k_m.gguf
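For reference, a minimal sketch of how to get to that point, assuming a Linux box with git, cmake and a compiler; the model file name matches the one used above, and the download URL assumes the official Qwen GGUF repository on Hugging Face:

```shell
# Sketch: build llama.cpp and fetch the small Qwen2.5 0.5B model (assumed URL).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON        # drop -DGGML_CUDA=ON for a CPU-only build
cmake --build build --config Release -j

# Q3_K_M quantization of Qwen2.5 0.5B instruct (assumed Hugging Face path)
wget https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q3_k_m.gguf

./build/bin/llama-bench -m qwen2.5-0.5b-instruct-q3_k_m.gguf
```

llama-bench also accepts flags such as -p and -n to change the prompt and generation lengths, and -ngl to control how many layers are offloaded to the GPU.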



Hosted server: Intel Xeon Processor (Icelake)

| model                  |       size |   params | backend | ngl | test  |            t/s |
| ---------------------- | ---------: | -------: | ------- | --: | ----- | -------------: |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | RPC     |  99 | pp512 |  196.84 ± 0.50 |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | RPC     |  99 | tg128 |   64.26 ± 0.20 |



Proxmox: 12th Gen Intel(R) Core(TM) i9-12900HK

| model                  |       size |   params | backend | ngl | test  |            t/s |
| ---------------------- | ---------: | -------: | ------- | --: | ----- | -------------: |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | RPC     |  99 | pp512 | 352.61 ± 25.50 |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | RPC     |  99 | tg128 |  130.07 ± 2.84 |

Apple M4

| model                  |       size |   params | backend        | threads | test  |             t/s |
| ---------------------- | ---------: | -------: | -------------- | ------: | ----- | --------------: |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | Metal,BLAS,RPC |       4 | pp512 | 2294.79 ± 45.80 |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | Metal,BLAS,RPC |       4 | tg128 |   150.85 ± 2.63 |


NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes

| model                  |       size |   params | backend | ngl | test  |               t/s |
| ---------------------- | ---------: | -------: | ------- | --: | ----- | ----------------: |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | CUDA    |  99 | pp512 | 16827.08 ± 228.85 |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | CUDA    |  99 | tg128 |     288.66 ± 1.67 |

NVIDIA RTX 4000 SFF Ada Generation, compute capability 8.9, VMM: yes

| model                  |       size |   params | backend | ngl | test  |              t/s |
| ---------------------- | ---------: | -------: | ------- | --: | ----- | ---------------: |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | CUDA    |  99 | pp512 | 21249.40 ± 86.94 |
| qwen2 1B Q3_K - Medium | 406.35 MiB | 630.17 M | CUDA    |  99 | tg128 |    363.51 ± 1.70 |
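To put the tg128 generation numbers in relation, here is a small shell helper (using the measured values from the tables above) that expresses each result as a speedup over the Icelake Xeon baseline:

```shell
# Compute tg128 speedups relative to the Icelake Xeon baseline (64.26 t/s).
# The t/s values are the measured results from the tables above.
speedup() { awk -v t="$1" -v b="$2" 'BEGIN { printf "%.1f", t / b }'; }

baseline=64.26
s_i9=$(speedup 130.07 "$baseline")    # Core i9-12900HK
s_m4=$(speedup 150.85 "$baseline")    # Apple M4
s_4060=$(speedup 288.66 "$baseline")  # RTX 4060 Laptop GPU
s_ada=$(speedup 363.51 "$baseline")   # RTX 4000 SFF Ada Generation

printf 'i9-12900HK: %sx  M4: %sx  RTX 4060: %sx  RTX 4000 Ada: %sx\n' \
       "$s_i9" "$s_m4" "$s_4060" "$s_ada"
```

Even the consumer laptop GPU generates tokens roughly four to five times faster than the hosted server CPU, while the small enterprise card is close to six times faster.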




