Domino on Linux/Unix, Troubleshooting, Best Practices, Tips and more ...

Daniel Nashed

Monitoring NVIDIA GPUs

Daniel Nashed – 1 December 2024 09:23:02

NVIDIA tools

NVIDIA has great tools and toolkits. My focus right now is mostly Linux, but some of the tools are also available for Windows.
Some useful tools already come with Ubuntu. For example, nvtop is a great and simple-to-use ad-hoc monitoring tool.


NVIDIA card under load

The NVIDIA RTX 4000 SFF Ada Generation is a small server grade card with 20 GB of RAM.

You can see the fan is already at around 50% and the temperature on the card goes up. Power consumption is at 65 of 70 W.
My load test in this example is a simple multi-threaded server task performing LLAMA requests.

You can see all the relevant parameters monitored here. This includes the processes using the GPU. In my case that is the llama-server process, which you can see at the very bottom of the screenshot.


Image:Monitoring NVIDIA GPUs


nvtop is a great tool, but it is not really helpful for long-term monitoring or for seeing those stats in combination with other stats.
There are some Grafana integrations which use the NVIDIA toolkit to read the stats at a low level.


Other projects use nvidia-smi to query information about the card. It can also print stats continuously, like other Linux tools (vmstat, iostat).
Besides the stats, it can also print a timestamp and the card model.

Here is a sample command line and output:

nvidia-smi -l 1 --format=csv --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used
2024/12/01 09:59:36.259, NVIDIA RTX 4000 SFF Ada Generation, 00000000:01:00.0, 565.57.01, P0, 4, 4, 57, 84 %, 67 %, 20475 MiB, 6514 MiB, 13532 MiB
2024/12/01 09:59:37.264, NVIDIA RTX 4000 SFF Ada Generation, 00000000:01:00.0, 565.57.01, P0, 4, 4, 57, 68 %, 39 %, 20475 MiB, 6514 MiB, 13532 MiB
2024/12/01 09:59:38.265, NVIDIA RTX 4000 SFF Ada Generation, 00000000:01:00.0, 565.57.01, P0, 4, 4, 57, 66 %, 39 %, 20475 MiB, 6514 MiB, 13532 MiB
2024/12/01 09:59:39.266, NVIDIA RTX 4000 SFF Ada Generation, 00000000:01:00.0, 565.57.01, P0, 4, 4, 58, 65 %, 41 %, 20475 MiB, 6514 MiB, 13532 MiB
2024/12/01 09:59:40.267, NVIDIA RTX 4000 SFF Ada Generation, 00000000:01:00.0, 565.57.01, P0, 4, 4, 58, 68 %, 40 %, 20475 MiB, 6514 MiB, 13532 MiB
2024/12/01 09:59:41.268, NVIDIA RTX 4000 SFF Ada Generation, 00000000:01:00.0, 565.57.01, P0, 4, 4, 58, 66 %, 43 %, 20475 MiB, 6514 MiB, 13532 MiB

The example above shows the values which can be printed. Not all of those parameters make sense to combine.
For example, the card model might need to be queried only once.
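
One way to split the query, as a minimal sketch: the static details are read once, and only the values that change under load are polled. The function names and the NVIDIA_SMI variable are made up for this example; the field names are the ones from the command above.

```shell
# Path to the tool; NVIDIA_SMI is a hypothetical override used only in this sketch.
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"

# Static details (model, bus ID, driver, total memory): query once, no -l.
gpu_static_info() {
    "$NVIDIA_SMI" --format=csv \
        --query-gpu=name,pci.bus_id,driver_version,memory.total
}

# Values that change under load: repeat every N seconds (default: 1).
gpu_poll_stats() {
    "$NVIDIA_SMI" -l "${1:-1}" --format=csv \
        --query-gpu=timestamp,pstate,temperature.gpu,utilization.gpu,utilization.memory,memory.used
}
```

Calling gpu_static_info once at startup and then gpu_poll_stats 5 would print the changing values every five seconds until interrupted.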

There are also options to leave out the header and the units: --format=csv,noheader,nounits. Those help to integrate the results into your own applications.
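
With the header and units removed, each line is just comma-separated values, which makes it easy to map them to named statistics. Here is a minimal sketch using awk, assuming the separator is a comma followed by a space as in the sample output above. The function name, the GPU_FIELDS variable and the "GPU." prefix are invented for this example.

```shell
# Fields to query; the same list is used to name the values in the output.
GPU_FIELDS="temperature.gpu,utilization.gpu,utilization.memory,memory.used"

# Read CSV lines (no header, no units) from stdin and print "name=value" lines.
# The "GPU." prefix is a made-up naming scheme for this sketch.
gpu_csv_to_stats() {
    awk -F', *' -v fields="$GPU_FIELDS" '
        BEGIN { n = split(fields, names, ",") }
        {
            # Pair each value with the field name at the same position.
            for (i = 1; i <= n && i <= NF; i++)
                printf "GPU.%s=%s\n", names[i], $i
        }'
}
```

It could be fed like this: nvidia-smi --format=csv,noheader,nounits --query-gpu="$GPU_FIELDS" | gpu_csv_to_stats. With more than one card, every input line produces the same set of names, so a real integration would also need to add the card index or bus ID to the name.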


Next Steps

Those tools open the door for all kinds of statistics integrations.
I could, for example, add those stats to the Domino server stats and leverage existing integrations or the out-of-the-box Domino statistics collection.

For now, the simple graphical view and the command line are already very helpful for monitoring the card.

The nvidia-smi command line is also available on Windows. nvtop is Linux only, but NVIDIA also has Windows-based applications to show the status of your cards.


