Observability

KRATOS uses Prometheus, Grafana, and DCGM Exporter to inspect GPU runtime metrics during local experiments.

This setup is intended for the local nvkind lab. Before using it on a production cluster, review the chart values, storage, ingress, authentication, and retention settings.

GPU Observability Setup

Install the Prometheus stack:

kubectl create namespace monitoring

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  -n monitoring

Install DCGM Exporter:

helm repo add dcgm-exporter https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update

helm install dcgm-exporter dcgm-exporter/dcgm-exporter \
  -n monitoring \
  -f cluster/dcgm-values.yaml

The values file (cluster/dcgm-values.yaml) pins DCGM Exporter to nodes labeled accelerator=nvidia (see GPU Node Label below), runs it with the NVIDIA runtime class, and requests one GPU:

nodeSelector:
  accelerator: nvidia

runtimeClassName: nvidia

resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    cpu: 100m
    memory: 128Mi

For nvkind, runtimeClassName: nvidia is required. Without it, the exporter container may start without the NVIDIA container environment it expects, fail to reach the GPU, and crash.
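Before installing the exporter, it can help to confirm that a RuntimeClass named nvidia actually exists in the cluster. The following is a minimal sketch, assuming the nvkind setup registers the RuntimeClass under that name; the helper name check_nvidia_runtimeclass is hypothetical.

```shell
# Hypothetical helper: report whether the "nvidia" RuntimeClass is present.
# Assumes kubectl is already pointed at the nvkind cluster.
check_nvidia_runtimeclass() {
  if kubectl get runtimeclass nvidia >/dev/null 2>&1; then
    echo "RuntimeClass nvidia found"
  else
    echo "RuntimeClass nvidia missing" >&2
    return 1
  fi
}
```

Run check_nvidia_runtimeclass before helm install. A non-zero exit status suggests the exporter pod would be rejected or would start without GPU access.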

GPU Node Label

Find the node with GPU capacity:

kubectl get nodes -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

Label that node so DCGM Exporter is scheduled there. Replace kratos-gpu-worker with the node name reported above if it differs:

kubectl label node kratos-gpu-worker accelerator=nvidia --overwrite
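The two steps above can also be combined so the node name does not need to be copied by hand. This is a sketch rather than part of the KRATOS tooling: the helper names gpu_nodes and label_gpu_nodes are hypothetical, and the sketch assumes every node with allocatable nvidia.com/gpu should receive the label.

```shell
# Hypothetical helper: print the names of nodes whose allocatable
# nvidia.com/gpu count is greater than zero.
gpu_nodes() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' |
    awk -F'\t' '$2 + 0 > 0 { print $1 }'
}

# Hypothetical helper: label every detected GPU node, then list the
# nodes that carry the label so the result can be checked.
label_gpu_nodes() {
  for node in $(gpu_nodes); do
    kubectl label node "$node" accelerator=nvidia --overwrite
  done
  kubectl get nodes -l accelerator=nvidia
}
```

Calling label_gpu_nodes should end with a node list that includes the GPU worker. An empty list means no node advertised GPU capacity.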

ServiceMonitor

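By default, the Prometheus instance installed by kube-prometheus-stack only selects ServiceMonitors whose release label matches the Helm release name (here, kube-prometheus-stack). Add that label to the DCGM ServiceMonitor: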

kubectl label servicemonitor dcgm-exporter \
  -n monitoring \
  release=kube-prometheus-stack \
  --overwrite
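To check the selection rule described above, compare the selector on the Prometheus resource with the label on the ServiceMonitor. A minimal sketch, assuming the resource names used in this guide; the helper names prometheus_selector and servicemonitor_selected are hypothetical.

```shell
# Hypothetical helper: print the ServiceMonitor selector configured on
# the Prometheus custom resource(s) in the monitoring namespace.
prometheus_selector() {
  kubectl get prometheus -n monitoring \
    -o jsonpath='{.items[*].spec.serviceMonitorSelector}'
}

# Hypothetical helper: succeed only if the DCGM ServiceMonitor carries
# the release label value used in this guide.
servicemonitor_selected() {
  label=$(kubectl get servicemonitor dcgm-exporter -n monitoring \
    -o jsonpath='{.metadata.labels.release}')
  [ "$label" = "kube-prometheus-stack" ]
}
```

If servicemonitor_selected fails, the DCGM target will likely be missing from the Prometheus targets page even though the exporter pod is running.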

Verify

Check the monitoring pods and services:

kubectl get pods -n monitoring
kubectl get svc -n monitoring

Port-forward Prometheus:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090

Query GPU metrics:

curl "http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL"
curl "http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_FB_USED"
curl "http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_POWER_USAGE"

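Each query should return one series per GPU: utilization (percent), framebuffer memory used (MiB), and power draw (watts). DCGM_FI_DEV_GPU_UTIL can remain 0 if the workload finishes before the exporter collects a sample, so use a workload that runs longer than the collection and scrape intervals.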
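For short workloads, querying the peak over a time window can be more informative than the instant value. The sketch below assumes Prometheus is port-forwarded to localhost:9090 as shown above; the helper names prom_query and gpu_util_peak and the 5-minute window are illustrative choices. It uses curl -G with --data-urlencode because the brackets and parentheses in PromQL need URL encoding.

```shell
# Hypothetical helper: run an instant PromQL query against the
# port-forwarded Prometheus and print the raw JSON response.
prom_query() {
  curl -sG "http://localhost:9090/api/v1/query" --data-urlencode "query=$1"
}

# Hypothetical helper: peak utilization per GPU over the last 5 minutes.
gpu_util_peak() {
  prom_query 'max_over_time(DCGM_FI_DEV_GPU_UTIL[5m])'
}
```

If gpu_util_peak still reports 0, no sample was collected while the workload was running.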

DCGM Exporter also exposes raw metrics on its /metrics endpoint.
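Reading the endpoint directly is a quick way to tell whether a missing metric is an exporter problem or a Prometheus scrape problem. This is a sketch under assumptions: the Service name dcgm-exporter and port 9400 are assumed from the Helm release name and the exporter's usual default, so confirm them with kubectl get svc -n monitoring. The helper name filter_metric is hypothetical.

```shell
# Hypothetical helper: keep only the sample lines for one metric,
# dropping the "# HELP" and "# TYPE" comment lines and other metrics.
filter_metric() {
  grep "^$1[{ ]"
}

# Usage (Service name and port are assumptions, see above):
#   kubectl port-forward -n monitoring svc/dcgm-exporter 9400:9400 &
#   curl -s http://localhost:9400/metrics | filter_metric DCGM_FI_DEV_GPU_UTIL
```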

Access Grafana Locally

kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80

Open http://localhost:3000.
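Grafana asks for a login. The Grafana subchart normally stores the admin credentials in a Secret named after the release. The sketch below assumes that Secret is called kube-prometheus-stack-grafana and holds an admin-password key; the helper name grafana_admin_password is hypothetical.

```shell
# Hypothetical helper: print the Grafana admin password stored in the
# Secret created by the chart (Secret name assumed from the release name).
grafana_admin_password() {
  kubectl get secret kube-prometheus-stack-grafana -n monitoring \
    -o jsonpath='{.data.admin-password}' | base64 --decode
}
```

The matching user name is stored under the admin-user key of the same Secret.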

Remove

helm uninstall dcgm-exporter -n monitoring
helm uninstall kube-prometheus-stack -n monitoring
kubectl delete namespace monitoring
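Helm does not delete CustomResourceDefinitions on uninstall, so the Prometheus Operator CRDs usually remain after the commands above. A minimal sketch for listing them, assuming they all belong to the monitoring.coreos.com API group; the helper name leftover_prometheus_crds is hypothetical.

```shell
# Hypothetical helper: list CRDs in the monitoring.coreos.com API group
# that are still present in the cluster. Exits non-zero if none remain.
leftover_prometheus_crds() {
  kubectl get crd -o name | grep 'monitoring\.coreos\.com$'
}
```

Review the list before deleting anything, since other Prometheus Operator installations in the same cluster would share these CRDs.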