Nsight Compute Profiling PoC

This PoC validates the current profiling pipeline end to end:

a CUDAExperiment with spec.profilingEnabled: true creates a profiling Job;
the profiling Job stages the workload executable from the workload image;
the profiling runner launches that executable under Nsight Compute;
the runner publishes a profile summary ConfigMap;
the controller creates a WorkloadProfile and writes parsed metrics into status;
a later reconcile observes the profile and creates the execution Job.

The parser and classification are still intentionally small. The important PoC signal is that the metrics come from a real ncu run, not from synthetic data.
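To make "parsed metrics" concrete, the sketch below pulls three values out of CSV-style ncu output. It is not the PoC's parser. The column names (`Metric Name`, `Metric Value`), the full metric identifiers, and their mapping to `smThroughput`, `l2Throughput`, and `achievedOccupancy` are assumptions for illustration only.

```python
import csv
import io

# Assumed mapping from ncu metric names to summary keys (hypothetical;
# the real parser may track different metrics or suffixes).
METRIC_KEYS = {
    "sm__throughput.avg.pct_of_peak_sustained_elapsed": "smThroughput",
    "lts__throughput.avg.pct_of_peak_sustained_elapsed": "l2Throughput",
    "sm__warps_active.avg.pct_of_peak_sustained_active": "achievedOccupancy",
}


def parse_ncu_csv(text):
    """Return {summary_key: value_string}, keeping the first row seen per metric."""
    # Banner lines (==PROF== ...) and the workload's own stdout can precede
    # the CSV, so start reading at the header row.
    lines = text.splitlines()
    start = next(
        (i for i, line in enumerate(lines) if '"Metric Name"' in line), None
    )
    if start is None:
        return {}
    summary = {}
    for row in csv.DictReader(io.StringIO("\n".join(lines[start:]))):
        key = METRIC_KEYS.get(row.get("Metric Name"))
        if key and key not in summary:
            summary[key] = row["Metric Value"]
    return summary
```

Values are kept as strings here, which matches the quoted numbers shown in the example `WorkloadProfile` status later in this document.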

Images

Build the controller:

make docker-build IMG=kratos-controller:profiling-poc
kind load docker-image kratos-controller:profiling-poc --name kratos-gpu
make deploy IMG=kratos-controller:profiling-poc
kubectl rollout restart -n kratos-system deployment/kratos-controller-manager
kubectl rollout status -n kratos-system deployment/kratos-controller-manager

Build the profiling runner:

cd test/nsight-compute-poc
make build
make load CLUSTER=kratos-gpu

make build uses Dockerfile.runtime, based on nvidia/cuda:12.4.1-base-ubuntu22.04, and installs only the runtime pieces needed by the PoC: Nsight Compute CLI, Python, kubectl, the parser, and /scripts/profile.sh.

Test Experiment

Apply the CRDs/RBAC if they are not already deployed:

kubectl apply -f config/crd/bases/gpu.scheduler.io_workloadprofiles.yaml
kubectl apply -f config/rbac/profiling_runner_configmap_role.yaml
kubectl apply -f config/rbac/profiling_runner_configmap_role_binding.yaml

Create or reapply a profiling experiment:

apiVersion: gpu.scheduler.io/v1alpha1
kind: CUDAExperiment
metadata:
  name: cuda-vector-add-validation
  namespace: default
spec:
  image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
  command:
    - /cuda-samples/vectorAdd
  runtimeClassName: nvidia
  replicas: 1
  gpuRequired: 1
  profilingEnabled: true

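For a clean rerun, delete the derived resources and reapply the manifest (saved here as tmp/cudaexperiment-profiling-validation.yaml):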

kubectl delete workloadprofile cuda-vector-add-validation-profile --ignore-not-found
kubectl delete configmap cuda-vector-add-validation-profile-summary --ignore-not-found
kubectl delete job cuda-vector-add-validation-profiling --ignore-not-found
kubectl delete job cuda-vector-add-validation-execution --ignore-not-found
kubectl apply -f tmp/cudaexperiment-profiling-validation.yaml
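The resource names in the cleanup commands follow a visible pattern: each is the experiment name plus a fixed suffix. Assuming that pattern holds for other experiments too, a small helper (the name `cleanup_commands` is hypothetical) can generate the same delete commands for any experiment name.

```python
# Kinds and suffixes copied from the cleanup commands in this document.
OWNED_RESOURCES = [
    ("workloadprofile", "-profile"),
    ("configmap", "-profile-summary"),
    ("job", "-profiling"),
    ("job", "-execution"),
]


def cleanup_commands(experiment):
    """Build the kubectl delete commands for one experiment's derived resources."""
    return [
        f"kubectl delete {kind} {experiment}{suffix} --ignore-not-found"
        for kind, suffix in OWNED_RESOURCES
    ]
```

Calling `print("\n".join(cleanup_commands("cuda-vector-add-validation")))` reproduces the four delete lines above.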

Check the pipeline:

kubectl get jobs
kubectl get pods
kubectl logs job/cuda-vector-add-validation-profiling
kubectl get configmap cuda-vector-add-validation-profile-summary -o yaml
kubectl get workloadprofile cuda-vector-add-validation-profile -o yaml
kubectl get cudaexperiment cuda-vector-add-validation -o yaml

Expected markers:

profiling Job reaches Complete;
logs include ncu --version, Profiling "vectorAdd", and Test PASSED;
logs include real raw metrics such as sm__throughput and lts__throughput;
ConfigMap <experiment>-profile-summary contains summary.json;
WorkloadProfile.status.boundType and WorkloadProfile.status.metrics are populated.
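The last marker can be checked mechanically. This sketch takes the JSON form of the WorkloadProfile (for example, the output of `kubectl get workloadprofile cuda-vector-add-validation-profile -o json`) and reports which status fields are still empty. The function name is hypothetical, and it checks only that the fields are present, not that the values are plausible.

```python
import json


def profile_problems(workloadprofile_json):
    """Return a list of missing markers; an empty list means status looks populated."""
    status = json.loads(workloadprofile_json).get("status") or {}
    problems = []
    if not status.get("boundType"):
        problems.append("status.boundType is empty")
    if not status.get("metrics"):
        problems.append("status.metrics is empty")
    return problems
```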

Example result from the local kratos-gpu cluster:

status:
  boundType: unknown
  metrics:
    achievedOccupancy: "78.21"
    l2Throughput: "30.22"
    smThroughput: "12.56"

unknown is acceptable for the NVIDIA vectorAdd sample with the current thresholds. The PoC goal is real metric capture and controller propagation.
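To show how a threshold rule can produce `unknown` for these numbers, here is a minimal classification sketch. The 60% threshold and the `compute` and `memory` labels are hypothetical; this document does not state the controller's actual thresholds or label set.

```python
# Hypothetical threshold (percent of peak). The controller's real values
# are not given in this document.
BOUND_THRESHOLD = 60.0


def classify(metrics, threshold=BOUND_THRESHOLD):
    """Label a profile from string-valued percentages like those in status.metrics."""
    sm = float(metrics.get("smThroughput", 0))
    l2 = float(metrics.get("l2Throughput", 0))
    if sm >= threshold and sm >= l2:
        return "compute"  # hypothetical label
    if l2 >= threshold:
        return "memory"  # hypothetical label
    return "unknown"
```

With the example metrics above (SM throughput 12.56, L2 throughput 30.22), neither value reaches the assumed threshold, so the sketch returns `unknown`.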

Current Caveats

The controller currently watches CUDAExperiment and owned Jobs. It does not watch WorkloadProfile yet, so after a profile is created, a later reconcile is needed to create the execution Job. For manual validation, trigger one by changing an annotation on the experiment:

kubectl annotate cudaexperiment cuda-vector-add-validation \
  validation.gpu.scheduler.io/reconcile-at=manual --overwrite
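One thing to watch with this trigger: if the annotation already holds the value `manual`, reapplying it may leave the object unchanged, in which case no update event would reach the controller. The sketch below avoids that by defaulting to a timestamp value. The helper names are hypothetical, and `trigger_reconcile` assumes `kubectl` is on PATH and pointed at the right cluster.

```python
import subprocess
import time


def reconcile_command(experiment, value=None):
    """Build the kubectl annotate argv; a fresh value makes each call a real update."""
    value = value or str(int(time.time()))
    return [
        "kubectl", "annotate", "cudaexperiment", experiment,
        f"validation.gpu.scheduler.io/reconcile-at={value}",
        "--overwrite",
    ]


def trigger_reconcile(experiment):
    # Runs kubectl; raises CalledProcessError if the command fails.
    subprocess.run(reconcile_command(experiment), check=True)
```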

The runner needs GPU performance counter access. The PoC grants SYS_ADMIN to the profiling container. A production setup should replace that with an explicit cluster security policy.
The Nsight package is large, around 594 MB. Keep the built image loaded in kind during validation.
ncu --set basic is used because it works with Nsight Compute 2024.1.1 and collects enough metrics for this phase.
