Nsight Compute Profiling PoC
This PoC validates the current profiling pipeline end to end:
- a CUDAExperiment with spec.profilingEnabled: true creates a profiling Job;
- the profiling Job stages the workload executable from the workload image;
- the profiling runner launches that executable under Nsight Compute;
- the runner publishes a profile summary ConfigMap;
- the controller creates a WorkloadProfile and writes parsed metrics into status;
- a later reconcile observes the profile and creates the execution Job.
The parser and classification are still intentionally small. The important PoC
signal is that the metrics come from a real ncu run, not from synthetic data.
Images
Build the controller:
make docker-build IMG=kratos-controller:profiling-poc
kind load docker-image kratos-controller:profiling-poc --name kratos-gpu
make deploy IMG=kratos-controller:profiling-poc
kubectl rollout restart -n kratos-system deployment/kratos-controller-manager
kubectl rollout status -n kratos-system deployment/kratos-controller-manager
Build the profiling runner:
cd test/nsight-compute-poc
make build
make load CLUSTER=kratos-gpu
make build uses Dockerfile.runtime, based on
nvidia/cuda:12.4.1-base-ubuntu22.04, and installs only the runtime pieces
needed by the PoC: Nsight Compute CLI, Python, kubectl, the parser, and
/scripts/profile.sh.
Test Experiment
Apply the CRDs/RBAC if they are not already deployed:
kubectl apply -f config/crd/bases/gpu.scheduler.io_workloadprofiles.yaml
kubectl apply -f config/rbac/profiling_runner_configmap_role.yaml
kubectl apply -f config/rbac/profiling_runner_configmap_role_binding.yaml
Create or reapply a profiling experiment (the rerun commands below apply it from tmp/cudaexperiment-profiling-validation.yaml):
apiVersion: gpu.scheduler.io/v1alpha1
kind: CUDAExperiment
metadata:
  name: cuda-vector-add-validation
  namespace: default
spec:
  image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
  command:
    - /cuda-samples/vectorAdd
  runtimeClassName: nvidia
  replicas: 1
  gpuRequired: 1
  profilingEnabled: true
For a clean rerun:
kubectl delete workloadprofile cuda-vector-add-validation-profile --ignore-not-found
kubectl delete configmap cuda-vector-add-validation-profile-summary --ignore-not-found
kubectl delete job cuda-vector-add-validation-profiling --ignore-not-found
kubectl delete job cuda-vector-add-validation-execution --ignore-not-found
kubectl apply -f tmp/cudaexperiment-profiling-validation.yaml
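The cleanup commands above derive every object name from the experiment name. A minimal sketch that parameterizes them, assuming the suffixes shown above (-profile, -profile-summary, -profiling, -execution) hold for any experiment. The helper name clean_rerun is hypothetical and not part of the PoC.

```shell
#!/usr/bin/env bash
# clean_rerun EXPERIMENT MANIFEST
# Hypothetical helper: deletes the objects derived from one experiment,
# then reapplies its manifest. Suffixes are copied from the commands above.
clean_rerun() {
  local experiment="$1" manifest="$2"
  kubectl delete workloadprofile "${experiment}-profile" --ignore-not-found
  kubectl delete configmap "${experiment}-profile-summary" --ignore-not-found
  kubectl delete job "${experiment}-profiling" --ignore-not-found
  kubectl delete job "${experiment}-execution" --ignore-not-found
  kubectl apply -f "$manifest"
}

# Usage:
#   clean_rerun cuda-vector-add-validation tmp/cudaexperiment-profiling-validation.yaml
```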
Check the pipeline:
kubectl get jobs
kubectl get pods
kubectl logs job/cuda-vector-add-validation-profiling
kubectl get configmap cuda-vector-add-validation-profile-summary -o yaml
kubectl get workloadprofile cuda-vector-add-validation-profile -o yaml
kubectl get cudaexperiment cuda-vector-add-validation -o yaml
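The checks above can also be scripted instead of polled by hand. A sketch, assuming the Job and WorkloadProfile names follow the <experiment>-profiling and <experiment>-profile pattern used above. The function names and the 300s default timeout are illustrative choices, not part of the PoC.

```shell
#!/usr/bin/env bash
# wait_for_profiling EXPERIMENT [TIMEOUT]
# Blocks until the profiling Job reports the Complete condition.
# A failed Job never becomes Complete, so this waits out the full timeout.
wait_for_profiling() {
  local experiment="$1" timeout="${2:-300s}"
  kubectl wait --for=condition=complete "job/${experiment}-profiling" --timeout="$timeout"
}

# profile_bound_type EXPERIMENT
# Prints status.boundType of the WorkloadProfile created by the controller.
profile_bound_type() {
  local experiment="$1"
  kubectl get workloadprofile "${experiment}-profile" -o jsonpath='{.status.boundType}'
}

# Usage:
#   wait_for_profiling cuda-vector-add-validation && profile_bound_type cuda-vector-add-validation
```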
Expected markers:
- profiling Job reaches Complete;
- logs include ncu --version, Profiling "vectorAdd", and Test PASSED;
- logs include real raw metrics such as sm__throughput and lts__throughput;
- ConfigMap <experiment>-profile-summary contains summary.json;
- WorkloadProfile.status.boundType and WorkloadProfile.status.metrics are populated.
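The log markers listed above can be checked mechanically against a saved log. A minimal sketch, assuming the marker strings appear verbatim in the log. The helper name check_markers is hypothetical.

```shell
#!/usr/bin/env bash
# check_markers LOGFILE
# Prints one line per expected marker and returns non-zero if any is missing.
# Markers are matched as fixed strings, not regular expressions.
check_markers() {
  local logfile="$1" missing=0 marker
  for marker in 'ncu --version' 'Profiling "vectorAdd"' 'Test PASSED' \
                'sm__throughput' 'lts__throughput'; do
    if grep -qF -- "$marker" "$logfile"; then
      echo "ok: $marker"
    else
      echo "MISSING: $marker"
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   kubectl logs job/cuda-vector-add-validation-profiling > /tmp/profiling.log
#   check_markers /tmp/profiling.log
```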
Example result from the local kratos-gpu cluster:
status:
  boundType: unknown
  metrics:
    achievedOccupancy: "78.21"
    l2Throughput: "30.22"
    smThroughput: "12.56"
unknown is acceptable for the NVIDIA vectorAdd sample with the current
thresholds. The PoC goal is real metric capture and controller propagation.
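To illustrate why unknown is a plausible outcome for these numbers, consider a threshold classifier of the following shape. This is not the controller's implementation: the 60 percent cutoff and the compute-bound and memory-bound labels are assumptions made for illustration, and only the unknown value is confirmed by the output above.

```shell
#!/usr/bin/env bash
# classify_bound SM_THROUGHPUT L2_THROUGHPUT [CUTOFF]
# Illustrative only: cutoff (default 60) and the two non-unknown labels
# are hypothetical, not taken from the controller.
classify_bound() {
  local sm="$1" l2="$2" cutoff="${3:-60}"
  awk -v sm="$sm" -v l2="$l2" -v cutoff="$cutoff" 'BEGIN {
    # "+ 0" forces numeric comparison of the passed-in values
    if (sm + 0 >= cutoff + 0 && sm + 0 >= l2 + 0) { print "compute-bound" }
    else if (l2 + 0 >= cutoff + 0) { print "memory-bound" }
    else { print "unknown" }
  }'
}

# Usage, with the values from the example result above:
#   classify_bound 12.56 30.22
```

With both throughputs well below the assumed cutoff, neither branch fires and the sketch falls through to unknown, which matches the shape of the result above.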
Current Caveats
- The controller currently watches CUDAExperiment and owned Jobs. It does not watch WorkloadProfile yet, so after a profile is created a later reconcile is needed to create the execution Job. For manual validation:
  kubectl annotate cudaexperiment cuda-vector-add-validation \
    validation.gpu.scheduler.io/reconcile-at=manual --overwrite
- The runner needs GPU performance counter access. The PoC grants SYS_ADMIN to the profiling container. A production setup should replace that with an explicit cluster security policy.
- The Nsight package is large, around 594 MB. Keep the built image loaded in kind during validation.
- ncu --set basic is used because it works with Nsight Compute 2024.1.1 and collects enough metrics for this phase.
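On the first caveat: re-running the annotate command with the same value (manual) most likely leaves the object unchanged, so it may not trigger another reconcile. A sketch that writes a fresh UTC timestamp on each call, assuming the controller reacts to annotation updates on CUDAExperiment as the manual command implies. The helper name trigger_reconcile is hypothetical.

```shell
#!/usr/bin/env bash
# trigger_reconcile EXPERIMENT
# Sets the reconcile-at annotation to the current UTC time so that every
# call changes the object and produces an update event.
trigger_reconcile() {
  local experiment="$1" stamp
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  kubectl annotate cudaexperiment "$experiment" \
    "validation.gpu.scheduler.io/reconcile-at=${stamp}" --overwrite
}

# Usage:
#   trigger_reconcile cuda-vector-add-validation
```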