Operator
CUDAExperiment is the user-facing resource. Users describe the workload
container, command, GPU request, runtime class, and whether KRATOS should profile
before normal execution.
Current Reconciliation Flow
When spec.profilingEnabled is false:
- ensure <experiment>-execution exists;
- record ExecutionJobCreated in status.
When spec.profilingEnabled is true:
- check for a WorkloadProfile named <experiment>-profile;
- if the profile is missing, ensure <experiment>-profiling exists;
- when the profiling Job completes, read the <experiment>-profile-summary ConfigMap;
- create/update <experiment>-profile and write parsed metrics into WorkloadProfile.status;
- on a later reconcile, observe the profile and ensure <experiment>-execution exists.
The profiling and execution Jobs are mutually exclusive until a profile exists.
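The flow above can be read as a small decision function. The Python sketch below is illustrative only: the ObservedState fields and the action strings are hypothetical names, not the controller's actual types, and the real controller may order its checks differently.

```python
from dataclasses import dataclass


@dataclass
class ObservedState:
    """Hypothetical snapshot of what one reconcile pass sees in the cluster."""
    profiling_enabled: bool
    profile_exists: bool = False
    profiling_job_exists: bool = False
    profiling_job_complete: bool = False


def next_action(state: ObservedState) -> str:
    """Return the single step a reconcile pass would take (illustrative)."""
    if not state.profiling_enabled:
        # No profiling requested: go straight to the execution Job.
        return "ensure-execution-job"
    if state.profile_exists:
        # A later reconcile observes the profile and moves on to execution.
        return "ensure-execution-job"
    if not state.profiling_job_exists:
        return "ensure-profiling-job"
    if not state.profiling_job_complete:
        return "wait-for-profiling-job"
    # Profiling finished: read the summary ConfigMap and publish the profile.
    return "create-or-update-profile"
```

Because the execution branch is reached only when profiling is disabled or a profile already exists, the sketch never asks for both Jobs in the same pass.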
Profiling Job
The profiling Job keeps the workload image independent from Nsight Compute:
- stage-workload initContainer uses spec.image, verifies spec.command[0], copies the CUDA executable into a shared emptyDir, and exits.
- profiling-runner uses kratos-nsight-compute-poc:latest by default, requests nvidia.com/gpu, launches the staged executable under ncu --set basic, imports raw metrics, parses a summary, and publishes a ConfigMap.
The runner image can be overridden with KRATOS_NSIGHT_COMPUTE_IMAGE on the
controller manager.
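As a rough illustration of the two-container layout, the Python sketch below builds a pod spec as a plain dictionary. The container names, the default runner image, the environment variable, and the ncu --set basic invocation come from the description above. Everything else is an assumption made for the example: the /staged mount path, the staged-workload volume name, the shell command used by the init container (which presumes the workload image ships sh, test, and cp), and the omission of runtimeClassName. The runner's later steps (importing raw metrics, parsing the summary, publishing the ConfigMap) are not modeled.

```python
import os
import posixpath

DEFAULT_RUNNER_IMAGE = "kratos-nsight-compute-poc:latest"
STAGE_DIR = "/staged"  # hypothetical mount path for the shared emptyDir


def runner_image() -> str:
    """Resolve the runner image, honouring the controller-manager override."""
    return os.environ.get("KRATOS_NSIGHT_COMPUTE_IMAGE") or DEFAULT_RUNNER_IMAGE


def profiling_pod_spec(image: str, command: list[str], gpus: int = 1) -> dict:
    """Build an illustrative pod spec for the profiling Job."""
    executable = command[0]
    staged = posixpath.join(STAGE_DIR, posixpath.basename(executable))
    mount = {"name": "staged-workload", "mountPath": STAGE_DIR}
    return {
        "restartPolicy": "Never",
        "volumes": [{"name": "staged-workload", "emptyDir": {}}],
        "initContainers": [{
            "name": "stage-workload",
            "image": image,  # the user's workload image (spec.image)
            # Fail if the executable is missing; otherwise copy it and exit.
            "command": ["sh", "-c",
                        f"test -x {executable} && cp {executable} {staged}"],
            "volumeMounts": [mount],
        }],
        "containers": [{
            "name": "profiling-runner",
            "image": runner_image(),
            # Only the profiling launch is shown here.
            "command": ["ncu", "--set", "basic", staged, *command[1:]],
            "resources": {"limits": {"nvidia.com/gpu": gpus}},
            "volumeMounts": [mount],
        }],
    }
```

Keeping Nsight Compute in the runner image means the workload image only has to contain the CUDA executable, which is the separation the section describes.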
For the NVIDIA sample container use:
spec:
  image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
  command:
    - /cuda-samples/vectorAdd
  runtimeClassName: nvidia
  gpuRequired: 1
  replicas: 1
  profilingEnabled: true
Outputs
Profiling writes:
- Job: <experiment>-profiling
- ConfigMap: <experiment>-profile-summary
- WorkloadProfile: <experiment>-profile
Example WorkloadProfile.status from a real Nsight run:
status:
  boundType: unknown
  metrics:
    achievedOccupancy: "78.21"
    l2Throughput: "30.22"
    smThroughput: "12.56"
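The metric values in this status are strings, so a consumer has to convert them before comparing or aggregating. A minimal Python sketch, assuming the status has already been loaded into a dictionary (for example from the YAML output of kubectl); the helper name is hypothetical and the units of each metric are not stated here, so the code makes no assumption about them:

```python
def metrics_as_floats(status: dict) -> dict[str, float]:
    """Convert the string-valued metrics in WorkloadProfile.status to floats."""
    return {name: float(value)
            for name, value in status.get("metrics", {}).items()}


# The example status shown above, as a Python dictionary.
status = {
    "boundType": "unknown",
    "metrics": {
        "achievedOccupancy": "78.21",
        "l2Throughput": "30.22",
        "smThroughput": "12.56",
    },
}

metrics = metrics_as_floats(status)
```

A status with no metrics key yields an empty dictionary rather than an error, which may be convenient while a profile is still being populated.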
Inspect:
kubectl logs job/<experiment>-profiling
kubectl get configmap <experiment>-profile-summary -o yaml
kubectl get workloadprofile <experiment>-profile -o yaml
kubectl get cudaexperiment <experiment> -o yaml
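When validating several experiments, the same four commands can be generated from the experiment name. The Python sketch below only builds and prints the command lines; the function name and the vectoradd experiment name are placeholders chosen for the example.

```python
import shlex


def inspect_commands(experiment: str) -> list[list[str]]:
    """Return the kubectl invocations listed above for one experiment."""
    return [
        ["kubectl", "logs", f"job/{experiment}-profiling"],
        ["kubectl", "get", "configmap", f"{experiment}-profile-summary", "-o", "yaml"],
        ["kubectl", "get", "workloadprofile", f"{experiment}-profile", "-o", "yaml"],
        ["kubectl", "get", "cudaexperiment", experiment, "-o", "yaml"],
    ]


if __name__ == "__main__":
    # Print shell-quoted commands; pass each list to subprocess.run to execute.
    for argv in inspect_commands("vectoradd"):
        print(shlex.join(argv))
```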
Current Caveat
The controller watches CUDAExperiment and owned Jobs. It does not yet watch
owned WorkloadProfile resources. After the controller creates a profile,
another reconcile is needed before the execution Job is created. For manual
validation:
kubectl annotate cudaexperiment <experiment> \
validation.gpu.scheduler.io/reconcile-at=manual --overwrite
The next controller step should add a WorkloadProfile watch or explicitly
requeue after successful profile creation.
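The effect of the missing watch can be shown with a toy event loop. The Python sketch below is a simplified model, not the controller's code: each queued event triggers one reconcile, Job completion is assumed to arrive as an event because owned Jobs are watched, and the resource and event names are made up for the illustration.

```python
from collections import deque


def run_controller(watch_profiles: bool, manual_annotation: bool = False) -> set[str]:
    """Simulate reconcile triggers for one experiment with profiling enabled.

    Returns the resources that exist once the event queue drains.
    """
    resources: set[str] = set()
    events = deque(["cudaexperiment-created"])  # each event triggers one reconcile

    while events:
        events.popleft()
        if "profile" in resources:
            resources.add("execution-job")
        elif "profiling-job" not in resources:
            resources.add("profiling-job")
            # Owned Jobs are watched, so completion triggers another reconcile.
            events.append("profiling-job-completed")
        else:
            resources.add("profile")
            if watch_profiles:
                # Proposed fix: a WorkloadProfile watch (or explicit requeue).
                events.append("profile-created")
            elif manual_annotation:
                # Workaround: annotating the CUDAExperiment retriggers reconcile.
                events.append("cudaexperiment-annotated")
    return resources
```

With watch_profiles=False and no annotation, the queue drains right after the profile is created and the execution Job never appears; either the watch or the manual annotation supplies the extra reconcile.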
Longer-Term Direction
The current controller only validates the profiling pipeline. Future scheduling work should use workload profiles, GPU telemetry, node constraints, and Volcano integration to produce placement hints before execution.