Operator

CUDAExperiment is the user-facing resource. Users describe the workload container, command, GPU request, runtime class, and whether KRATOS should profile before normal execution.

Current Reconciliation Flow

When spec.profilingEnabled is false:

ensure <experiment>-execution exists;
record ExecutionJobCreated in status.

When spec.profilingEnabled is true:

check for a WorkloadProfile named <experiment>-profile;
if the profile is missing, ensure the <experiment>-profiling Job exists;
when the profiling Job completes, read the <experiment>-profile-summary ConfigMap;
create or update <experiment>-profile and write the parsed metrics into WorkloadProfile.status;
on a later reconcile, observe the profile and ensure the <experiment>-execution Job exists.

When profiling is enabled, the controller does not create the execution Job until a profile exists, so profiling always finishes before normal execution starts.
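The flow above can be read as a small decision function that picks one step per reconcile pass. The following is a minimal sketch in Python, not the controller's actual code: the `Observed` type, its field names, and the returned strings are hypothetical, and the handling of a failed profiling Job is not covered because it is not described here.

```python
from dataclasses import dataclass


@dataclass
class Observed:
    """Cluster state visible to one reconcile pass (hypothetical shape)."""
    profiling_enabled: bool
    profile_exists: bool = False
    profiling_job_exists: bool = False
    profiling_job_complete: bool = False


def next_action(name: str, obs: Observed) -> str:
    """Return the single step one reconcile pass would take."""
    # Without profiling, or once a profile exists, go straight to execution.
    if not obs.profiling_enabled or obs.profile_exists:
        return f"ensure Job {name}-execution"
    # Profile is missing: the profiling Job must exist first.
    if not obs.profiling_job_exists:
        return f"ensure Job {name}-profiling"
    # Profiling finished: turn the summary ConfigMap into a WorkloadProfile.
    if obs.profiling_job_complete:
        return (f"write WorkloadProfile {name}-profile "
                f"from ConfigMap {name}-profile-summary")
    return "wait for profiling Job"


print(next_action("demo", Observed(profiling_enabled=True)))
# prints: ensure Job demo-profiling
```

Because the execution step is only reachable when profiling is disabled or a profile already exists, the sketch never requests both Jobs in the same pass.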

Profiling Job

The profiling Job keeps the workload image independent of Nsight Compute, so the workload image does not need to bundle the profiler:

stage-workload initContainer uses spec.image, verifies spec.command[0], copies the CUDA executable into a shared emptyDir, and exits.
profiling-runner uses kratos-nsight-compute-poc:latest by default, requests nvidia.com/gpu, launches the staged executable under ncu --set basic, imports raw metrics, parses a summary, and publishes a ConfigMap.

The runner image can be overridden with KRATOS_NSIGHT_COMPUTE_IMAGE on the controller manager.
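To make the two-container layout and the image override concrete, here is a rough sketch that builds the pod spec as a plain Python dict. It is illustrative only: the `/staged` mount path, the `staged` volume name, the `sh -c` staging command, and the `test -x` check standing in for "verifies spec.command[0]" are all assumptions, and the runner's metric import, summary parsing, and ConfigMap publishing steps are left out.

```python
import os
import posixpath

DEFAULT_RUNNER_IMAGE = "kratos-nsight-compute-poc:latest"
STAGE_DIR = "/staged"  # hypothetical mount path for the shared emptyDir


def runner_image(env=None) -> str:
    """Pick the profiling-runner image, honouring the controller override."""
    env = os.environ if env is None else env
    return env.get("KRATOS_NSIGHT_COMPUTE_IMAGE") or DEFAULT_RUNNER_IMAGE


def profiling_pod_spec(image: str, command: list, gpus: int, env=None) -> dict:
    """Build a pod spec with the stage-workload and profiling-runner containers."""
    staged = f"{STAGE_DIR}/{posixpath.basename(command[0])}"
    mount = {"name": "staged", "mountPath": STAGE_DIR}
    return {
        "restartPolicy": "Never",
        "volumes": [{"name": "staged", "emptyDir": {}}],
        "initContainers": [{
            "name": "stage-workload",
            "image": image,  # spec.image: the unmodified workload image
            # Assumed check: fail if the executable is missing, else copy it.
            "command": ["sh", "-c",
                        f'test -x "{command[0]}" && cp "{command[0]}" "{staged}"'],
            "volumeMounts": [mount],
        }],
        "containers": [{
            "name": "profiling-runner",
            "image": runner_image(env),
            # Launch the staged executable under Nsight Compute.
            "command": ["ncu", "--set", "basic", staged, *command[1:]],
            "resources": {"limits": {"nvidia.com/gpu": gpus}},
            "volumeMounts": [mount],
        }],
    }
```

Only the init container runs the workload image, which is why the workload image itself never needs `ncu` installed.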

For the NVIDIA vectorAdd sample container, use:

spec:
  image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
  command:
    - /cuda-samples/vectorAdd
  runtimeClassName: nvidia
  gpuRequired: 1
  replicas: 1
  profilingEnabled: true

Outputs

Profiling writes:

Job: <experiment>-profiling
ConfigMap: <experiment>-profile-summary
WorkloadProfile: <experiment>-profile

Example WorkloadProfile.status from a real Nsight Compute run:

status:
  boundType: unknown
  metrics:
    achievedOccupancy: "78.21"
    l2Throughput: "30.22"
    smThroughput: "12.56"
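The metric values in this example are quoted strings, so a consumer has to convert them before comparing or aggregating. A small sketch of that conversion, assuming the status has already been loaded into a dict (the `metric_values` helper is hypothetical):

```python
def metric_values(status: dict) -> dict:
    """Convert the string-encoded metrics in WorkloadProfile.status to floats."""
    return {name: float(raw) for name, raw in status.get("metrics", {}).items()}


status = {
    "boundType": "unknown",
    "metrics": {
        "achievedOccupancy": "78.21",
        "l2Throughput": "30.22",
        "smThroughput": "12.56",
    },
}
print(metric_values(status))
# prints: {'achievedOccupancy': 78.21, 'l2Throughput': 30.22, 'smThroughput': 12.56}
```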

Inspect:

kubectl logs job/<experiment>-profiling
kubectl get configmap <experiment>-profile-summary -o yaml
kubectl get workloadprofile <experiment>-profile -o yaml
kubectl get cudaexperiment <experiment> -o yaml

Current Caveat

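The controller watches CUDAExperiment resources and the Jobs it owns, but it does not yet watch owned WorkloadProfile resources. Creating a profile therefore does not trigger a new reconcile on its own, and the execution Job is only created once something else causes one. For manual validation, force a reconcile by updating an annotation on the experiment: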

kubectl annotate cudaexperiment <experiment> \
  validation.gpu.scheduler.io/reconcile-at=manual --overwrite
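When the trigger is scripted or repeated, it may help to write a value that changes on every call, since re-applying an identical annotation value typically leaves the object unchanged and so may not produce a watch event. The sketch below only builds the argument list; the `annotate_command` helper and the use of a Unix timestamp as the value are assumptions, not part of KRATOS.

```python
import time

ANNOTATION = "validation.gpu.scheduler.io/reconcile-at"


def annotate_command(experiment: str, now=None) -> list:
    """Build a kubectl command that stamps the experiment with the current time."""
    stamp = int(time.time() if now is None else now)
    return [
        "kubectl", "annotate", "cudaexperiment", experiment,
        f"{ANNOTATION}={stamp}", "--overwrite",
    ]


# To run it: subprocess.run(annotate_command("demo"), check=True)
print(" ".join(annotate_command("demo", now=1700000000)))
```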

The next controller step should add a WorkloadProfile watch or explicitly requeue after successful profile creation.

Longer-Term Direction

The current controller only validates the profiling pipeline. Future scheduling work should use workload profiles, GPU telemetry, node constraints, and Volcano integration to produce placement hints before execution.
