Getting Started
KRATOS is a Go/Kubebuilder operator scaffold. It is not yet a complete, deployable GPU platform.
Required Tools
- Go 1.23 or newer
- Docker or another OCI runtime
- kubectl and kustomize
- Kubebuilder and controller-gen
- Access to a Kubernetes cluster for integration work
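As a quick sanity check before building, the tool list above can be turned into a small PATH probe. This is only a sketch: the binary names (go, docker, kubectl, kustomize, kubebuilder, controller-gen) are assumed from the list, the helper name missing_tools is made up for illustration, and the Go version is not checked.

```shell
#!/bin/sh
# Sketch: report which of the required CLI tools are not found on PATH.
# The binary names are assumptions derived from the tool list above.

# missing_tools prints, one per line, each argument that is not a command on PATH.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

# docker may legitimately be absent if another OCI runtime is used instead.
missing="$(missing_tools go docker kubectl kustomize kubebuilder controller-gen)"
if [ -n "$missing" ]; then
  printf 'Missing tools:\n%s\n' "$missing" >&2
fi
```

The script only reports; it does not exit with an error, so it can be sourced from other setup scripts without stopping them.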
The full experiment stack will also need the NVIDIA device plugin, Volcano, Prometheus, Grafana, DCGM Exporter, Nsight Compute, and CUDA workload images.
Local Check
git clone git@github.com:chirichexe/kratos.git
cd kratos
make test
The make test target prepares envtest binaries before running controller
tests. If the default Go build cache is not writable, override it:
GOCACHE=/tmp/kratos-go-build-cache make test
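If the override is needed only on some machines, the choice can be scripted. The sketch below uses a hypothetical helper, pick_gocache, that keeps the default cache directory when it is an existing writable directory and otherwise falls back to the path used above.

```shell
#!/bin/sh
# Sketch: choose a Go build cache directory that is writable.
# pick_gocache is a hypothetical helper, not part of the KRATOS Makefile.

# Prints $1 if it is an existing, writable directory; otherwise prints $2.
# A default cache directory that has not been created yet also triggers the fallback.
pick_gocache() {
  if [ -d "$1" ] && [ -w "$1" ]; then
    printf '%s\n' "$1"
  else
    printf '%s\n' "$2"
  fi
}

# Usage, assuming go is installed and this runs from the repository root:
# GOCACHE="$(pick_gocache "$(go env GOCACHE)" /tmp/kratos-go-build-cache)" make test
```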
Run Locally
Install the CUDAExperiment CRD into the current Kubernetes context:
make install
Run the controller from your host:
make run
Apply the sample custom resource from another terminal:
kubectl apply -f config/samples/gpu_v1alpha1_cudaexperiment.yaml
kubectl get cudaexperiments.gpu.scheduler.io
Submit a CUDA Experiment
A CUDAExperiment describes the CUDA image to run and the GPU resources needed
by each Job pod. The minimal fields are:
apiVersion: gpu.scheduler.io/v1alpha1
kind: CUDAExperiment
metadata:
  name: cuda-vector-add
spec:
  image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
  runtimeClassName: nvidia
  replicas: 1
  gpuRequired: 1
runtimeClassName defaults to nvidia. The local GPU Kind setup requires this
value so that the pod starts through the NVIDIA runtime handler. Set command
and arguments only when the image entrypoint is not enough for your workload.
Submit and inspect an experiment:
kubectl apply -f config/samples/gpu_v1alpha1_cudaexperiment.yaml
kubectl get cudaexperiment cuda-vector-add -o yaml
kubectl get job cuda-vector-add-execution
kubectl get pods -l gpu.scheduler.io/experiment=cuda-vector-add
kubectl logs job/cuda-vector-add-execution
The controller creates a Job named <experiment-name>-execution and records
that name in status.executionJobName. To rerun the same experiment name,
delete the generated Job or create a new CUDAExperiment with a different
name.
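The naming rule and the rerun step above can be captured in two small shell helpers. This is a sketch under stated assumptions: execution_job_name and rerun_experiment are hypothetical names, the commands act on the namespace of the current kubectl context, and the rerun relies on the controller recreating the Job after deletion, as the text above suggests.

```shell
#!/bin/sh
# Sketch: helpers around the generated Job name. Both function names are
# hypothetical and not part of the KRATOS repository.

# Prints the Job name described above: <experiment-name>-execution.
execution_job_name() {
  printf '%s-execution\n' "$1"
}

# Deletes the generated Job for an experiment so the experiment can run again.
# Assumes the controller recreates the Job once it has been deleted.
rerun_experiment() {
  kubectl delete job "$(execution_job_name "$1")" --ignore-not-found
}

# Usage, assuming a cluster with the CRD installed and the controller running:
# rerun_experiment cuda-vector-add
# kubectl get cudaexperiment cuda-vector-add -o jsonpath='{.status.executionJobName}'
```

Reading status.executionJobName from the resource, as in the last usage line, is the safer way to find the Job if the naming rule ever changes.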
Kubebuilder Setup
Use this sequence when regenerating the scaffold from a fresh operator setup:
kubebuilder init --domain scheduler.io --repo github.com/chirichexe/kratos
kubebuilder create api --group gpu --version v1alpha1 --kind CUDAExperiment --resource --controller
make manifests
make generate
For a local GPU-enabled lab, see Local GPU Lab.