GPU-accelerated Sandbox Support

The hypothesis is that certain agents need access to NVIDIA GPU hardware (and the associated drivers) from within a sandbox. The exact requirements may depend on the task that the agent is performing. Examples of tasks that require GPU access include developing CUDA kernels and fine-tuning a model.

With the current architecture, GPU support is provided by the compute driver, and part of this task is mapping the options exposed for sandbox creation to the relevant set of driver options. The GPU-specific options include:

* When starting a sandbox with `--gpu`, a driver-specific default GPU configuration is applied.
* When starting a sandbox with one or more `--gpu-device=ID` flags, the specified (driver-specific) device IDs are made available to the sandbox. The driver determines whether multiple devices can be specified, and whether specific device IDs are supported.
* When starting a sandbox with a `--gpu-count=N` flag, the driver selects N GPUs for injection. If this is not possible (e.g. because fewer than N GPUs are available), an error is returned.
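
The mapping from these flags to a concrete set of devices could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `GpuRequest`, `GpuRequestError`, and `resolve_gpus` are hypothetical names, and the precedence between flags, the validation of unknown IDs, and the single-GPU default are all assumptions rather than documented behaviour.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


class GpuRequestError(Exception):
    """Raised when a GPU request cannot be satisfied (hypothetical type)."""


@dataclass
class GpuRequest:
    # Hypothetical in-memory form of the sandbox-creation flags.
    default: bool = False                                # --gpu
    device_ids: List[str] = field(default_factory=list)  # --gpu-device=ID (repeatable)
    count: Optional[int] = None                          # --gpu-count=N


def resolve_gpus(request: GpuRequest, available: Sequence[str]) -> List[str]:
    """Return the device IDs to inject, or raise GpuRequestError.

    Assumed precedence: explicit device IDs, then a count, then the default.
    """
    if request.device_ids:
        # Assumption: IDs the driver does not know about are rejected.
        unknown = [d for d in request.device_ids if d not in available]
        if unknown:
            raise GpuRequestError(f"unknown GPU device IDs: {unknown}")
        return list(request.device_ids)
    if request.count is not None:
        if request.count < 1:
            raise GpuRequestError("--gpu-count must be at least 1")
        if request.count > len(available):
            raise GpuRequestError(
                f"requested {request.count} GPUs, only {len(available)} available"
            )
        return list(available[: request.count])
    if request.default:
        # Assumption: the driver default is a single GPU.
        if not available:
            raise GpuRequestError("no GPUs available")
        return [available[0]]
    # No GPU flag was given, so nothing is injected.
    return []
```

For example, `resolve_gpus(GpuRequest(count=2), ["0", "1", "2"])` selects the first two devices, while `GpuRequest(count=3)` against a single available device raises `GpuRequestError`, matching the error case described for `--gpu-count=N`.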

The drivers that are considered in-scope for this work (i.e. that support NVIDIA GPU requests) are the docker, podman, kubernetes, and vm drivers.

The broad tasks for this are as follows:

* Ensure that the driver-specific GPU request behaviour is to select a single (free) GPU to inject into the sandbox. This is already implemented in the vm driver and, by definition, in the kubernetes driver. The podman and docker drivers would need to be updated to also select a free GPU instead of defaulting to `nvidia.com/gpu=all`.
  Open question: How should multiple requests for the same resources be handled? Should exclusivity be considered a driver property, or should a user be able to specify this when creating a sandbox?
* Document (through tests) GPU support in the vm and docker drivers.
* Ensure that the nvidia-smi tests can be run against both drivers.
* Define basic workload tests that test other components of the NVIDIA GPU drivers (e.g. basic compute operations).
* Ensure that the tests defined above can be run against all GPU-enabled drivers.
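
The first task above, selecting a single free GPU instead of `nvidia.com/gpu=all`, could look roughly like the sketch below for the docker and podman drivers. The function names are hypothetical, and two things are assumed: that the driver itself tracks which GPU indices are already assigned to sandboxes (the `in_use` set), and that the CDI specification on the host names devices by index. How exclusivity should actually be enforced is the open question noted above, so this sketch simply treats every assigned GPU as unavailable.

```python
from typing import Iterable, Optional, Set


def pick_free_gpu(all_indices: Iterable[int], in_use: Set[int]) -> Optional[int]:
    """Return the lowest-numbered GPU index not in in_use, or None if all are taken."""
    free = sorted(i for i in all_indices if i not in in_use)
    return free[0] if free else None


def cdi_device_name(index: int) -> str:
    """Format a CDI device name for one GPU (assumes index-based CDI names)."""
    return f"nvidia.com/gpu={index}"


def select_device(all_indices: Iterable[int], in_use: Set[int]) -> str:
    """Pick one free GPU and return its CDI name, or raise if none is free."""
    index = pick_free_gpu(all_indices, in_use)
    if index is None:
        raise RuntimeError("no free GPU available")
    return cdi_device_name(index)
```

With GPUs 0 to 3 present and 0 and 1 already assigned, `select_device(range(4), {0, 1})` returns `nvidia.com/gpu=2`, which the driver could pass as the device request in place of `nvidia.com/gpu=all`.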

Tasks that are not specific to the implementation include:
* Define a more robust specification of resource requirements so as to allow for driver selection and driver-specific configuration metadata.
* Identify compelling use cases for GPU sandboxes (to roll into blog posts).

GPU-accelerated Sandbox Support #1444