Checkpointless training on Amazon SageMaker HyperPod eliminates disruptive checkpoint-restart cycles, maintaining forward training momentum despite failures and reducing recovery time from hours to minutes.
- In-Process Recovery: Recover from node failures in minutes without losing training progress by using redundant model copies stored in GPU memory
- Fast Initialization: Accelerate training restarts by bypassing expensive communication (NCCL/Gloo) setup processes
- Smart Data Caching: Pre-load and cache training data batches to eliminate delays when resuming training after failures
- Built-in Redundancy: Leverage distributed optimizer instances for checkpointless recovery
- NeMo Integration: Works seamlessly with PyTorch Lightning and NVIDIA NeMo toolkit for large language model training
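The in-process recovery idea above can be sketched in miniature: after a failure, training restores from a redundant in-memory copy instead of reloading a checkpoint from storage. The sketch below is a toy illustration only, not the HyperPod implementation; `TrainState`, `run_step`, and the simulated failure are all hypothetical stand-ins.

```python
# Toy sketch of in-process recovery (NOT the HyperPod implementation):
# the redundant copy here is a deepcopy in local memory, standing in for
# a model replica held in a peer rank's GPU memory.
from copy import deepcopy

class TrainState:
    """Hypothetical in-memory training state (model + optimizer stand-in)."""
    def __init__(self):
        self.step = 0
        self.weights = [0.0]

def run_step(state):
    """Hypothetical training step: advance the step counter and weights."""
    state.step += 1
    state.weights[0] += 0.1

def train_with_inprocess_recovery(total_steps, fail_at=None):
    state = TrainState()
    replica = deepcopy(state)           # stand-in for the redundant copy
    while state.step < total_steps:
        try:
            if fail_at is not None and state.step == fail_at:
                fail_at = None          # fail only once
                raise RuntimeError("simulated node failure")
            run_step(state)
            replica = deepcopy(state)   # keep the redundant copy current
        except RuntimeError:
            # In-process recovery: restore from the in-memory replica
            # rather than restarting from a checkpoint on disk.
            state = deepcopy(replica)
    return state

final = train_with_inprocess_recovery(total_steps=5, fail_at=3)
print(final.step)  # 5 -- training reached the target step despite the failure
```

The point of the sketch is the recovery path: the `except` branch resumes from live memory, so no progress since the last on-disk checkpoint is lost.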
| Model | Method | Size | Nodes | Instance | Accelerator | Recipe | Script |
|---|---|---|---|---|---|---|---|
| GPT OSS | Full fine-tune example | 120B | 16 | p5.48xlarge | H100 GPU | link | link |
| GPT OSS | LoRA example | 120B | 2 | p5.48xlarge | H100 GPU | link | link |
| Llama3 | Pretrain example | 70B | 16 | p5.48xlarge | H100 GPU | link | link |
| Llama3 | LoRA example | 70B | 2 | p5.48xlarge | H100 GPU | link | link |
For comprehensive documentation, including installation steps, environment setup, configuration options, and detailed usage examples, see the tutorials at Amazon SageMaker HyperPod Checkpointless training.
You can use the SageMaker HyperPod recipes to submit your training job. Using a recipe involves updating k8s.yaml and config.yaml, then running the launch script.
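The exact schema of config.yaml is defined by the recipe you choose; the fragment below is purely illustrative, and every key name is a placeholder assumption rather than the real recipe schema. It only conveys the kind of values (cluster size, instance type, shared-storage path) you would typically override:

```yaml
# Hypothetical overrides -- key names are placeholders, not the actual
# recipe schema; see the HyperPod recipes documentation for real fields.
cluster_type: k8s
instance_type: p5.48xlarge      # matches the instances in the table above
num_nodes: 16                   # e.g. 16 nodes for the 120B full fine-tune
shared_storage_path: /fsx/data  # FSx/NFS mount used for training data
```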
```
bash launcher_scripts/gpt_oss/run_checkpointless_nemo_gpt_oss_120b_fine_tuning.sh
```

Alternatively, you can deploy the training job directly using kubectl:

```
kubectl apply -f <path_to_config>.yaml
kubectl get pods
kubectl logs <pod-name>
```
| Component | Version |
|---|---|
| Python | >=3.12 |
| PyTorch | >=2.6.0 |
| NeMo Toolkit | 2.6.0rc0 |
| CUDA | >=12.5 |
| Infrastructure | AWS HyperPod Kubernetes cluster |
| Storage | Shared storage (FSx/NFS) |
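As a quick sanity check, the version floors in the table above can be compared against installed versions. The helper below is illustrative and not part of the HyperPod tooling; the `REQUIREMENTS` mapping simply restates the table's minimums.

```python
# Illustrative version-floor check (not part of HyperPod tooling).

def parse_version(v):
    """Parse a dotted version string like '2.6.0' into a comparable tuple.

    Non-numeric suffixes in a component (e.g. '0rc0') are reduced to their
    digits so pre-release tags do not break the comparison.
    """
    parts = []
    for piece in v.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

REQUIREMENTS = {       # minimum versions from the table above
    "python": "3.12",
    "torch": "2.6.0",
    "cuda": "12.5",
}

def meets_requirement(installed, minimum):
    """Return True if the installed version satisfies the minimum."""
    return parse_version(installed) >= parse_version(minimum)

print(meets_requirement("2.7.1", REQUIREMENTS["torch"]))  # True
print(meets_requirement("12.4", REQUIREMENTS["cuda"]))    # False
```

Note this simplistic tuple comparison ignores full PEP 440 semantics (e.g. it treats `2.6.0rc0` the same as `2.6.0`); for production checks, use `packaging.version.Version` instead.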
See CONTRIBUTING for more information. Note: This repository is temporarily not accepting pull requests.
This project is licensed under the Apache-2.0 License.