Checkpointless training on Amazon SageMaker HyperPod eliminates disruptive checkpoint-restart cycles, maintaining forward training momentum despite failures and reducing recovery time from hours to minutes.
- In-Process Recovery: Recover from node failures in minutes without losing training progress by using redundant model copies stored in GPU memory
- Fast Initialization: Accelerate training restarts by bypassing expensive communication (NCCL/Gloo) setup processes
- Smart Data Caching: Pre-load and cache training data batches to eliminate delays when resuming training after failures
- Built-in Redundancy: Leverage distributed optimizer instances for checkpointless recovery
- NeMo Integration: Works seamlessly with PyTorch Lightning and NVIDIA NeMo toolkit for large language model training
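The in-process recovery idea above can be sketched in miniature: after a failure, training restores from a redundant in-memory copy instead of reloading a checkpoint from storage. The sketch below is a toy illustration only, not the HyperPod implementation; `TrainState`, `run_step`, and the simulated failure are all hypothetical stand-ins.

```python
# Toy sketch of in-process recovery (NOT the HyperPod implementation):
# the redundant copy here is a deepcopy in local memory, standing in for
# a model replica held in a peer rank's GPU memory.
from copy import deepcopy

class TrainState:
    """Hypothetical in-memory training state (model + optimizer stand-in)."""
    def __init__(self):
        self.step = 0
        self.weights = [0.0]

def run_step(state):
    """Hypothetical training step: advance the step counter and weights."""
    state.step += 1
    state.weights[0] += 0.1

def train_with_inprocess_recovery(total_steps, fail_at=None):
    state = TrainState()
    replica = deepcopy(state)           # stand-in for the redundant copy
    while state.step < total_steps:
        try:
            if fail_at is not None and state.step == fail_at:
                fail_at = None          # fail only once
                raise RuntimeError("simulated node failure")
            run_step(state)
            replica = deepcopy(state)   # keep the redundant copy current
        except RuntimeError:
            # In-process recovery: restore from the in-memory replica
            # rather than restarting from a checkpoint on disk.
            state = deepcopy(replica)
    return state

final = train_with_inprocess_recovery(total_steps=5, fail_at=3)
print(final.step)  # 5 -- training reached the target step despite the failure
```

The point of the sketch is the recovery path: the `except` branch resumes from live memory, so no progress since the last on-disk checkpoint is lost.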
| Model | Method | Size | Nodes | Instance | Accelerator | Recipe | Script |
|---|---|---|---|---|---|---|---|
| GPT OSS | Full fine-tune example | 120B | 16 | p5.48xlarge | H100 GPU | link | link |
| GPT OSS | LoRA example | 120B | 2 | p5.48xlarge | H100 GPU | link | link |
| Llama3 | Pretrain example | 70B | 16 | p5.48xlarge | H100 GPU | link | link |
| Llama3 | LoRA example | 70B | 2 | p5.48xlarge | H100 GPU | link | link |
For comprehensive documentation, including installation steps, environment setup, configuration options, and detailed usage examples, see the tutorials at Amazon SageMaker HyperPod Checkpointless training.
You can use the SageMaker HyperPod recipes to submit your training job. Using a recipe involves updating k8s.yaml and config.yaml, then running the launch script.
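The exact schema of config.yaml is defined by the recipe you choose; the fragment below is purely illustrative, and every key name is a placeholder assumption rather than the real recipe schema. It only conveys the kind of values (cluster size, instance type, shared-storage path) you would typically override:

```yaml
# Hypothetical overrides -- key names are placeholders, not the actual
# recipe schema; see the HyperPod recipes documentation for real fields.
cluster_type: k8s
instance_type: p5.48xlarge      # matches the instances in the table above
num_nodes: 16                   # e.g. 16 nodes for the 120B full fine-tune
shared_storage_path: /fsx/data  # FSx/NFS mount used for training data
```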
```
bash launcher_scripts/gpt_oss/run_checkpointless_nemo_gpt_oss_120b_fine_tuning.sh
```

Alternatively, you can deploy the training job directly using kubectl:

```
kubectl apply -f <path_to_config>.yaml
kubectl get pods
kubectl logs <pod-name>
```
| Component | Version |
|---|---|
| Python | >=3.12 |
| PyTorch | >=2.6.0 |
| NeMo Toolkit | 2.6.0rc0 |
| CUDA | >=12.5 |
| Infrastructure | AWS HyperPod Kubernetes cluster |
| Storage | Shared storage (FSx/NFS) |
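As a quick sanity check, the version floors in the table above can be compared against installed versions. The helper below is illustrative and not part of the HyperPod tooling; the `REQUIREMENTS` mapping simply restates the table's minimums.

```python
# Illustrative version-floor check (not part of HyperPod tooling).

def parse_version(v):
    """Parse a dotted version string like '2.6.0' into a comparable tuple.

    Non-numeric suffixes in a component (e.g. '0rc0') are reduced to their
    digits so pre-release tags do not break the comparison.
    """
    parts = []
    for piece in v.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

REQUIREMENTS = {       # minimum versions from the table above
    "python": "3.12",
    "torch": "2.6.0",
    "cuda": "12.5",
}

def meets_requirement(installed, minimum):
    """Return True if the installed version satisfies the minimum."""
    return parse_version(installed) >= parse_version(minimum)

print(meets_requirement("2.7.1", REQUIREMENTS["torch"]))  # True
print(meets_requirement("12.4", REQUIREMENTS["cuda"]))    # False
```

Note this simplistic tuple comparison ignores full PEP 440 semantics (e.g. it treats `2.6.0rc0` the same as `2.6.0`); for production checks, use `packaging.version.Version` instead.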
See CONTRIBUTING for more information. Note: This repository is temporarily not accepting pull requests.
This project is licensed under the Apache-2.0 License.