Commit 3583c94

Initial public release
0 parents  commit 3583c94

158 files changed, +33689 -0 lines changed

.crux_dry_run_build

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+AUTOBUILD

.gitignore

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+build
+dist
+*.egg-info
+__pycache__
+cupti_module.*.so
+.pytest_cache
+.coverage*
+/.hatch
+requirements.txt
+*.iml
+/nemo_experiments/*
+/outputs

.pre-commit-config.yaml

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+default_language_version:
+  python: python3
+
+repos:
+
+  - repo: https://github.com/PyCQA/isort
+    rev: 5.13.2
+    hooks:
+      - id: isort
+        exclude: docs/
+
+  - repo: https://github.com/psf/black-pre-commit-mirror
+    rev: 24.10.0
+    hooks:
+      - id: black
+        language_version: python3.10
+
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.6.9
+    hooks:
+      - id: ruff
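
The three hooks configured above match the commands that CONTRIBUTING.md in this commit asks contributors to run by hand (`isort .`, `black .`, `ruff check .`). As a minimal sketch, assuming the tools are installed and on `PATH`, the same checks could be driven from a small script. The names `CHECKS` and `run_checks` are hypothetical and not part of the repository.

```python
import subprocess
from typing import Callable, List, Optional, Sequence

# Commands as listed in CONTRIBUTING.md, in the hook order used by
# .pre-commit-config.yaml (isort, black, ruff).
CHECKS: List[List[str]] = [
    ["isort", "."],
    ["black", "."],
    ["ruff", "check", "."],
]


def run_checks(runner: Optional[Callable[[Sequence[str]], int]] = None) -> List[str]:
    """Run every check and return the names of the tools that exited non-zero.

    `runner` is a hypothetical injection point: it takes a command and
    returns its exit code. By default the command is executed for real.
    """
    if runner is None:
        runner = lambda cmd: subprocess.run(cmd).returncode
    failed = []
    for cmd in CHECKS:
        if runner(cmd) != 0:
            failed.append(cmd[0])
    return failed
```

Running all three tools and collecting the failures, rather than stopping at the first one, mirrors how pre-commit reports every failing hook in a single pass.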

CONTRIBUTING.md

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
+
+## NVIDIA Resiliency Extension (NVRx) OSS Contribution Rules
+
+#### Issue Tracking
+
+* All enhancement, bugfix, or change requests must begin with the creation of an [NVRx Issue Request](TBD).
+* The issue request must be reviewed by NVRx engineers and approved prior to code review.
+
+
+#### Coding Guidelines
+
+- All source code contributions must follow the existing conventions in the relevant file, submodule, module, and project when you add new code or when you extend/fix existing functionality.
+
+- Avoid introducing unnecessary complexity into existing code so that maintainability and readability are preserved.
+
+- Try to keep pull requests (PRs) as concise as possible:
+  - Avoid committing commented-out code.
+  - Wherever possible, each PR should address a single concern. If there are several otherwise-unrelated things that should be fixed to reach a desired endpoint, our recommendation is to open several PRs and indicate the dependencies in the description. The more complex the changes are in a single PR, the more time it will take to review those changes.
+
+- To ensure code consistency and maintainability across the project, please format and lint your code using the following tools before committing any changes:
+  - We use black to automatically format Python code. It enforces a consistent style by reformatting code according to a set of rules.
+  - To format your code, run:
+    ```
+    black .
+    ```
+  - isort is used to sort and format import statements automatically. Ensure that your imports are ordered correctly by running:
+    ```
+    isort .
+    ```
+  - ruff is a fast Python linter that helps catch common issues. Please run ruff to check for and fix linting problems:
+    ```
+    ruff check .
+    ```
+
+- Write commit titles using imperative mood and [these rules](https://chris.beams.io/posts/git-commit/), and reference the Issue number corresponding to the PR. Following is the recommended format for commit texts:
+  ```
+  #<Issue Number> - <Commit Title>
+
+  <Commit Body>
+  ```
+
+- Ensure that the build log is clean, meaning no warnings or errors should be present.
+
+- Ensure that all unit tests pass prior to submitting your code.
+
+- All OSS components must contain accompanying documentation (READMEs) describing the functionality, dependencies, and known issues.
+
+  - See `README.md` for existing samples and plugins for reference.
+
+- All OSS components must have an accompanying test.
+
+  - If introducing a new component, such as a plugin, provide a test sample to verify the functionality.
+
+- Make sure that you can contribute your work to open source (no license and/or patent conflict is introduced by your code). You will need to [`sign`](#signing-your-work) your commit.
+
+- Thanks in advance for your patience as we review your contributions; we do appreciate them!
+
+
+#### Pull Requests
+Developer workflow for code contributions is as follows:
+
+1. Developers must first [fork](https://help.github.com/en/articles/fork-a-repo) the [upstream](TBD) NVRx OSS repository.
+
+2. Git clone the forked repository and push changes to the personal fork.
+
+  ```bash
+git clone https://github.com/YOUR_USERNAME/YOUR_FORK.git NVRx
+# Check out the targeted branch and commit changes
+# Push the commits to a branch on the fork (remote).
+git push -u origin <local-branch>:<remote-branch>
+  ```
+
+3. Once the code changes are staged on the fork and ready for review, a [Pull Request](https://help.github.com/en/articles/about-pull-requests) (PR) can be [requested](https://help.github.com/en/articles/creating-a-pull-request) to merge the changes from a branch of the fork into a selected branch of upstream.
+  * Exercise caution when selecting the source and target branches for the PR.
+    Note that versioned releases of NVRx OSS are posted to `release/` branches of the upstream repo.
+  * Creating a PR kicks off the code review process.
+  * At least one NVRx engineer will be assigned for the review.
+  * While under review, mark your PRs as work-in-progress by prefixing the PR title with [WIP].
+
+4. Since there is no CI/CD process in place yet, the PR will be accepted and the corresponding issue closed only after adequate testing has been completed, manually, by the developer and/or NVRx engineer reviewing the code.
+
+
+#### Signing Your Work
+
+* We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
+
+  * Any contribution which contains commits that are not Signed-Off will not be accepted.
+
+* To sign off on a commit you simply use the `--signoff` (or `-s`) option when committing your changes:
+  ```bash
+  $ git commit -s -m "Add cool feature."
+  ```
+  This will append the following to your commit message:
+  ```
+  Signed-off-by: Your Name <your@email.com>
+  ```
+
+* Full text of the DCO:
+
+  ```
+    Developer Certificate of Origin
+    Version 1.1
+
+    Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
+    1 Letterman Drive
+    Suite D4700
+    San Francisco, CA, 94129
+
+    Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
+  ```
+
+  ```
+    Developer's Certificate of Origin 1.1
+
+    By making a contribution to this project, I certify that:
+
+    (a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or
+
+    (b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or
+
+    (c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it.
+
+    (d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.
+  ```
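
CONTRIBUTING.md asks for commit messages of the form `#<Issue Number> - <Commit Title>`, a blank line, a body, and a `Signed-off-by:` trailer. The following is a minimal sketch of how a contributor might check a message locally before pushing. The function name `check_commit_message` and both regular expressions are hypothetical, and the patterns are deliberately loose. Nothing in this commit enforces these rules automatically.

```python
import re
from typing import List

# "#<Issue Number> - <Commit Title>" as recommended in CONTRIBUTING.md.
TITLE_RE = re.compile(r"^#(\d+) - (\S.*)$")
# Trailer appended by `git commit -s`; the e-mail pattern is intentionally loose.
SIGNOFF_RE = re.compile(r"^Signed-off-by: .+ <[^<>\s]+@[^<>\s]+>$", re.MULTILINE)


def check_commit_message(message: str) -> List[str]:
    """Return a list of problems found in `message` (empty if none)."""
    problems = []
    lines = message.splitlines()
    title = lines[0] if lines else ""
    if not TITLE_RE.match(title):
        problems.append("title does not match '#<Issue Number> - <Commit Title>'")
    if len(lines) > 1 and lines[1].strip():
        problems.append("title must be followed by a blank line")
    if not SIGNOFF_RE.search(message):
        problems.append("missing 'Signed-off-by:' trailer (use `git commit -s`)")
    return problems
```

A message such as `"#12 - Add cool feature"` followed by a blank line, a body, and the sign-off trailer passes all three checks, while a bare `"Add cool feature."` is reported for both the title format and the missing trailer.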

DEVELOPING.md

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+# Developing HyperPodEnginesFaultResiliencyNCCL
+
+Put your notes on developing the package here.
+
+### Using Peru:
+
+* https://builderhub.corp.amazon.com/docs/brazil/user-guide/getting-started-peru.html

LICENSE.txt

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

Makefile

Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,31 @@
1+
.PHONY: all run local build test install
2+
3+
script = ${PWD}/examples/inprocess/sagemaker/hp_basic_example.py
4+
pp_script = ${PWD}/examples/inprocess/sagemaker/hp_pp_example.py
5+
sqsh ?= ${PWD}/hyperpod-checkpointless-training+latest.sqsh
6+
7+
all: run
8+
9+
run:
10+
${PWD}/examples/inprocess/sagemaker/hprun.sh \
11+
-i $(sqsh) \
12+
$(script) \
13+
--device cuda \
14+
--log-interval 5 \
15+
--path ${PWD} \
16+
--fault-prob 0.001 \
17+
--size 256 \
18+
--layers 16 \
19+
--total-iterations 10000
20+
21+
local:
22+
hprun --nproc_per_node=8 $(script) --device cuda --log-interval 5 --total-iterations 1000
23+
24+
build:
25+
${PWD}/examples/inprocess/sagemaker/build.sh -n hyperpod-checkpointless-training -f ${PWD}/examples/inprocess/sagemaker/Dockerfile
26+
27+
test:
28+
HPWRAPPER_LOG_LEVEL=debug pytest tests/inprocess/sagemaker/ -s
29+
30+
install:
31+
pip install --force-reinstall --no-deps .

README.md

Lines changed: 54 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,54 @@
1+
# NVIDIA Resiliency Extension
2+
3+
The NVIDIA Resiliency Extension (NVRx) integrates multiple resiliency-focused solutions for PyTorch-based workloads.
4+
5+
## Core Components and Capabilities
6+
7+
- **Fault Tolerance**
8+
- Detection of hung ranks.
9+
- Restarting training in-job, without the need to reallocate SLURM nodes.
10+
11+
- **In-Process Restarting**
12+
- Detecting failures and enabling quick recovery.
13+
14+
- **Async Checkpointing**
15+
- Providing an efficient framework for asynchronous checkpointing.
16+
17+
- **Local Checkpointing**
18+
- Providing an efficient framework for local checkpointing.
19+
20+
- **Straggler Detection**
21+
- Monitoring GPU and CPU performance of ranks.
22+
- Identifying slower ranks that may impede overall training efficiency.
23+
24+
- **PyTorch Lightning Callbacks**
25+
- Facilitating seamless NVRx integration with PyTorch Lightning.
26+
27+
## Installation
28+
29+
### From sources
30+
- `git clone https://github.com/NVIDIA/nvidia-resiliency-ext`
31+
- `cd nvidia-resiliency-ext`
32+
- `pip install .`
33+
34+
35+
### From PyPI wheel
36+
- `pip install nvidia-resiliency-ext`
37+
38+
### Platform Support
39+
40+
| Category | Supported Versions / Requirements |
41+
|---------------------|----------------------------------------------|
42+
| Architecture | x86_64 |
43+
| Operating System | Ubuntu 22.04 |
44+
| Python Version | >= 3.10, < 3.13 |
45+
| PyTorch Version | 2.3+ |
46+
| CUDA & CUDA Toolkit | 12.5+ |
47+
| NVML Driver | 550 or later |
48+
| NCCL Version | 2.21.5+ |
49+
50+
**Note**: The package is designed to support Python >= 3.10, CUDA >= 11.8, PyTorch >= 2.0 and Ubuntu 20.04, but the recommended and tested environment for production is Python >= 3.10, < 3.13, CUDA 12.5+, and Ubuntu 22.04.
51+
52+
## Usage
53+
54+
For detailed documentation and usage information about each component, please refer to the https://nvidia.github.io/nvidia-resiliency-ext/.
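
The README's Platform Support table lists the Python requirement as `>= 3.10, < 3.13`. As a minimal sketch, an environment could be checked against that range before installing. The helper `python_supported` and the two constants are hypothetical names for illustration, and the package itself is not shown to perform this check.

```python
import sys

# Bounds taken from the "Python Version" row of the README table.
MIN_PYTHON = (3, 10)            # inclusive lower bound
MAX_PYTHON_EXCLUSIVE = (3, 13)  # exclusive upper bound


def python_supported(version=None) -> bool:
    """Return True if (major, minor) falls inside the tested range.

    `version` defaults to the running interpreter; any sequence whose
    first two items are major and minor works, e.g. (3, 11, 4).
    """
    if version is None:
        version = sys.version_info
    return MIN_PYTHON <= tuple(version[:2]) < MAX_PYTHON_EXCLUSIVE


if __name__ == "__main__":
    print("supported" if python_supported() else "outside the tested range")
```

Comparing `(major, minor)` tuples avoids the string-comparison pitfall where `"3.9"` sorts after `"3.10"`.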

SECURITY.md

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+# Security
+
+NVIDIA is dedicated to the security and trust of our software products and services, including all source code repositories managed through our organization.
+
+If you need to report a security issue, please use the appropriate contact points outlined below. **Please do not report security vulnerabilities through GitHub.**
+
+## Reporting Potential Security Vulnerability in an NVIDIA Product
+
+To report a potential security vulnerability in any NVIDIA product:
+- Web: [Security Vulnerability Submission Form](https://www.nvidia.com/object/submit-security-vulnerability.html)
+
+  - We encourage you to use the following PGP key for secure email communication: [NVIDIA public PGP Key for communication](https://www.nvidia.com/en-us/security/pgp-key)
+- Please include the following information:
+  - Product/Driver name and version/branch that contains the vulnerability
+  - Type of vulnerability (code execution, denial of service, buffer overflow, etc.)
+  - Instructions to reproduce the vulnerability
+  - Proof-of-concept or exploit code
+  - Potential impact of the vulnerability, including how an attacker could exploit the vulnerability
+
+While NVIDIA currently does not have a bug bounty program, we do offer acknowledgement when an externally reported security issue is addressed under our coordinated vulnerability disclosure policy. Please visit our [Product Security Incident Response Team (PSIRT)](https://www.nvidia.com/en-us/security/psirt-policies/) policies page for more information.
+
+## NVIDIA Product Security
+
+For all security-related concerns, please visit NVIDIA's Product Security portal at https://www.nvidia.com/en-us/security

brazil.ion

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+
+
+common::{
+  name: "FaradayFaultController",
+  major_version: "1.0",
+
+  dependencies: {
+    default_closure: run,
+
+    closures: {
+      run: public::{
+        include: [self],
+      },
+    },
+
+    build_after: [
+    ]
+  },
+  build: {
+    command: "peru-hatch",
+
+    env: {
+      PATH: [
+        (farm "PeruHatch" "bin"),
+        (env PATH),
+      ],
+    },
+
+    sequences: [
+      {
+        name: 'hyperpod_engines_faraday_sequence',
+        strategy: commit,
+      },
+    ],
+
+    outputs: {
+      public_dir: "./build",
+      private_dir: "./private",
+      cleaned: [],
+    }
+  },
+}
