
[PyT][Test] Add xfailing FSDP2 memory leak detection tests #2803

Merged
vthumbe1503 merged 3 commits into NVIDIA:main from pstjohn:pstjohn/fsdp2-mem-leak-tests
Apr 3, 2026

Conversation

@pstjohn (Contributor) commented Mar 25, 2026

Summary

Issue #2681: FP8 weight copy accumulation during forward

FP8 weight copies created by te.autocast() accumulate across layers (~0.68 MiB/layer excess over bf16 baseline). Detected for all 5 recipes with no_quant_init.

Issue #2717: Transpose cache retained after backward

_create_transpose tensors persist after backward until the next forward frees them (~3 MiB excess over bf16). Detected for DelayedScaling and Float8CurrentScaling with quant_init.
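Both detections reduce to simple arithmetic on memory snapshots. A minimal pure-Python sketch (the helper names and the example numbers below are illustrative, not the exact code in the test file):

```python
KiB = 1024

def per_layer_excess(fp8_increments, bf16_increments):
    """Issue #2681 check: average per-layer forward increment, FP8 minus bf16."""
    fp8_avg = sum(fp8_increments) / len(fp8_increments)
    bf16_avg = sum(bf16_increments) / len(bf16_increments)
    return fp8_avg - bf16_avg

def backward_excess(fp8_post_fwd, fp8_post_bwd, bf16_post_fwd, bf16_post_bwd):
    """Issue #2717 check: backward delta (post_bwd - post_fwd), FP8 minus bf16."""
    return (fp8_post_bwd - fp8_post_fwd) - (bf16_post_bwd - bf16_post_fwd)

# Hypothetical numbers in the ballpark of the issue descriptions
# (~0.68 MiB/layer forward excess over a bf16 baseline):
fwd_excess = per_layer_excess([4.68 * 1024 * KiB] * 8, [4.0 * 1024 * KiB] * 8)
assert fwd_excess > 50 * KiB  # would trip a 50 KiB/layer threshold
```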

New tests (in run_fsdp2_mem_leak.py)

| Test | Type | What it checks |
| --- | --- | --- |
| test_bf16_no_excess_forward_memory | control (PASS) | bf16 per-layer increments are uniform |
| test_bf16_no_excess_backward_memory | control (PASS) | bf16 vs bf16 backward delta shows zero excess |
| test_fp8_temp_accumulation_across_layers | xfail | FP8 per-layer forward increment exceeds bf16 |
| test_transpose_cache_retained_after_backward | xfail | FP8 backward delta exceeds bf16 baseline |
All FP8 tests parametrized over 5 recipes × {no_quant_init, quant_init}.
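The parametrization grid can be sketched as follows. Only DelayedScaling and Float8CurrentScaling are named in this PR, so the other recipe identifiers below are placeholders, and the fixture wiring is a sketch rather than the file's exact decorators (the source confirms `strict=False` on the xfail marks):

```python
import itertools

import pytest

# Only the first two recipe names appear in the PR text; the rest are placeholders.
RECIPES = [
    "delayed_scaling",
    "float8_current_scaling",
    "recipe_3",  # placeholder
    "recipe_4",  # placeholder
    "recipe_5",  # placeholder
]
INIT_MODES = ["no_quant_init", "quant_init"]

# 5 recipes x 2 init modes -> 10 combinations per FP8 test
PARAM_GRID = list(itertools.product(RECIPES, INIT_MODES))

@pytest.mark.xfail(strict=False, reason="Issue #2681: FP8 temporaries accumulate")
@pytest.mark.parametrize("recipe_name,quant_init", PARAM_GRID)
def test_fp8_temp_accumulation_across_layers(recipe_name, quant_init):
    ...  # measure per-layer forward increments and compare against bf16 baseline
```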

Test plan

  • pytest tests/pytorch/distributed/test_torch_fsdp2.py — all 4 outer tests pass (including existing model and fused_adam tests)
  • bf16 control tests PASS
  • FP8 accumulation tests XFAIL for affected configurations
  • Pre-commit hooks pass

greptile-apps bot (Contributor) commented Mar 25, 2026

Greptile Summary

This PR adds a new test file run_fsdp2_mem_leak.py to the existing FSDP2 test suite, containing four tests that document two known memory issues with FSDP2 + FP8 (issues #2681 and #2717). Two control tests validate the measurement methodology using pure bf16 models, while two xfail tests detect FP8 temporary tensor accumulation during forward and transpose-cache retention after backward, parametrized over five FP8 recipes × two init modes. The outer test_torch_fsdp2.py adds a launcher test following the existing subprocess-via-torchrun pattern.

  • The three issues flagged in prior review threads (standalone-runner argument passing for control tests, "4-layer" stale comment, and the unused MEASURED_STEPS constant) are all fully resolved in this version.
  • The recipe_name fixture is correctly provided by the shared conftest.py in the fsdp2_tests directory, so pytest parametrization will work as expected.
  • The _PARAMETRIZED_TESTS guard in the __main__ block correctly dispatches recipe/init arguments only to the two FP8 tests and calls the two control tests with no arguments.
  • The tolerance formula 0.1 * abs(avg_increment) + 1024 in the bf16 control test may fire unexpectedly if avg_increment is very small (e.g., near zero), leaving only 1 KiB of slack — this is a minor fragility worth watching but unlikely to cause CI instability in practice.
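The degenerate case flagged in the last bullet is easy to see numerically. A minimal sketch of the tolerance formula (the function name is ours, not the test file's):

```python
def bf16_forward_tolerance(avg_increment: float) -> float:
    """Tolerance from the review note: 10% of the average increment plus a 1 KiB floor."""
    return 0.1 * abs(avg_increment) + 1024

# With a healthy per-layer average (hypothetical 4 MiB), slack is generous:
assert bf16_forward_tolerance(4 * 1024 * 1024) > 400 * 1024  # > 400 KiB

# With a near-zero average, only the 1 KiB floor remains:
assert bf16_forward_tolerance(0.0) == 1024
```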

Confidence Score: 5/5

Safe to merge — all three previously raised concerns are resolved and no new P0/P1 issues found.

All prior review issues (standalone runner crash, stale layer-count comment, unused MEASURED_STEPS) are addressed. The only remaining note is a P2 style suggestion about the tolerance floor in the bf16 control test. Code follows existing patterns, xfail decoration is correct with strict=False, and the conftest.py fixture wiring is sound.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/pytorch/distributed/fsdp2_tests/run_fsdp2_mem_leak.py | New test file implementing four memory-leak detection tests; all three prior review issues are resolved. Minor fragility: the bf16 forward-control tolerance degenerates to just 1 KiB when per-layer activation memory averages near zero. |
| tests/pytorch/distributed/test_torch_fsdp2.py | Adds test_fsdp2_mem_leak_tests() following the identical subprocess-via-torchrun pattern used by the other two outer tests; skip conditions and returncode assertion are correct. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[test_fsdp2_mem_leak_tests\nouter pytest runner] -->|torchrun -m pytest| B[run_fsdp2_mem_leak.py]

    B --> C[test_bf16_no_excess_forward_memory\ncontrol - PASS]
    B --> D[test_bf16_no_excess_backward_memory\ncontrol - PASS]
    B --> E[test_fp8_temp_accumulation_across_layers\nxfail - Issue 2681]
    B --> F[test_transpose_cache_retained_after_backward\nxfail - Issue 2717]

    C --> G[_LayerMemoryTracker hooks\nper-layer forward increments]
    G --> H{max_deviation ≤\n10% avg + 1KiB?}
    H -->|yes| I[PASS]

    D --> J[_measure_backward_memory_delta\nbf16 vs bf16]
    J --> K{abs excess ≤\n256 KiB?}
    K -->|yes| I

    E --> L[bf16 baseline\n_measure_forward_increments]
    E --> M[FP8 model\n_measure_forward_increments]
    L & M --> N{fp8_avg - bf16_avg ≤\n50 KiB/layer?}
    N -->|no - xfail expected| O[XFAIL]

    F --> P[bf16 baseline\n_measure_backward_memory_delta]
    F --> Q[FP8 model\n_measure_backward_memory_delta]
    P & Q --> R{fp8_delta - bf16_delta ≤\n256 KiB?}
    R -->|no - xfail expected| O
```


Add tests that demonstrate two known memory issues with FSDP2 + FP8:

- Issue NVIDIA#2681: FP8 weight copies created during te.autocast() forward pass
  accumulate across layers instead of being freed between layers, defeating
  FSDP2's memory efficiency. Detected by comparing per-layer forward memory
  increments against a bf16 baseline using layer hooks.

- Issue NVIDIA#2717: Transpose cache tensors (_create_transpose) allocated during
  backward persist until the next forward pass instead of being freed after
  backward completes. Detected by comparing the backward memory delta
  (post_bwd - post_fwd) against a bf16 baseline.

New tests:
- test_bf16_no_excess_forward_memory: control, validates per-layer measurement
- test_bf16_no_excess_backward_memory: control, validates backward delta comparison
- test_fp8_temp_accumulation_across_layers: xfail, detects NVIDIA#2681
- test_transpose_cache_retained_after_backward: xfail, detects NVIDIA#2717

All parametrized over 5 FP8 recipes x {no_quant_init, quant_init}.

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
@pstjohn pstjohn force-pushed the pstjohn/fsdp2-mem-leak-tests branch from 29cd628 to 27a505f on March 30, 2026 19:36
vthumbe1503 (Collaborator) commented Apr 2, 2026

LGTM.

vthumbe1503 (Collaborator) commented:

/te-ci L1 pytorch

@vthumbe1503 vthumbe1503 (Collaborator) left a comment:


CI is green. Changes LGTM. Hopefully this PR fixes the xfailing tests.

@vthumbe1503 vthumbe1503 merged commit 8cf3c16 into NVIDIA:main Apr 3, 2026
10 of 12 checks passed
KshitijLakhani pushed a commit that referenced this pull request Apr 6, 2026
(Commit message identical to the one above, with the added trailer:)

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>
