
[PyT][Test] Add xfailing FSDP2 memory leak detection tests #2803

Merged
vthumbe1503 merged 3 commits into NVIDIA:main from pstjohn:pstjohn/fsdp2-mem-leak-tests
Apr 3, 2026

Conversation

@pstjohn (Contributor) commented Mar 25, 2026

Summary

Issue #2681: FP8 weight copy accumulation during forward

FP8 weight copies created by te.autocast() accumulate across layers (~0.68 MiB/layer excess over bf16 baseline). Detected for all 5 recipes with no_quant_init.

Issue #2717: Transpose cache retained after backward

_create_transpose tensors persist after backward until the next forward frees them (~3 MiB excess over bf16). Detected for DelayedScaling and Float8CurrentScaling with quant_init.
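Both detections reduce to simple arithmetic on memory snapshots. A minimal pure-Python sketch (the helper names and the example numbers below are illustrative, not the exact code in the test file):

```python
KiB = 1024

def per_layer_excess(fp8_increments, bf16_increments):
    """Issue #2681 check: average per-layer forward increment, FP8 minus bf16."""
    fp8_avg = sum(fp8_increments) / len(fp8_increments)
    bf16_avg = sum(bf16_increments) / len(bf16_increments)
    return fp8_avg - bf16_avg

def backward_excess(fp8_post_fwd, fp8_post_bwd, bf16_post_fwd, bf16_post_bwd):
    """Issue #2717 check: backward delta (post_bwd - post_fwd), FP8 minus bf16."""
    return (fp8_post_bwd - fp8_post_fwd) - (bf16_post_bwd - bf16_post_fwd)

# Hypothetical numbers in the ballpark of the issue descriptions
# (~0.68 MiB/layer forward excess over a bf16 baseline):
fwd_excess = per_layer_excess([4.68 * 1024 * KiB] * 8, [4.0 * 1024 * KiB] * 8)
assert fwd_excess > 50 * KiB  # would trip a 50 KiB/layer threshold
```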

New tests (in run_fsdp2_mem_leak.py)

| Test | Type | What it checks |
| --- | --- | --- |
| test_bf16_no_excess_forward_memory | control (PASS) | bf16 per-layer increments are uniform |
| test_bf16_no_excess_backward_memory | control (PASS) | bf16 vs bf16 backward delta shows zero excess |
| test_fp8_temp_accumulation_across_layers | xfail | FP8 per-layer forward increment exceeds bf16 |
| test_transpose_cache_retained_after_backward | xfail | FP8 backward delta exceeds bf16 baseline |
All FP8 tests parametrized over 5 recipes × {no_quant_init, quant_init}.
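The parametrization grid can be sketched as follows. Only DelayedScaling and Float8CurrentScaling are named in this PR, so the other recipe identifiers below are placeholders, and the fixture wiring is a sketch rather than the file's exact decorators (the source confirms `strict=False` on the xfail marks):

```python
import itertools

import pytest

# Only the first two recipe names appear in the PR text; the rest are placeholders.
RECIPES = [
    "delayed_scaling",
    "float8_current_scaling",
    "recipe_3",  # placeholder
    "recipe_4",  # placeholder
    "recipe_5",  # placeholder
]
INIT_MODES = ["no_quant_init", "quant_init"]

# 5 recipes x 2 init modes -> 10 combinations per FP8 test
PARAM_GRID = list(itertools.product(RECIPES, INIT_MODES))

@pytest.mark.xfail(strict=False, reason="Issue #2681: FP8 temporaries accumulate")
@pytest.mark.parametrize("recipe_name,quant_init", PARAM_GRID)
def test_fp8_temp_accumulation_across_layers(recipe_name, quant_init):
    ...  # measure per-layer forward increments and compare against bf16 baseline
```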

Test plan

  • pytest tests/pytorch/distributed/test_torch_fsdp2.py — all 4 outer tests pass (including existing model and fused_adam tests)
  • bf16 control tests PASS
  • FP8 accumulation tests XFAIL for affected configurations
  • Pre-commit hooks pass

greptile-apps bot (Contributor) commented Mar 25, 2026

Greptile Summary

This PR adds a new test file run_fsdp2_mem_leak.py to the existing FSDP2 test suite, containing four tests that document two known memory issues with FSDP2 + FP8 (issues #2681 and #2717). Two control tests validate the measurement methodology using pure bf16 models, while two xfail tests detect FP8 temporary tensor accumulation during forward and transpose-cache retention after backward, parametrized over five FP8 recipes × two init modes. The outer test_torch_fsdp2.py adds a launcher test following the existing subprocess-via-torchrun pattern.

  • The three issues flagged in prior review threads (standalone-runner argument passing for control tests, "4-layer" stale comment, and the unused MEASURED_STEPS constant) are all fully resolved in this version.
  • The recipe_name fixture is correctly provided by the shared conftest.py in the fsdp2_tests directory, so pytest parametrization will work as expected.
  • The _PARAMETRIZED_TESTS guard in the __main__ block correctly dispatches recipe/init arguments only to the two FP8 tests and calls the two control tests with no arguments.
  • The tolerance formula 0.1 * abs(avg_increment) + 1024 in the bf16 control test may fire unexpectedly if avg_increment is very small (e.g., near zero), leaving only 1 KiB of slack — this is a minor fragility worth watching but unlikely to cause CI instability in practice.
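The degenerate case flagged in the last bullet is easy to see numerically. A minimal sketch of the tolerance formula (the function name is ours, not the test file's):

```python
def bf16_forward_tolerance(avg_increment: float) -> float:
    """Tolerance from the review note: 10% of the average increment plus a 1 KiB floor."""
    return 0.1 * abs(avg_increment) + 1024

# With a healthy per-layer average (hypothetical 4 MiB), slack is generous:
assert bf16_forward_tolerance(4 * 1024 * 1024) > 400 * 1024  # > 400 KiB

# With a near-zero average, only the 1 KiB floor remains:
assert bf16_forward_tolerance(0.0) == 1024
```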

Confidence Score: 5/5

Safe to merge — all three previously raised concerns are resolved and no new P0/P1 issues found.

All prior review issues (standalone runner crash, stale layer-count comment, unused MEASURED_STEPS) are addressed. The only remaining note is a P2 style suggestion about the tolerance floor in the bf16 control test. Code follows existing patterns, xfail decoration is correct with strict=False, and the conftest.py fixture wiring is sound.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/pytorch/distributed/fsdp2_tests/run_fsdp2_mem_leak.py | New test file implementing four memory-leak detection tests; all three prior review issues are resolved. Minor fragility: the bf16 forward-control tolerance degenerates to just 1 KiB when per-layer activation memory averages near zero. |
| tests/pytorch/distributed/test_torch_fsdp2.py | Adds test_fsdp2_mem_leak_tests() following the identical subprocess-via-torchrun pattern used by the other two outer tests; skip conditions and returncode assertion are correct. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[test_fsdp2_mem_leak_tests\nouter pytest runner] -->|torchrun -m pytest| B[run_fsdp2_mem_leak.py]

    B --> C[test_bf16_no_excess_forward_memory\ncontrol - PASS]
    B --> D[test_bf16_no_excess_backward_memory\ncontrol - PASS]
    B --> E[test_fp8_temp_accumulation_across_layers\nxfail - Issue 2681]
    B --> F[test_transpose_cache_retained_after_backward\nxfail - Issue 2717]

    C --> G[_LayerMemoryTracker hooks\nper-layer forward increments]
    G --> H{max_deviation ≤\n10% avg + 1KiB?}
    H -->|yes| I[PASS]

    D --> J[_measure_backward_memory_delta\nbf16 vs bf16]
    J --> K{abs excess ≤\n256 KiB?}
    K -->|yes| I

    E --> L[bf16 baseline\n_measure_forward_increments]
    E --> M[FP8 model\n_measure_forward_increments]
    L & M --> N{fp8_avg - bf16_avg ≤\n50 KiB/layer?}
    N -->|no - xfail expected| O[XFAIL]

    F --> P[bf16 baseline\n_measure_backward_memory_delta]
    F --> Q[FP8 model\n_measure_backward_memory_delta]
    P & Q --> R{fp8_delta - bf16_delta ≤\n256 KiB?}
    R -->|no - xfail expected| O
```


Add tests that demonstrate two known memory issues with FSDP2 + FP8:

- Issue NVIDIA#2681: FP8 weight copies created during te.autocast() forward pass
  accumulate across layers instead of being freed between layers, defeating
  FSDP2's memory efficiency. Detected by comparing per-layer forward memory
  increments against a bf16 baseline using layer hooks.

- Issue NVIDIA#2717: Transpose cache tensors (_create_transpose) allocated during
  backward persist until the next forward pass instead of being freed after
  backward completes. Detected by comparing the backward memory delta
  (post_bwd - post_fwd) against a bf16 baseline.

New tests:
- test_bf16_no_excess_forward_memory: control, validates per-layer measurement
- test_bf16_no_excess_backward_memory: control, validates backward delta comparison
- test_fp8_temp_accumulation_across_layers: xfail, detects NVIDIA#2681
- test_transpose_cache_retained_after_backward: xfail, detects NVIDIA#2717

All parametrized over 5 FP8 recipes x {no_quant_init, quant_init}.

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
@pstjohn pstjohn force-pushed the pstjohn/fsdp2-mem-leak-tests branch from 29cd628 to 27a505f on March 30, 2026 19:36
vthumbe1503 (Collaborator) commented Apr 2, 2026

LGTM.

vthumbe1503 (Collaborator) commented:

/te-ci L1 pytorch

@vthumbe1503 vthumbe1503 (Collaborator) left a comment:


CI is green. Changes LGTM. Hopefully this PR fixes the xfailing tests.

@vthumbe1503 vthumbe1503 merged commit 8cf3c16 into NVIDIA:main Apr 3, 2026
10 of 12 checks passed
KshitijLakhani pushed a commit that referenced this pull request Apr 6, 2026
(Commit message identical to the one above, with the added trailer:)

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>
