Conversation

@gulsumgudukbay (Collaborator)

Description

This PR is the second part of decoupling support. It adds the decoupling logic itself, along with the test modifications needed to run in decoupled mode.

Details:

  1. Update decoupled_base_test.yml
  2. Add decoupling logic to src/MaxText/decode.py, src/MaxText/elastic_train.py, src/MaxText/experimental/rl/grpo_trainer.py, src/MaxText/gcp_workload_monitor.py, src/MaxText/max_utils.py, src/MaxText/maxengine.py, src/MaxText/maxengine_config.py, src/MaxText/maxengine_server.py, src/MaxText/metric_logger.py, src/MaxText/prefill_packing.py, src/MaxText/profiler.py, src/MaxText/sft/hooks.py, src/MaxText/sft/sft_trainer.py, src/MaxText/train.py, src/MaxText/utils/gcs_utils.py, src/MaxText/utils/goodput_utils.py, src/MaxText/vertex_tensorboard.py
  3. Update src/MaxText/gcloud_stub.py to add IS_STUB variables and a google_cloud_mldiagnostics stub (see the sketch after this list).
  4. Update tests to support decoupled mode (add markers, update file paths, and make them use the decoupled_base_test.yml config file).
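
For reference, a minimal sketch of the decoupling check, assuming decoupled mode is signalled by the DECOUPLE_GCLOUD environment variable mentioned in the review below; the actual is_decoupled() helper in this PR may differ:

```python
# Minimal sketch, not the exact implementation in this PR: decoupled mode is
# assumed to be signalled via the DECOUPLE_GCLOUD environment variable.
import os


def is_decoupled() -> bool:
  """Return True when MaxText should avoid all Google Cloud dependencies."""
  return os.environ.get("DECOUPLE_GCLOUD", "").upper() == "TRUE"
```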

Tests

All unit tests pass in decoupled mode.
UT results:
== 306 passed, 170 skipped, 25 deselected, 6588 warnings in 975.16s (0:16:15) ==

Train test:
python -m MaxText.train MaxText/configs/base.yml run_name=test hardware=gpu steps=5 model_name=llama2-7b attention=cudnn_flash_te enable_checkpointing=False ici_expert_parallelism=1 ici_fsdp_parallelism=-1 ici_data_parallelism=1 remat_policy=minimal scan_layers=True dataset_type=synthetic logits_dot_in_fp32=False dtype=bfloat16 weight_dtype=bfloat16 per_device_batch_size=1 max_target_length=2048 shardy=False

works.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have added necessary comments to my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

gulsumgudukbay and others added 25 commits December 21, 2025 06:16
(cherry picked from commit e8cc951)
(cherry picked from commit 0b58e96)
(cherry picked from commit 14f0508)
(cherry picked from commit e43e370)
(cherry picked from commit 1c14d6c)
…ck, todo: remove this after updating jax. Configure ICI data parallelism for decoupled mode
from MaxText.globals import MAXTEXT_PKG_DIR
from maxtext.tests.test_utils import get_test_config_path

pytestmark = [pytest.mark.tpu_only]

Collaborator:

These tests are supposed to run on CPUs. Why are we adding the tpu_only marker?


Collaborator (Author):

> These tests are supposed to run on CPUs. Why are we adding the tpu_only marker?

They are supposed to run on CPUs; however, libtpu is required to generate the TPU topology. In environments without libtpu, the test errors out.
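
As a rough illustration of the alternative, a conftest.py could skip tpu_only tests when libtpu is unavailable rather than deselecting them up front; this is only a sketch with assumed names, not code from this PR:

```python
# Hypothetical conftest.py sketch: skip tpu_only-marked tests when JAX cannot
# see a TPU backend (i.e. libtpu is not installed). Illustrative only.
import jax
import pytest


def _tpu_available() -> bool:
  """Return True if JAX reports at least one TPU device."""
  try:
    return len(jax.devices("tpu")) > 0
  except RuntimeError:
    return False


def pytest_collection_modifyitems(config, items):
  if _tpu_available():
    return
  skip_tpu = pytest.mark.skip(reason="libtpu not available; cannot build TPU topology")
  for item in items:
    if "tpu_only" in item.keywords:
      item.add_marker(skip_tpu)
```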


# Leave dataset-related keys to be overridden by individual tests.
dataset_type: ""
#dataset_type: ""

Collaborator:

Is this #dataset_type intentional?

from MaxText.globals import MAXTEXT_ASSETS_ROOT, MAXTEXT_PKG_DIR
from tests.test_utils import get_test_config_path, get_test_dataset_path, get_test_base_output_directory

pytestmark = [pytest.mark.tpu_only, pytest.mark.external_serving]

Collaborator:

There are a few tests in this script that run on GPU only. Can you remove pytest.mark.tpu_only?

class DecodeTests(unittest.TestCase):
"""Tests decode with various configs."""

decoupled = is_decoupled()

Collaborator:

nit: this is unused.

run_checkpoint_compatibility("tpu", "autoselected")


@pytest.mark.external_serving

Collaborator:

Should be external_training

In decoupled mode (DECOUPLE_GCLOUD=TRUE) cloud diagnostics may be stubbed; if so, skip wrapping.
"""
if is_decoupled() or getattr(diagnostic, "__class__", None).__name__ == "_StubDiag": # runtime skip

Collaborator:

Instead, you can use contextlib.nullcontext to conditionally apply the diagnostic wrapper while keeping a single, clean call to your training loop.
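
A minimal sketch of that suggestion, with placeholder names (maybe_diagnose, train_loop) rather than the actual call sites in train.py:

```python
# Sketch only: wrap the diagnostics context so the training call stays a single,
# unconditional `with` block. Names here are placeholders, not MaxText's code.
import contextlib


def maybe_diagnose(diagnostic, diagnostic_config, decoupled: bool):
  """Return the real diagnose() context manager, or a no-op one in decoupled mode."""
  if decoupled or diagnostic is None:
    return contextlib.nullcontext()
  return diagnostic.diagnose(diagnostic_config)


# The call site then reads as a single clean block:
#   with maybe_diagnose(diagnostic, diagnostic_config, is_decoupled()):
#     train_loop(config)
```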

vertex_tensorboard_manager = VertexTensorboardManager()
if config.use_vertex_tensorboard or os.environ.get("UPLOAD_DATA_TO_TENSORBOARD"):
vertex_tensorboard_manager.configure_vertex_tensorboard(config)
if _vertex_tb_is_stub:

Collaborator:

Can this if check be moved into configure_vertex_tensorboard() to keep train.py clean?
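
A sketch of what that could look like; _vertex_tb_is_stub and max_logging mirror the names in the diff above, and the rest of the class is elided:

```python
# Sketch only: move the stub check into the manager so train.py calls
# configure_vertex_tensorboard() unconditionally. Not the actual implementation.
from MaxText import max_logging


class VertexTensorboardManager:
  """Stub-aware Vertex Tensorboard manager (heavily abridged)."""

  def configure_vertex_tensorboard(self, config):
    if _vertex_tb_is_stub:  # assumed to be importable from the gcloud stub module
      max_logging.log("[DECOUPLED NO-OP] skipping Vertex Tensorboard configuration.")
      return
    # ... existing Vertex AI Tensorboard configuration logic ...
```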

elif (
config.report_heartbeat_metric_for_gcp_monitoring or config.report_performance_metric_for_gcp_monitoring
) and _monitor_is_stub:
max_logging.log("[DECOUPLED NO-OP] skipping GCP workload monitoring threads.")

Collaborator:

We can make this the first check inside get_performance_metric_queue().
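
For illustration, the early return could look roughly like this; the flag name and queue details are assumptions based on the diff above, not the real gcp_workload_monitor.py:

```python
# Sketch only: make the stub check the first thing get_performance_metric_queue()
# does, so callers in train.py need no special-casing.
import queue

from MaxText import max_logging

_monitor_is_stub = True  # in MaxText this flag would come from the gcloud stub module


def get_performance_metric_queue(config):
  """Return a metric queue, or None when GCP monitoring is stubbed out (decoupled mode)."""
  if _monitor_is_stub:
    max_logging.log("[DECOUPLED NO-OP] skipping GCP workload monitoring threads.")
    return None
  performance_metric_queue = queue.Queue()
  # ... start the monitoring threads that consume from performance_metric_queue ...
  return performance_metric_queue
```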

__all__.append("vertex_tensorboard_components")

# ---------------- TensorBoardX (moved stub) -----------------
# ---------------- ML Diagnostics (google_cloud_mldiagnostics) -----------------

Collaborator:

The stub classes use both the _Stub and _Dummy prefixes. Can we use a single convention, such as _Stub, throughout the module for consistency?
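
On the naming point, a sketch of what a _Stub-prefixed no-op diagnostics stub could look like; _StubDiag here just mirrors the class name checked in the train.py diff, and the real gcloud_stub.py entry may differ:

```python
# Purely illustrative no-op stub for google_cloud_mldiagnostics in decoupled mode,
# using the _Stub prefix consistently. Not the actual stub added in this PR.
import contextlib


class _StubDiag:
  """No-op replacement for the google_cloud_mldiagnostics client."""

  @contextlib.contextmanager
  def diagnose(self, diagnostic_config):
    # Yield immediately without contacting any Google Cloud service.
    yield


diagnostic = _StubDiag()
```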

with diagnostic.diagnose(diagnostic_config):
with maybe_record_goodput(recorder, GoodputEvent.JOB), maybe_monitor_goodput(config):
# In decoupled mode or when diagnostics are stubbed, skip the diagnose wrapper
if is_decoupled() or getattr(diagnostic, "__class__", None).__name__ == "_StubDiag":

Collaborator:

Same comment as in train.py.

from MaxText.layers import models
import pytest

pytestmark = [pytest.mark.external_serving] # uses pre-generated checkpoint

Collaborator:

Should be external_training.

from MaxText.experimental.rl import grpo_utils

# This test is for serving pathways via offline_engine and maxengine.
pytestmark = [pytest.mark.external_serving]

Collaborator:

Should be external_training.
