
Commit 8be00bd

Update model ReadMe
1 parent 98a4b48 commit 8be00bd

2 files changed: 98 additions & 22 deletions

src/maxtext/trainers/pre_train/train.py

Lines changed: 0 additions & 2 deletions
@@ -42,8 +42,6 @@
 from MaxText import sharding
 from MaxText.common_types import ShardMode
 from MaxText.globals import EPS
-# pylint: disable-next=unused-import
-from maxtext import maxtext_google

 from MaxText.gradient_accumulation import gradient_accumulation_loss_and_grad
 from MaxText.vocabulary_tiling import vocab_tiling_linen_loss

tests/end_to_end/tpu/qwen/next/run_qwen3_next.md

Lines changed: 98 additions & 20 deletions
@@ -7,6 +7,31 @@ For more details on the architecture, see the [Qwen3 Technical Blog](https://qwe

 * * * * *

+Pre-Training
+---------------------
+You can train from scratch to generate a new checkpoint. An example command to run pre-training with Qwen3-Next on a v5p-64 slice:
+
+```sh
+python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
+  base_output_directory=${BASE_OUTPUT_DIRECTORY} \
+  run_name=q3_next_pre_training \
+  per_device_batch_size=1 \
+  enable_checkpointing=false \
+  model_name=qwen3-next-80b-a3b \
+  ici_fsdp_parallelism=-1 \
+  steps=5 \
+  max_target_length=1024 \
+  async_checkpointing=false \
+  tokenizer_type=huggingface \
+  tokenizer_path=src/maxtext/assets/tokenizers/qwen3-tokenizer \
+  attention=flash \
+  dtype=bfloat16 \
+  weight_dtype=bfloat16 \
+  megablox=False \
+  sparse_matmul=False \
+  dataset_type=synthetic
+```
+
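The commands in this guide reference a few shell variables without defining them. A minimal setup sketch; the bucket names and token are placeholders, and the exact `items` subpath depends on your conversion run:

```sh
# Placeholders -- substitute your own bucket, paths, and token.
export BASE_OUTPUT_DIRECTORY=gs://your-gcs-bucket/qwen3_next_runs
export HF_TOKEN=hf_your_access_token
# Written by the checkpoint-conversion step in the next section; the
# step number and directory layout depend on your run.
export CONVERTED_CHECKPOINT=gs://your-gcs-bucket/qwen3_next_maxtext_ckpt/0/items
```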
 Checkpoint Conversion
 ---------------------

@@ -22,18 +47,20 @@ To get started, you first need a MaxText-compatible checkpoint.
 2. **Convert the Checkpoint**: Run the conversion command below to convert the downloaded Hugging Face weights into the Orbax format required by MaxText.

 ```
-python3 -m maxtext.checkpoint_conversion.standalone_scripts.convert_qwen3_next_scanned \
-  --base_model_path /path/to/qwen3_next_hf_checkpoint \
-  --maxtext_model_path gs://your-gcs-bucket/qwen3_next_maxtext_ckpt \
-  --model_size qwen3-next-80b-a3b
+# Set scan_layers=false to produce an unscanned checkpoint.
+JAX_PLATFORMS=cpu python3 -m maxtext.checkpoint_conversion.to_maxtext src/maxtext/configs/base.yml \
+  model_name=qwen3-next-80b-a3b \
+  base_output_directory=gs://your-gcs-bucket/qwen3_next_maxtext_ckpt \
+  hf_access_token=${HF_TOKEN} \
+  scan_layers=true \
+  use_multimodal=false
 ```

 * * * * *

-Pre-training and Fine-tuning
+Fine-tuning
 ----------------------------

-After converting the checkpoint, you can use it for fine-tuning or start a pre-training run from scratch. The command below is an example for fine-tuning on a v5p-512 slice. To pre-train, simply remove the `load_parameters_path` argument.
+After converting the checkpoint, you can use it for fine-tuning. The command below is an example of fine-tuning on a v5p-64 slice.

 ```
 python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
@@ -43,39 +70,90 @@ python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
 run_name=qwen3_next_finetuning \
 per_device_batch_size=1 \
 model_name=qwen3-next-80b-a3b \
-steps=500 \
-max_target_length=8192 \
-ici_fsdp_parallelism=256 \
+steps=30 \
+max_target_length=4096 \
+ici_fsdp_parallelism=-1 \
 tokenizer_type=huggingface \
 tokenizer_path=src/maxtext/assets/tokenizers/qwen3-tokenizer
+```
+
+## Decoding
+An example command to run decoding with Qwen3-Next on a v5p-64 slice, using an unscanned checkpoint for fast decoding:

+```sh
+python3 -m maxtext.decode src/maxtext/configs/base.yml \
+  base_output_directory=${BASE_OUTPUT_DIRECTORY} \
+  load_parameters_path=${CONVERTED_CHECKPOINT} \
+  run_name=q3-next-decode \
+  per_device_batch_size=1 \
+  enable_checkpointing=false \
+  model_name=qwen3-next-80b-a3b \
+  max_prefill_predict_length=64 \
+  max_target_length=1024 \
+  tokenizer_type=huggingface \
+  tokenizer_path=src/maxtext/assets/tokenizers/qwen3-tokenizer \
+  attention=dot_product \
+  dtype=bfloat16 \
+  weight_dtype=bfloat16 \
+  megablox=False \
+  sparse_matmul=False \
+  ici_tensor_parallelism=1 \
+  ici_fsdp_parallelism=1 \
+  ici_expert_parallelism=-1 \
+  prompt="An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and outputs are all vectors. The output is " \
+  scan_layers=False
 ```
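Decoding with `scan_layers=False` expects an unscanned checkpoint. A sketch of producing one with the conversion command above, with a separate placeholder output bucket:

```sh
# Re-run the conversion with scan_layers=false to emit an unscanned checkpoint,
# then point CONVERTED_CHECKPOINT at the resulting items directory.
JAX_PLATFORMS=cpu python3 -m maxtext.checkpoint_conversion.to_maxtext src/maxtext/configs/base.yml \
  model_name=qwen3-next-80b-a3b \
  base_output_directory=gs://your-gcs-bucket/qwen3_next_maxtext_ckpt_unscanned \
  hf_access_token=${HF_TOKEN} \
  scan_layers=false \
  use_multimodal=false
```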

 * * * * *

 Correctness Validation
 ----------------------

-To verify that the MaxText implementation is numerically equivalent to the original Hugging Face model, you can run the end-to-end test scripts. These scripts automate the logit comparison test for each model.
+We perform two primary checks:

-Before running, you must set the `MAXTEXT_CHECKPOINT_PATH` environment variable. You can also optionally set `HF_MODEL_PATH` to point to a local copy of the Hugging Face model.
+* **Logit Comparison**: We compare the logits generated by our implementation against those from the Hugging Face implementation for a set of given prompts.
+* **MMLU Score Validation**: We validate the MMLU score against established benchmarks.

-### Qwen3-Next-80B-A3B
-
-Bash
+An example command to generate golden logits from Hugging Face for Qwen3-Next:

+```sh
+python3 -m tests.assets.logits_generation.generate_hf_golden_logits \
+  --model-id=Qwen/Qwen3-Next-80B-A3B-Instruct \
+  --output-path=golden_Qwen3_Next.jsonl \
+  --prompts='I love to;Today is a;What is the'
 ```
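`--prompts` takes a semicolon-separated list. To sanity-check the output file (assuming the usual `.jsonl` layout of one JSON record per prompt):

```sh
# Expect three records, one per prompt in the --prompts list.
wc -l golden_Qwen3_Next.jsonl
head -c 300 golden_Qwen3_Next.jsonl
```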
-# Set the required path to your converted MaxText checkpoint
-export MAXTEXT_CHECKPOINT_PATH=gs://your-gcs-bucket/qwen3-next-80b-a3b_maxtext_ckpt/0/items/

-# (Optional) Set the path to your local Hugging Face checkpoint
-# export HF_MODEL_PATH=/path/to/local/qwen3-next-80b-a3b_hf_checkpoint
+You should see logs like the following:

-# Execute the validation script
-bash tests/end_to_end/tpu/qwen/next/qwen3-next-80b-a3b/1_test_qwen3_next_80b_a3b.sh
+```
+...
+File is stored locally at golden_Qwen3_Next.jsonl.
+```

+Run the command below to compare logits between Hugging Face and MaxText:
+
+```sh
+python3 -m tests.utils.forward_pass_logit_checker \
+  src/maxtext/configs/base.yml \
+  tokenizer_type=huggingface \
+  tokenizer_path=Qwen/Qwen3-Next-80B-A3B-Instruct \
+  load_parameters_path=${CONVERTED_CHECKPOINT} \
+  run_name=forward_pass_test_qwen3_next \
+  per_device_batch_size=1 \
+  model_name=qwen3-next-80b-a3b \
+  max_prefill_predict_length=4 \
+  max_target_length=4 \
+  scan_layers=false \
+  sparse_matmul=False \
+  dtype=float32 \
+  activations_in_float32=true \
+  matmul_precision=high \
+  --max_kl_div=2e-4 \
+  --golden_logits_path=${PWD}/golden_Qwen3_Next.jsonl
 ```
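For context on `--max_kl_div=2e-4`: the checker bounds the per-position KL divergence between the golden Hugging Face logits and MaxText's logits. A minimal sketch of that comparison, illustrative only and not the checker's actual code:

```sh
# Illustrative only: the kind of bound --max_kl_div enforces.
python3 - <<'EOF'
import numpy as np

def kl_divergence(golden_logits, test_logits):
    """Per-position KL(golden || test) computed from raw logits."""
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))
    log_p = log_softmax(golden_logits)  # golden (Hugging Face) distribution
    log_q = log_softmax(test_logits)    # MaxText distribution
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)

# Toy [seq_len, vocab] logits; the real checker fails if any position
# exceeds the threshold (2e-4 in the command above).
rng = np.random.default_rng(0)
golden = rng.normal(size=(4, 32))
test = golden + 1e-3 * rng.normal(size=(4, 32))
print(kl_divergence(golden, test) <= 2e-4)
EOF
```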

+To run MMLU benchmarks and validate the model's performance, follow the instructions provided [here](../../../benchmarks/api_server/README.md).
+
 ## Supported MoE Strategies

 This model implementation supports both **Token Dropping** and **Dropless** strategies for Mixture of Experts routing. Take a look at the MaxText [documentation](https://github.com/AI-Hypercomputer/maxtext/blob/main/docs/reference/core_concepts/moe_configuration.md) on MoE configs and the flags to set for your desired strategy.
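As a rough sketch of the flag combinations involved; the authoritative matrix, including `capacity_factor` semantics, is in the linked MoE documentation, so treat these pairings as assumptions to verify there:

```sh
# Dropless routing with sparse (Megablox) kernels.
sparse_matmul=True megablox=True

# Token dropping with dense matmuls: tokens beyond expert capacity are dropped.
sparse_matmul=False capacity_factor=1.25

# Dropless routing with dense matmuls (as in the example commands above).
sparse_matmul=False capacity_factor=-1
```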
