
Commit 8be00bd

Update model ReadMe
1 parent 98a4b48 commit 8be00bd

2 files changed: 98 additions & 22 deletions

src/maxtext/trainers/pre_train/train.py

Lines changed: 0 additions & 2 deletions
@@ -42,8 +42,6 @@
 from MaxText import sharding
 from MaxText.common_types import ShardMode
 from MaxText.globals import EPS
-# pylint: disable-next=unused-import
-from maxtext import maxtext_google

 from MaxText.gradient_accumulation import gradient_accumulation_loss_and_grad
 from MaxText.vocabulary_tiling import vocab_tiling_linen_loss

tests/end_to_end/tpu/qwen/next/run_qwen3_next.md

Lines changed: 98 additions & 20 deletions
@@ -7,6 +7,31 @@ For more details on the architecture, see the [Qwen3 Technical Blog](https://qwe

 * * * * *

+Pre-Training
+---------------------
+You can train from scratch to generate a new checkpoint. An example command to run pre-training with Qwen3-Next on a v5p-64 slice:
+
+```sh
+python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
+  base_output_directory=${BASE_OUTPUT_DIRECTORY} \
+  run_name=q3_next_pre_training \
+  per_device_batch_size=1 \
+  enable_checkpointing=false \
+  model_name=qwen3-next-80b-a3b \
+  ici_fsdp_parallelism=-1 \
+  steps=5 \
+  max_target_length=1024 \
+  async_checkpointing=false \
+  tokenizer_type=huggingface \
+  tokenizer_path=src/maxtext/assets/tokenizers/qwen3-tokenizer \
+  attention=flash \
+  dtype=bfloat16 \
+  weight_dtype=bfloat16 \
+  megablox=False \
+  sparse_matmul=False \
+  dataset_type=synthetic
+```
+
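The commands in this guide reference a few shell variables without defining them. A minimal setup sketch; the bucket names and token are placeholders, and the exact `items` subpath depends on your conversion run:

```sh
# Placeholders -- substitute your own bucket, paths, and token.
export BASE_OUTPUT_DIRECTORY=gs://your-gcs-bucket/qwen3_next_runs
export HF_TOKEN=hf_your_access_token
# Written by the checkpoint-conversion step in the next section; the
# step number and directory layout depend on your run.
export CONVERTED_CHECKPOINT=gs://your-gcs-bucket/qwen3_next_maxtext_ckpt/0/items
```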
 Checkpoint Conversion
 ---------------------

@@ -22,18 +47,20 @@ To get started, you first need a MaxText-compatible checkpoint.
 2. **Convert the Checkpoint**: Run the conversion command below to convert the downloaded Hugging Face weights into the Orbax format required by MaxText.

 ```
-python3 -m maxtext.checkpoint_conversion.standalone_scripts.convert_qwen3_next_scanned \
-  --base_model_path /path/to/qwen3_next_hf_checkpoint \
-  --maxtext_model_path gs://your-gcs-bucket/qwen3_next_maxtext_ckpt \
-  --model_size qwen3-next-80b-a3b
+# Set scan_layers=false to produce an unscanned checkpoint.
+JAX_PLATFORMS=cpu python3 -m maxtext.checkpoint_conversion.to_maxtext src/maxtext/configs/base.yml \
+  model_name=qwen3-next-80b-a3b \
+  base_output_directory=gs://your-gcs-bucket/qwen3_next_maxtext_ckpt \
+  hf_access_token=${HF_TOKEN} \
+  scan_layers=true \
+  use_multimodal=false
 ```

 * * * * *

-Pre-training and Fine-tuning
+Fine-tuning
 ----------------------------

-After converting the checkpoint, you can use it for fine-tuning or start a pre-training run from scratch. The command below is an example for fine-tuning on a v5p-512 slice. To pre-train, simply remove the `load_parameters_path` argument.
+After converting the checkpoint, you can use it for fine-tuning. The command below is an example of fine-tuning on a v5p-64 slice.

 ```
 python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
@@ -43,39 +70,90 @@ python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
 run_name=qwen3_next_finetuning \
 per_device_batch_size=1 \
 model_name=qwen3-next-80b-a3b \
-steps=500 \
-max_target_length=8192 \
-ici_fsdp_parallelism=256 \
+steps=30 \
+max_target_length=4096 \
+ici_fsdp_parallelism=-1 \
 tokenizer_type=huggingface \
 tokenizer_path=src/maxtext/assets/tokenizers/qwen3-tokenizer
+```
+
+## Decoding
+An example command to run decoding with Qwen3-Next on a v5p-64 slice, using an unscanned checkpoint for fast decoding:

+```sh
+python3 -m maxtext.decode src/maxtext/configs/base.yml \
+  base_output_directory=${BASE_OUTPUT_DIRECTORY} \
+  load_parameters_path=${CONVERTED_CHECKPOINT} \
+  run_name=q3-next-decode \
+  per_device_batch_size=1 \
+  enable_checkpointing=false \
+  model_name=qwen3-next-80b-a3b \
+  max_prefill_predict_length=64 \
+  max_target_length=1024 \
+  tokenizer_type=huggingface \
+  tokenizer_path=src/maxtext/assets/tokenizers/qwen3-tokenizer \
+  attention=dot_product \
+  dtype=bfloat16 \
+  weight_dtype=bfloat16 \
+  megablox=False \
+  sparse_matmul=False \
+  ici_tensor_parallelism=1 \
+  ici_fsdp_parallelism=1 \
+  ici_expert_parallelism=-1 \
+  prompt="An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and outputs are all vectors. The output is " \
+  scan_layers=False
 ```
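Decoding with `scan_layers=False` expects an unscanned checkpoint. A sketch of producing one with the conversion command above, with a separate placeholder output bucket:

```sh
# Re-run the conversion with scan_layers=false to emit an unscanned checkpoint,
# then point CONVERTED_CHECKPOINT at the resulting items directory.
JAX_PLATFORMS=cpu python3 -m maxtext.checkpoint_conversion.to_maxtext src/maxtext/configs/base.yml \
  model_name=qwen3-next-80b-a3b \
  base_output_directory=gs://your-gcs-bucket/qwen3_next_maxtext_ckpt_unscanned \
  hf_access_token=${HF_TOKEN} \
  scan_layers=false \
  use_multimodal=false
```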

 * * * * *

 Correctness Validation
 ----------------------

-To verify that the MaxText implementation is numerically equivalent to the original Hugging Face model, you can run the end-to-end test scripts. These scripts automate the logit comparison test for each model.
+We perform two primary checks:

-Before running, you must set the `MAXTEXT_CHECKPOINT_PATH` environment variable. You can also optionally set `HF_MODEL_PATH` to point to a local copy of the Hugging Face model.
+* **Logit Comparison**: We compare the logits generated by our implementation against those from the Hugging Face implementation for a set of given prompts.
+* **MMLU Score Validation**: We validate the MMLU score against established benchmarks.

-### Qwen3-Next-80B-A3B
-
-Bash
+An example command to generate golden logits from Hugging Face for Qwen3-Next:

+```sh
+python3 -m tests.assets.logits_generation.generate_hf_golden_logits \
+  --model-id=Qwen/Qwen3-Next-80B-A3B-Instruct \
+  --output-path=golden_Qwen3_Next.jsonl \
+  --prompts='I love to;Today is a;What is the'
 ```
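`--prompts` takes a semicolon-separated list. To sanity-check the output file (assuming the usual `.jsonl` layout of one JSON record per prompt):

```sh
# Expect three records, one per prompt in the --prompts list.
wc -l golden_Qwen3_Next.jsonl
head -c 300 golden_Qwen3_Next.jsonl
```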
-# Set the required path to your converted MaxText checkpoint
-export MAXTEXT_CHECKPOINT_PATH=gs://your-gcs-bucket/qwen3-next-80b-a3b_maxtext_ckpt/0/items/

-# (Optional) Set the path to your local Hugging Face checkpoint
-# export HF_MODEL_PATH=/path/to/local/qwen3-next-80b-a3b_hf_checkpoint
+You should see logs like the following:

-# Execute the validation script
-bash tests/end_to_end/tpu/qwen/next/qwen3-next-80b-a3b/1_test_qwen3_next_80b_a3b.sh
+```
+...
+File is stored locally at golden_Qwen3_Next.jsonl.
+```

+Run the command below to compare logits between Hugging Face and MaxText:
+
+```sh
+python3 -m tests.utils.forward_pass_logit_checker \
+  src/maxtext/configs/base.yml \
+  tokenizer_type=huggingface \
+  tokenizer_path=Qwen/Qwen3-Next-80B-A3B-Instruct \
+  load_parameters_path=${CONVERTED_CHECKPOINT} \
+  run_name=forward_pass_test_qwen3_next \
+  per_device_batch_size=1 \
+  model_name=qwen3-next-80b-a3b \
+  max_prefill_predict_length=4 \
+  max_target_length=4 \
+  scan_layers=false \
+  sparse_matmul=False \
+  dtype=float32 \
+  activations_in_float32=true \
+  matmul_precision=high \
+  --max_kl_div=2e-4 \
+  --golden_logits_path=${PWD}/golden_Qwen3_Next.jsonl
 ```
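For context on `--max_kl_div=2e-4`: the checker bounds the per-position KL divergence between the golden Hugging Face logits and MaxText's logits. A minimal sketch of that comparison, illustrative only and not the checker's actual code:

```sh
# Illustrative only: the kind of bound --max_kl_div enforces.
python3 - <<'EOF'
import numpy as np

def kl_divergence(golden_logits, test_logits):
    """Per-position KL(golden || test) computed from raw logits."""
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))
    log_p = log_softmax(golden_logits)  # golden (Hugging Face) distribution
    log_q = log_softmax(test_logits)    # MaxText distribution
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)

# Toy [seq_len, vocab] logits; the real checker fails if any position
# exceeds the threshold (2e-4 in the command above).
rng = np.random.default_rng(0)
golden = rng.normal(size=(4, 32))
test = golden + 1e-3 * rng.normal(size=(4, 32))
print(kl_divergence(golden, test) <= 2e-4)
EOF
```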

+To run MMLU benchmarks and validate the model's performance, follow the instructions provided [here](../../../benchmarks/api_server/README.md).
+
 ## Supported MoE Strategies

 This model implementation supports both **Token Dropping** and **Dropless** strategies for Mixture of Experts routing. Take a look at the MaxText [documentation](https://github.com/AI-Hypercomputer/maxtext/blob/main/docs/reference/core_concepts/moe_configuration.md) on MoE configs and the flags to set for your desired strategy.
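As a rough sketch of the flag combinations involved; the authoritative matrix, including `capacity_factor` semantics, is in the linked MoE documentation, so treat these pairings as assumptions to verify there:

```sh
# Dropless routing with sparse (Megablox) kernels.
sparse_matmul=True megablox=True

# Token dropping with dense matmuls: tokens beyond expert capacity are dropped.
sparse_matmul=False capacity_factor=1.25

# Dropless routing with dense matmuls (as in the example commands above).
sparse_matmul=False capacity_factor=-1
```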
