Description
When running the code on a single machine with multiple GPUs, I encounter an error regardless of whether I use DeepSpeed ZeRO or ZeRO-3. The failure happens when Accelerate initializes the distributed model and ends in a RecursionError. For context, I used Llama3 as both the policy model and the critic model, and my goal was simply to run the entire pipeline end-to-end.
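For illustration only (I am not claiming this is the actual root cause here): the long run of repeated module._apply frames in the traceback below is the pattern you get when the module tree contains a cycle, i.e. a module that ends up registered as its own descendant, so that .bfloat16() walks the children forever. A minimal, self-contained sketch (nothing here is from MOSS-RLHF) that reproduces the same RecursionError:

import torch.nn as nn

class Wrapper(nn.Module):
    def __init__(self, inner: nn.Module):
        super().__init__()
        self.inner = inner

model = Wrapper(nn.Linear(4, 4))
# Assigning a Module-valued attribute registers it as a child, so this
# accidentally makes the wrapper its own descendant (a cycle).
model.self_ref = model

try:
    # bfloat16() calls Module._apply, which recurses over children and
    # therefore never terminates on a cyclic module tree.
    model.bfloat16()
except RecursionError as err:
    print("RecursionError:", err)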
Code Modifications
I only made changes to a portion of the data preprocessing code and did not alter the model class code.
Troubleshooting Steps Tried
I tried changing the DeepSpeed version, but this did not resolve the problem. The full traceback is below.
Error Traceback
Traceback (most recent call last):
  File "train_ppo.py", line 228, in <module>
    main(opt)
  File "train_ppo.py", line 220, in main
    trainer = PPOTrainer(opt, policy_model, ref_model, critic_model, reward_model, accelerator)
  File "/hy-tmp/My_MOSS-RLHF/ppo/ppo_trainer.py", line 116, in __init__
    self.model, self.optimizer, self.scheduler = self.accelerator.prepare(self.model, self.optimizer, self.scheduler)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/accelerate/accelerator.py", line 1344, in prepare
    result = self._prepare_deepspeed(*args)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/accelerate/accelerator.py", line 1851, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = ds_initialize(**kwargs)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/deepspeed/__init__.py", line 193, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 271, in __init__
    self._configure_distributed_model(model)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1160, in _configure_distributed_model
    self.module.bfloat16()
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 856, in bfloat16
    return self._apply(lambda t: t.bfloat16() if t.is_floating_point() else t)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  [Previous line repeated 982 more times]
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 663, in _apply
    with torch.no_grad():
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 133, in __enter__
    torch.set_grad_enabled(False)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 228, in __init__
    self.prev = torch.is_grad_enabled()
RecursionError: maximum recursion depth exceeded while calling a Python object
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58428 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58430 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58431 closing signal SIGTERM
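Since the traceback shows _apply recursing roughly 982 times before hitting Python's default limit, my next step is to inspect the structure of the wrapped model right before self.accelerator.prepare(...) is called in ppo_trainer.py. The helpers below are only my own sketch (the function names and where I call them are assumptions, not repo code): one walks the module tree looking for a cycle, the other reports the deepest nesting level seen by named_modules().

import torch.nn as nn

def find_module_cycle(root: nn.Module):
    """Return the attribute path of the first back edge in the module tree, or None."""
    def dfs(module, path, on_path):
        if id(module) in on_path:
            return path                      # module is its own ancestor -> cycle
        on_path.add(id(module))
        for name, child in module.named_children():
            hit = dfs(child, path + [name], on_path)
            if hit is not None:
                return hit
        on_path.discard(id(module))
        return None
    return dfs(root, [], set())

def max_module_depth(root: nn.Module) -> int:
    """Deepest nesting level reported by named_modules(), which deduplicates shared modules."""
    depths = [name.count(".") + 1 for name, _ in root.named_modules() if name]
    return max(depths, default=0)

# Example usage (hypothetical placement, right before accelerator.prepare):
# cycle = find_module_cycle(self.model)
# print("cycle:", ".".join(cycle) if cycle else None,
#       "| max depth:", max_module_depth(self.model))

If this reports a cycle, that would explain the RecursionError independently of the DeepSpeed or Accelerate version; if it reports no cycle but an unusually large depth, raising sys.setrecursionlimit might be the relevant workaround instead.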