RecursionError: maximum recursion depth exceeded while calling a Python object #59

@wangkevin02

Description

When running the code on a single machine with multiple GPUs, I encountered the error below whether I used DeepSpeed ZeRO or ZeRO-3. Specifically, it is raised while Accelerate initializes the distributed model and ends in a RecursionError. For context, I used Llama3 as both the critic model and the policy model, and my goal was simply to run the entire code end-to-end.
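
For reference, the same failure mode can be reproduced in isolation when the module graph contains a reference cycle, since `Module._apply()` (which backs `.bfloat16()`, `.half()`, and `.to()`) walks child modules recursively with no cycle detection. The `Wrapper` class below is hypothetical and is only a minimal sketch of that mechanism, not code from this repo:

```python
import torch.nn as nn

# Hypothetical minimal reproduction: a reference cycle in the module graph
# makes Module._apply() recurse without bound.
class Wrapper(nn.Module):
    def __init__(self, inner: nn.Module):
        super().__init__()
        self.inner = inner          # registers `inner` as a child of the wrapper

policy = nn.Linear(4, 4)
wrapped = Wrapper(policy)
wrapped.inner.back_ref = wrapped    # registers the wrapper as a child of `inner` -> cycle

wrapped.bfloat16()                  # RecursionError: maximum recursion depth exceeded
```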

Code Modifications

I only made changes to a portion of the data preprocessing code and did not alter the model class code.

Troubleshooting Steps Tried

I tried changing the DeepSpeed version, but that did not resolve the problem.
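
For completeness, raising Python's recursion limit before the trainer is constructed is a generic mitigation, but it only helps if the module hierarchy is genuinely deep yet finite; it will not fix a true reference cycle. A minimal sketch (placing it at the top of train_ppo.py is my assumption):

```python
import sys

# Generic mitigation, not a fix: the default limit is 1000 frames. Raise it
# before constructing PPOTrainer / calling accelerator.prepare(). With a
# reference cycle in the module graph the recursion is unbounded and this
# will still fail, just later.
sys.setrecursionlimit(10_000)
```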

Traceback (most recent call last):
  File "train_ppo.py", line 228, in <module>
    main(opt)
  File "train_ppo.py", line 220, in main
    trainer = PPOTrainer(opt, policy_model, ref_model, critic_model, reward_model, accelerator)
  File "/hy-tmp/My_MOSS-RLHF/ppo/ppo_trainer.py", line 116, in __init__
    self.model, self.optimizer, self.scheduler = self.accelerator.prepare(self.model, self.optimizer, self.scheduler)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/accelerate/accelerator.py", line 1344, in prepare
    result = self._prepare_deepspeed(*args)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/accelerate/accelerator.py", line 1851, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = ds_initialize(**kwargs)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/deepspeed/__init__.py", line 193, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 271, in __init__
    self._configure_distributed_model(model)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1160, in _configure_distributed_model
    self.module.bfloat16()
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 856, in bfloat16
    return self._apply(lambda t: t.bfloat16() if t.is_floating_point() else t)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  [Previous line repeated 982 more times]
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 663, in _apply
    with torch.no_grad():
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 133, in __enter__
    torch.set_grad_enabled(False)
  File "/usr/local/miniconda3/envs/rlhf/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 228, in __init__
    self.prev = torch.is_grad_enabled()
RecursionError: maximum recursion depth exceeded while calling a Python object
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58428 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58430 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58431 closing signal SIGTERM
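
Since roughly 985 consecutive `Module._apply` frames is far deeper than the actual nesting of a Llama model, a cycle in the module graph seems more likely than sheer depth. A small, hypothetical helper (not part of this repo) could be run on `self.model` just before `accelerator.prepare(...)` in `ppo_trainer.py` to confirm or rule that out:

```python
import torch.nn as nn

def check_module_graph(root: nn.Module) -> None:
    """Hypothetical diagnostic: depth-first walk over child modules that raises
    if any module appears twice on the current path (a reference cycle) and
    otherwise reports the deepest nesting level. Shared (acyclic) submodules
    are simply revisited, which is fine for typical transformer models."""
    deepest = 0

    def visit(module: nn.Module, path: list) -> None:
        nonlocal deepest
        if any(m is module for m in path):
            chain = " -> ".join(type(m).__name__ for m in path + [module])
            raise RuntimeError(f"module reference cycle: {chain}")
        deepest = max(deepest, len(path) + 1)
        for child in module.children():
            visit(child, path + [module])

    visit(root, [])
    print(f"no cycles found; deepest module nesting: {deepest}")
```

If this raises before `accelerator.prepare` is ever reached, the problem lies in how the policy/critic/reward models are composed rather than in DeepSpeed or Accelerate.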
