Add optional torchembed RoPE backend to apply_rotary_pos_emb #8052

Open
py-ai-dev wants to merge 1 commit into deepspeedai:master from py-ai-dev:add-torchembed-rope-backend

@py-ai-dev

Adds torchembed as an optional fused RoPE backend for deepspeed.sequence.layer.apply_rotary_pos_emb(), following the same pattern used in transformers and vLLM.

Changes

  • deepspeed/sequence/layer.py: Add try/except ImportError guard for torchembed._triton.fused_rope_forward. When torchembed is installed, the tensor is on CUDA, and rotary_dim is even, the function dispatches to the fused triton kernel instead of the PyTorch reference path.

  • setup.py: Add torchembed extras key (pip install deepspeed[torchembed]).

  • tests/unit/sequence/test_apply_rotary_pos_emb.py: Add numerical correctness tests against the PyTorch reference across seq_len (1/17/128), dim (32/64/128), and various rotary_dim values, plus a gradient flow test.

Implementation details

The torchembed kernel processes (*leading, seq_len, dim) tensors with RotaryEmbedding(use_fused=True), applying Neox-style RoPE via triton. The helper reshapes arbitrary leading dims, calls the kernel, and restores the original shape, so the change is transparent to callers.
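For readers unfamiliar with the Neox-style convention, here is a small pure-Python sketch of the reference formula `t * cos + rotate_half(t) * sin` applied to a single vector. It is an illustration only: `rope_neox` is a hypothetical name, and the frequency base of 10000 is the common RoPE default, assumed here rather than taken from this PR.

```python
import math


def rotate_half(x):
    """Neox-style half rotation: [x1, x2] -> [-x2, x1]."""
    half = len(x) // 2
    return [-v for v in x[half:]] + list(x[:half])


def rope_neox(x, position, base=10000.0):
    """Apply Neox-style RoPE to one vector at one position.

    Element i is paired with element i + dim/2, and each pair is rotated
    by the angle position * base**(-2i/dim). `base` is an assumption.
    """
    dim = len(x)
    if dim % 2 != 0:
        raise ValueError("rotary dimension must be even")
    half = dim // 2
    inv_freq = [base ** (-2.0 * i / dim) for i in range(half)]
    # The same angles are used for both halves of the vector.
    angles = [position * f for f in inv_freq] * 2
    rotated = rotate_half(x)
    return [
        xi * math.cos(a) + ri * math.sin(a)
        for xi, ri, a in zip(x, rotated, angles)
    ]
```

Because each pair undergoes a plain 2-D rotation, the vector norm is unchanged and position 0 leaves the input as it is, which makes both properties convenient sanity checks for any fused backend.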

Testing

pytest tests/unit/sequence/test_apply_rotary_pos_emb.py -v

- Add try/except ImportError guard for torchembed in sequence/layer.py
- Dispatch to fused triton kernel from apply_rotary_pos_emb() when
  torchembed is installed and tensor is on CUDA
- Add torchembed extras entry in setup.py
- Add tests: numerical correctness vs reference, gradient flow

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3855dbb0c2

t = (t * freqs_cos) + (_rotate_half(t) * freqs_sin)
if _torchembed_available and t.is_cuda and t.device.type == 'cuda' and rot_dim % 2 == 0:
    orig_shape = t.shape
    t_2d = t.reshape(-1, orig_shape[-2], rot_dim)

P1: Preserve the actual sequence dimension for fused RoPE

When torchembed is installed and the tensor is on CUDA, this reshapes every input as if the sequence length were orig_shape[-2]. For [seq,b,heads,dim] or [b,seq,heads,dim] tensors, that axis is the number of heads, not the sequence. The existing function contract says t is [seq_length, ..., dim], and callers such as fpdt_layer.py also pass [b,l,nh,hd]. The fused kernel therefore receives a fake sequence length equal to the number of heads while freqs_cos/freqs_sin still describe the real sequence length, which produces wrong rotations or shape/indexing failures, and only in the optional fused path.
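The shape problem can be illustrated without a GPU by tracking shapes alone. The sketch below is hypothetical: `fused_view` is not part of the PR, it assumes rot_dim equals the last dimension, and it only computes the 3-D shape a (rows, seq_len, dim) kernel would see.

```python
from math import prod


def fused_view(shape, seq_axis=None):
    """Return the (rows, seq_len, dim) shape handed to a 3-D fused kernel.

    seq_axis=None mimics t.reshape(-1, orig_shape[-2], rot_dim), which
    treats shape[-2] as the sequence axis. Passing seq_axis first moves
    that axis next to the feature axis; on a tensor this would correspond
    to t.movedim(seq_axis, -2) before reshaping.
    """
    shape = tuple(shape)
    if seq_axis is not None:
        seq_axis %= len(shape)
        others = shape[:seq_axis] + shape[seq_axis + 1:-1]
        shape = others + (shape[seq_axis], shape[-1])
    *leading, seq_len, dim = shape
    return (prod(leading), seq_len, dim)


# [b, l, nh, hd] layout: batch 2, sequence 128, 8 heads, head size 64.
naive = fused_view((2, 128, 8, 64))              # (256, 8, 64)
fixed = fused_view((2, 128, 8, 64), seq_axis=1)  # (16, 128, 64)
```

In the naive view the kernel sees a sequence length of 8 (the head count) while the cos/sin tables cover 128 positions. One possible fix is to move the real sequence axis to position -2 before flattening and undo the move afterwards; whether that is cheaper than falling back to the reference path for such layouts would need measuring.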
