Add optional torchembed RoPE backend to apply_rotary_pos_emb by py-ai-dev · Pull Request #8052 · deepspeedai/DeepSpeed

py-ai-dev · 2026-06-07T21:35:31Z

Adds torchembed as an optional fused RoPE backend for deepspeed.sequence.layer.apply_rotary_pos_emb(), following the same pattern used in transformers and vLLM.

Changes

deepspeed/sequence/layer.py: Add try/except ImportError guard for torchembed._triton.fused_rope_forward. When torchembed is installed, the tensor is on CUDA, and rotary_dim is even, the function dispatches to the fused triton kernel instead of the PyTorch reference path.
setup.py: Add torchembed extras key (pip install deepspeed[torchembed]).
tests/unit/sequence/test_apply_rotary_pos_emb.py: Add numerical correctness tests against the PyTorch reference across seq_len (1/17/128), dim (32/64/128), and several rotary_dim values, plus a gradient flow test.

Implementation details

The torchembed kernel processes (*leading, seq_len, dim) tensors with RotaryEmbedding(use_fused=True), applying NeoX-style RoPE via Triton. The helper reshapes arbitrary leading dims, calls the kernel, and restores the original shape, so the change is transparent to callers.
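
For orientation, the NeoX-style reference computation that a fused kernel has to match can be sketched in plain Python for a single vector. The function names, the frequency base of 10000, and the list-based representation are assumptions made for illustration; the actual reference path operates on PyTorch tensors.

```python
import math


def rotate_half(x):
    """NeoX-style rotation: split into halves (x1, x2) and return (-x2, x1)."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_rope_reference(x, cos, sin, rot_dim):
    """Rotate the first rot_dim features (rot_dim must be even); pass the rest through."""
    head, tail = x[:rot_dim], x[rot_dim:]
    rotated = rotate_half(head)
    out = [h * c + r * s for h, r, c, s in zip(head, rotated, cos, sin)]
    return out + tail


def neox_cos_sin(position, rot_dim, base=10000.0):
    """cos/sin values for one position, with frequencies repeated across both halves."""
    half = rot_dim // 2
    angles = [position * base ** (-2.0 * i / rot_dim) for i in range(half)]
    angles = angles + angles
    return [math.cos(a) for a in angles], [math.sin(a) for a in angles]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cos, sin = neox_cos_sin(position=3, rot_dim=4)
y = apply_rope_reference(x, cos, sin, rot_dim=4)
print(y)  # the last two features (5.0, 6.0) are passed through unchanged
```

Each pair (x[i], x[i + rot_dim/2]) is rotated by the same angle, which is why the sketch requires an even rot_dim and why the rotated part keeps its norm.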

Testing

pytest tests/unit/sequence/test_apply_rotary_pos_emb.py -v

- Add try/except ImportError guard for torchembed in sequence/layer.py
- Dispatch to fused triton kernel from apply_rotary_pos_emb() when torchembed is installed and tensor is on CUDA
- Add torchembed extras entry in setup.py
- Add tests: numerical correctness vs reference, gradient flow

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3855dbb0c2


chatgpt-codex-connector · 2026-06-07T21:37:11Z

-    t = (t * freqs_cos) + (_rotate_half(t) * freqs_sin)
+    if _torchembed_available and t.is_cuda and t.device.type == 'cuda' and rot_dim % 2 == 0:
+        orig_shape = t.shape
+        t_2d = t.reshape(-1, orig_shape[-2], rot_dim)


Preserve the actual sequence dimension for fused RoPE

When torchembed is installed and the tensor is on CUDA, this reshapes every input as if the sequence length were orig_shape[-2], which is the heads axis for [seq,b,heads,dim] or [b,seq,heads,dim] tensors. The existing function contract says t is [seq_length, ..., dim], and callers such as fpdt_layer.py also pass [b,l,nh,hd]. The fused kernel therefore receives a fake sequence length equal to the number of heads, while freqs_cos/sin still describe the real sequence length. The result is wrong rotations or shape/indexing failures, and only in the optional fused path.
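
The shape arithmetic behind this comment can be checked without a GPU. The sketch below assumes t has already been sliced to rot_dim in its last axis; both helper names are hypothetical, and the axis permutation is only one possible way to put the sequence axis where the reshape expects it, not the fix adopted in the PR.

```python
from math import prod


def fused_view_shape(shape):
    """Shape produced by t.reshape(-1, shape[-2], shape[-1]) as in the quoted diff."""
    return (prod(shape) // (shape[-2] * shape[-1]), shape[-2], shape[-1])


def seq_to_penultimate(ndim, seq_axis):
    """Axis order that moves seq_axis to position -2 and keeps the feature axis last."""
    last = ndim - 1
    others = [a for a in range(ndim) if a not in (seq_axis, last)]
    return others + [seq_axis, last]


shape = (128, 2, 8, 64)  # [seq, batch, heads, dim]

# As written in the diff, the middle axis becomes heads (8), not seq (128).
print(fused_view_shape(shape))  # (256, 8, 64)

# Permuting first puts the real sequence length in the middle axis.
order = seq_to_penultimate(len(shape), seq_axis=0)
permuted = tuple(shape[a] for a in order)
print(order, permuted, fused_view_shape(permuted))  # [1, 2, 0, 3] (2, 8, 128, 64) (16, 128, 64)
```

For the [b,l,nh,hd] layout the same helper would be called with seq_axis=1; in either case the output would need the inverse permutation to restore the caller's layout.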

py-ai-dev requested review from loadams, tjruwase and tohtana as code owners June 7, 2026 21:35

chatgpt-codex-connector Bot reviewed Jun 7, 2026

py-ai-dev wants to merge 1 commit into deepspeedai:master from py-ai-dev:add-torchembed-rope-backend