Add optional torchembed RoPE backend to apply_rotary_pos_emb #8052
py-ai-dev wants to merge 1 commit into
- Add try/except ImportError guard for torchembed in sequence/layer.py
- Dispatch to fused triton kernel from apply_rotary_pos_emb() when torchembed is installed and tensor is on CUDA
- Add torchembed extras entry in setup.py
- Add tests: numerical correctness vs reference, gradient flow
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3855dbb0c2
t = (t * freqs_cos) + (_rotate_half(t) * freqs_sin)
if _torchembed_available and t.is_cuda and t.device.type == 'cuda' and rot_dim % 2 == 0:
    orig_shape = t.shape
    t_2d = t.reshape(-1, orig_shape[-2], rot_dim)
Preserve the actual sequence dimension for fused RoPE
When torchembed is installed and the tensor is on CUDA, this reshapes every input as if the sequence length were orig_shape[-2], which is the heads axis for [seq,b,heads,dim] or [b,seq,heads,dim] tensors. The existing function contract says t is [seq_length, ..., dim], and callers such as fpdt_layer.py also pass [b,l,nh,hd]. The fused kernel therefore receives a fake sequence length equal to the number of heads while freqs_cos/sin still describe the real sequence length, which produces wrong rotations or shape/indexing failures only in the optional fused path.
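To make the shape problem concrete, here is a small pure-Python sketch that computes the shapes involved without needing torch. It assumes t has already been sliced to rot_dim on its last axis. The function names and the `seq_axis` parameter are hypothetical; the second function only illustrates one possible direction (moving the real sequence axis to position -2 before flattening), not a fix proposed in the PR.

```python
from math import prod


def fused_view_shape(shape, rot_dim):
    """Shape that t.reshape(-1, shape[-2], rot_dim) would produce, as in the diff."""
    numel = prod(shape[:-1]) * rot_dim  # elements in the rotary slice
    mid = shape[-2]                     # the axis the diff treats as seq_len
    return (numel // (mid * rot_dim), mid, rot_dim)


def seq_second_to_last_shape(shape, rot_dim, seq_axis):
    """Hypothetical alternative: shape after moving the real sequence axis to -2."""
    seq_len = shape[seq_axis]
    leading = prod(shape[:-1]) // seq_len  # all non-sequence, non-feature axes flattened
    return (leading, seq_len, rot_dim)


seq, batch, heads, dim = 128, 2, 16, 64

# [seq, b, heads, dim]: the middle axis ends up as heads (16), not seq (128).
print(fused_view_shape((seq, batch, heads, dim), dim))  # (256, 16, 64)

# With the sequence axis identified explicitly, the middle axis is the real seq_len.
print(seq_second_to_last_shape((seq, batch, heads, dim), dim, seq_axis=0))  # (32, 128, 64)
```

The same mismatch appears for the [b,l,nh,hd] layout: the middle axis of the flattened view is still nh, so the function would need to know which axis carries the sequence before handing the tensor to a kernel that expects (*leading, seq_len, dim).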
Adds torchembed as an optional fused RoPE backend for deepspeed.sequence.layer.apply_rotary_pos_emb(), following the same pattern used in transformers and vLLM.
Changes
- deepspeed/sequence/layer.py: Add try/except ImportError guard for torchembed._triton.fused_rope_forward. When torchembed is installed, the tensor is on CUDA, and rotary_dim is even, the function dispatches to the fused triton kernel instead of the PyTorch reference path.
- setup.py: Add torchembed extras key (pip install deepspeed[torchembed]).
- tests/unit/sequence/test_apply_rotary_pos_emb.py: Numerical correctness vs PyTorch reference across seq_len (1/17/128), dim (32/64/128), and various rotary_dim. Gradient flow test.
Implementation details
The torchembed kernel processes (*leading, seq_len, dim) tensors with RotaryEmbedding(use_fused=True), applying Neox-style RoPE via triton. The helper reshapes arbitrary leading dims, calls the kernel, and restores the original shape, so the change is transparent to callers.
Testing
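As a rough illustration of what a reference check for Neox-style RoPE looks like, the sketch below applies the rotate-half formula from the diff, `t * cos + rotate_half(t) * sin`, to a single vector in pure Python. It is not the PR's test file: the function names are hypothetical, the frequency base of 10000 is an assumed common default, and a real test would compare tensors from the fused path against the PyTorch reference.

```python
import math


def rotate_half(x):
    """Neox-style rotation helper: (x1, x2) -> (-x2, x1), split at the midpoint."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def rope_reference(x, pos, base=10000.0):
    """Apply RoPE to one even-length vector at position pos (base is an assumed default)."""
    dim = len(x)
    half = dim // 2
    angles = [pos * base ** (-2.0 * i / dim) for i in range(half)]
    # Neox style repeats the same angles for the first and second half.
    cos = [math.cos(a) for a in angles] * 2
    sin = [math.sin(a) for a in angles] * 2
    rot = rotate_half(x)
    return [x[i] * cos[i] + rot[i] * sin[i] for i in range(dim)]


def norm(v):
    return math.sqrt(sum(a * a for a in v))


x = [1.0, 2.0, 3.0, 4.0]
y = rope_reference(x, pos=17)
# RoPE rotates each (x[i], x[i + half]) pair, so the vector norm is preserved.
print(math.isclose(norm(x), norm(y), rel_tol=1e-12))
```

Two properties make convenient sanity checks for any backend: position 0 leaves the input unchanged, and the norm of each vector is preserved at every position.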