Conversation
```diff
 parser.add_argument("--seed", type=int, default=1234, help="RNG seed.")
 parser.add_argument(
-    "--fp8", action="store_true", default=False, help="Enables the te.fp8_autocast() context."
+    "--fp8", action="store_true", default=False, help="Enables the te.autocast() context."
```

Up to TE v2.8, I think it is still `fp8_autocast`. Were you targeting a higher version?

You had a few comments on this, so I will address it here quickly. I moved the UB code up to release 2.10, since NV fixed a few bugs and inefficiencies there. Most of the changes that are not guarded in these files are NV upstream changes.
I am fixing up the `te_layer_with_overlap` differences and working on integrating the benchmark script into the file directly.
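As a quick sanity check of the flag semantics in the hunk above, a minimal standalone sketch (plain `argparse`, no Transformer Engine dependency):

```python
import argparse

# Mirror of the --fp8 flag from the diff above: store_true with an
# explicit default of False, so FP8 stays off unless the flag is passed.
parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=1234, help="RNG seed.")
parser.add_argument(
    "--fp8", action="store_true", default=False, help="Enables the FP8 autocast context."
)

args = parser.parse_args([])          # no flags: fp8 stays False
print(args.fp8, args.seed)            # → False 1234

args = parser.parse_args(["--fp8"])   # flag present: fp8 becomes True
print(args.fp8)                       # → True
```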
```diff
 # This file was modified for portability to AMDGPU
 # Copyright (c) 2025-2026, Advanced Micro Devices, Inc. All rights reserved.
 # Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

Does this file share a lot of code with examples/pytorch/comm_gemm_overlap/te_layer_with_overlap.py? Is it possible to consolidate the two files?
```diff
 import transformer_engine.pytorch.cpp_extensions as tex
 from transformer_engine.pytorch.fp8 import FP8GlobalStateManager
+from transformer_engine.jax.cpp_extensions.misc import is_hip_extension
```

Let's not import JAX-specific code on the PyTorch side. Use this instead:

Good catch, this is a mistake. Will fix.
```diff
-if (_ub_comm->myrank == 0) printf("!!! [UB] Register UBuf %d\n", _ub_reg);
+if (_ub_comm->myrank == 0) {
+  printf("!!! [UB] Register UBuf %d\n", _ub_reg);
+}
```

I would prefer aligning the coding style with NV upstream so it's easier for us to maintain/IFU later.
```diff
 allgather_handle, barrier_handle, tp_size, num_max_streams, comm_cga_size,
 gemm_priority, comm_priority, num_comm_sm, set_sm_margin, use_ce,
 atomic_gemm) {
+  initialize(buffer_shape, buffer_dtype, comm_type, aggregate);
```

Same question here about the motivation for calling this initialize function in the constructor.
Resolved (outdated) review comment on transformer_engine/common/comm_gemm_overlap/comm_gemm_overlap.cpp.
```diff
 NVTE_CHECK_CUDA(cudaMemset((*comm)->flags_baseptr, 0, 2 * GPU_PAGE_SIZE));
 (*comm)->flags = reinterpret_cast<int *>(
 #ifdef __HIP_PLATFORM_AMD__
     (reinterpret_cast<uintptr_t>((*comm)->flags) + GPU_PAGE_SIZE - 1) & GPU_PAGE_MASK);
```

Should this be `(*comm)->flags_baseptr`, as in the NV upstream below? `(*comm)->flags` is not allocated/assigned at this point.
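The intent of the expression under discussion is round-up page alignment. A standalone sketch of the arithmetic, with plain Python integers; the constant values here are assumptions for illustration, mirroring the usual `mask = ~(page_size - 1)` convention:

```python
# Hypothetical values for illustration; the real constants live in userbuffers.
GPU_PAGE_SIZE = 1 << 16                 # assume a 64 KiB GPU page
GPU_PAGE_MASK = ~(GPU_PAGE_SIZE - 1)    # clears the low address bits

def page_align_up(addr: int) -> int:
    """Round addr up to the next GPU page boundary."""
    return (addr + GPU_PAGE_SIZE - 1) & GPU_PAGE_MASK

print(hex(page_align_up(0x12345)))   # → 0x20000 (rounded up into the next page)
print(hex(page_align_up(0x20000)))   # → 0x20000 (already aligned: unchanged)
```

The review point is that the value being aligned must be the freshly memset `flags_baseptr`, not the not-yet-assigned `flags`; the arithmetic itself is fine.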
```diff
 __syncthreads();
-if (threadIdx.x == 0) __threadfence_system();
+if (threadIdx.x == 0) __threadfence();
```

It looks like `__threadfence_system()` is now supported in ROCm 7.2: https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#memory-fence-instructions
```diff
 void userbuffers_send(const int srchandler, const size_t srcoffset, const int dsthandler,
                       const size_t dstoffset, const size_t bytes, communicator *comm,
-                      const int peer, cudaStream_t stream) {
+                      const int peer, cudaStream_t stream, int ring_id) {
```

Hmm, my question then would be: why does NV upstream not need a `ring_id`? Is it because we have a different implementation, i.e. `NVTE_ROCM_MAX_RINGS`?
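For context on what a `ring_id` could select, a purely hypothetical sketch (not the actual TE implementation) of fanning chunked sends out over several rings, with `MAX_RINGS` standing in for something like `NVTE_ROCM_MAX_RINGS`:

```python
MAX_RINGS = 4  # hypothetical stand-in for NVTE_ROCM_MAX_RINGS

def assign_rings(num_chunks: int, max_rings: int = MAX_RINGS) -> list:
    """Round-robin chunks over the available rings so independent
    sends can proceed concurrently instead of serializing on one ring."""
    return [chunk % max_rings for chunk in range(num_chunks)]

print(assign_rings(6))  # → [0, 1, 2, 3, 0, 1]
```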
```diff
 _comm_priority = comm_priority;
 }
-for (int i = 0; i < std::min(num_max_streams, num_splits); i++) {
+for (int i = 0; i < std::max(num_max_streams, num_splits); i++) {
```

In fact, do we need more streams than the min of `num_max_streams` and `num_splits`?
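The difference matters: with `std::max` the loop can create more streams than there are splits to run on them, or more than the configured cap allows. A small numeric sketch with hypothetical values:

```python
num_max_streams, num_splits = 3, 8  # hypothetical values

# std::max would allocate one stream per split, ignoring the cap:
print(max(num_max_streams, num_splits))  # → 8 streams

# std::min respects both the cap and the actual number of splits:
print(min(num_max_streams, num_splits))  # → 3 streams

# With num_splits below the cap, min also avoids creating idle streams:
print(min(num_max_streams, 2))           # → 2 streams
```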
```diff
 NVTE_DIM_CHECK(chunk_height > 0 && chunk_width > 0, "Attempted to get empty tensor chunk");
 NVTE_DIM_CHECK(chunk_height <= height && chunk_width <= width,
                "Attempted to get out-of-bounds tensor chunk");
+#ifndef __HIP_PLATFORM_AMD__
```

Since we already support MXFP8, add a TODO comment so we don't forget to enable it later.
```diff
 // Input data
 const size_t source_size = source.numel();
 const void *src_ptr = (rowwise) ? source.dptr() : source.columnwise_dptr();
```

What if we need both row-wise and column-wise data? And what about the other fields of a tensor, for example scale_inv?
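To make the concern concrete, a hypothetical sketch (field and function names invented for illustration, not the TE C++ API) of a tensor that carries both layouts plus their scale-inverses, where selecting only one pointer drops the rest:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QuantizedTensor:
    # Hypothetical fields mirroring the review concern: a quantized
    # tensor may carry both layouts and a scale-inverse per layout.
    rowwise_data: Optional[bytes]
    columnwise_data: Optional[bytes]
    rowwise_scale_inv: Optional[bytes]
    columnwise_scale_inv: Optional[bytes]

def select_payload(t: QuantizedTensor, rowwise: bool):
    """Mimics `(rowwise) ? source.dptr() : source.columnwise_dptr()`:
    exactly one data field is forwarded, and scale_inv is ignored entirely."""
    return t.rowwise_data if rowwise else t.columnwise_data

t = QuantizedTensor(b"row", b"col", b"row_si", b"col_si")
print(select_payload(t, rowwise=True))  # → b'row' (col data and both scale_invs dropped)
```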
| "num_sm": 1 if method == "ring_exchange" else 16, | ||
| "cga_size": 1 if method == "ring_exchange" else 2, | ||
| "set_sm_margin": not method == "ring_exchange", | ||
| "set_sm_margin": not method == "ring_exchange" and not IS_HIP_EXTENSION, |
There was a problem hiding this comment.
Ilya already had the sm_margin feature supported on rocm
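The effect of the extra clause can be checked standalone (`IS_HIP_EXTENSION` is hardcoded here purely for illustration):

```python
IS_HIP_EXTENSION = True  # pretend we are on ROCm, for illustration only

def default_cfg(method: str) -> dict:
    return {
        "num_sm": 1 if method == "ring_exchange" else 16,
        "cga_size": 1 if method == "ring_exchange" else 2,
        # The diff's extra clause force-disables sm_margin on ROCm,
        # which the review says is no longer necessary:
        "set_sm_margin": method != "ring_exchange" and not IS_HIP_EXTENSION,
    }

print(default_cfg("pipeline"))       # set_sm_margin is False even for non-ring methods
print(default_cfg("ring_exchange"))  # set_sm_margin is False either way
```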
```diff
 if IS_HIP_EXTENSION and user_ub_cfg is not None:
     for name, cfg in user_ub_cfg.items():
         assert cfg.get("method") != "bulk", (
             f"Bulk overlap method for '{name}' is not supported on HIP/ROCm. "
```

I recall we supported bulk overlap, but the performance was not great?
| "<nvtx3/nvToolsExt.h>" : "<roctracer/roctx.h>", | ||
| "cudaFuncSetAttribute(" : "hipFuncSetAttribute((const void*)" | ||
| "cudaFuncSetAttribute(" : "hipFuncSetAttribute((const void*)", | ||
| "cudaLaunchKernel": "hipLaunchKernel", |
There was a problem hiding this comment.
cudaLaunchKernel cannot be hipified?
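The mapping above reads like a plain string-substitution table. A minimal sketch of how such a table can be applied (illustrative only; the project's real hipify tooling is more involved):

```python
# Substitution table copied from the diff above.
SUBS = {
    "<nvtx3/nvToolsExt.h>": "<roctracer/roctx.h>",
    "cudaFuncSetAttribute(": "hipFuncSetAttribute((const void*)",
    "cudaLaunchKernel": "hipLaunchKernel",
}

def hipify_line(line: str) -> str:
    """Apply each textual substitution in order."""
    for cuda_sym, hip_sym in SUBS.items():
        line = line.replace(cuda_sym, hip_sym)
    return line

print(hipify_line("cudaLaunchKernel(kernel, grid, block, args, 0, stream);"))
# → hipLaunchKernel(kernel, grid, block, args, 0, stream);
```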
This is the userbuffer_epic branch, to be merged only once all epic tasks have been completed. PRs for epic tasks will target this branch.