Conversation
```cpp
namespace scan
{

template<class Config, class BinOp, bool ForwardProgressGuarantees, class device_capabilities=void>
```
We should make a fake device feature called `forwardProgressGuarantees` which is basically always false.
```cpp
template<typename T> // only uint32_t or uint64_t for now?
struct Constants
{
    NBL_CONSTEXPR_STATIC_INLINE T NOT_READY = 0;
    NBL_CONSTEXPR_STATIC_INLINE T LOCAL_COUNT = T(0x1u) << (sizeof(T)*8-2);
    NBL_CONSTEXPR_STATIC_INLINE T GLOBAL_COUNT = T(0x1u) << (sizeof(T)*8-1);
    NBL_CONSTEXPR_STATIC_INLINE T STATUS_MASK = LOCAL_COUNT | GLOBAL_COUNT;
};
```
Btw, you can use `enum class` if you update DXC, also with `: uint16_t` or `: uint64_t`; not everything needs to be a crazy template.
```cpp
scalar_t __call(NBL_REF_ARG(DataAccessor) dataAccessor, NBL_REF_ARG(ScratchAccessor) sharedMemScratchAccessor)
{
    const scalar_t localReduction = workgroup_reduce_t::__call<DataAccessor, ScratchAccessor>(dataAccessor, sharedMemScratchAccessor);
    bda::__ptr<T> scratch = dataAccessor.getScratchPtr(); // scratch data should be at least T[NumWorkgroups]
```
Ask for a separate accessor for the workgroup flags.
Also, 32-bit atomics are always present and probably the cheapest; there's no harm in packing 16 workgroup flags into the same uint32_t, but please benchmark whether it's a net gain (the word is more likely to sit in the GPU's L2 cache) or a loss (contention).
Aaah, that might not be compatible with using UMax though.
```cpp
template<class ReadOnlyDataAccessor, class ScratchAccessor NBL_FUNC_REQUIRES(workgroup2::ArithmeticReadOnlyDataAccessor<ReadOnlyDataAccessor,scalar_t> && workgroup2::ArithmeticSharedMemoryAccessor<ScratchAccessor,scalar_t>)
static scalar_t __call(NBL_REF_ARG(ReadOnlyDataAccessor) dataAccessor, NBL_REF_ARG(ScratchAccessor) sharedMemScratchAccessor)
```
You need a 3rd accessor which is an atomic accessor (both to accumulate the result and to figure out when everyone is done).
```cpp
// get last item from scratch
const uint32_t lastWorkgroup = glsl::gl_NumWorkGroups().x - 1;
bda::__ref<scalar_t> scratchLast = (scratch + lastWorkgroup).deref();
scalar_t value = constants_t::NOT_READY;
if (lastInvocation)
{
    // wait until last workgroup does reduction
    while (!(value & constants_t::GLOBAL_COUNT))
    {
        // value = spirv::atomicLoad(scratchLast.__get_spv_ptr(), spv::ScopeWorkgroup, spv::MemorySemanticsAcquireMask);
        value = spirv::atomicIAdd(scratchLast.__get_spv_ptr(), spv::ScopeWorkgroup, spv::MemorySemanticsAcquireMask, 0u);
    }
}
value = workgroup::Broadcast(value, sharedMemScratchAccessor, Config::WorkgroupSize-1);
return value & (~constants_t::STATUS_MASK);
```
This won't work even with forward progress guarantees; you just need to let the workgroup quit.
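One possible shape of "let the workgroup quit", sketched as untested pseudocode in the PR's own idiom (the sentinel-return convention is invented for illustration): poll once instead of spinning, and bail out so the caller can decide how to retry.

```cpp
// hypothetical: a single poll instead of a spin loop
scalar_t value = constants_t::NOT_READY;
if (lastInvocation)
    value = spirv::atomicIAdd(scratchLast.__get_spv_ptr(), spv::ScopeDevice, spv::MemorySemanticsAcquireMask, 0u);
value = workgroup::Broadcast(value, sharedMemScratchAccessor, Config::WorkgroupSize-1);
if (!(value & constants_t::GLOBAL_COUNT))
    return constants_t::NOT_READY; // not ready yet: quit, let the caller retry or re-dispatch
return value & (~constants_t::STATUS_MASK);
```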
```cpp
template<class DataAccessor, class ScratchAccessor>
scalar_t __call(NBL_REF_ARG(DataAccessor) dataAccessor, NBL_REF_ARG(ScratchAccessor) sharedMemScratchAccessor)
```
You have the read-only accessor to get your element, and you have the scratch memory accessor (for the workgroup scans/reductions), but you don't have:
- an accessor for the Device-Scope scratch
- an accessor for where to store the reduction result (a reduction is special compared to a scan: you can't get the result right away)
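A hypothetical shape of the signature with the missing accessors added (all the new template parameter and argument names are invented for illustration, not from the PR):

```cpp
// hypothetical signature, names illustrative
template<class ReadOnlyDataAccessor, class ScratchAccessor, class DeviceScratchAccessor, class OutputAccessor>
static scalar_t __call(
    NBL_REF_ARG(ReadOnlyDataAccessor) dataAccessor,           // per-element input
    NBL_REF_ARG(ScratchAccessor) sharedMemScratchAccessor,    // workgroup scan/reduce scratch
    NBL_REF_ARG(DeviceScratchAccessor) deviceScratchAccessor, // Device-Scope scratch + flags (atomic)
    NBL_REF_ARG(OutputAccessor) outputAccessor                // where the reduction result is stored
);
```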
```cpp
if (lastInvocation)
{
    bda::__ref<scalar_t> scratchId = (scratch + glsl::gl_WorkGroupID().x).deref();
    spirv::atomicUMax(scratchId.__get_spv_ptr(), spv::ScopeWorkgroup, spv::MemorySemanticsReleaseMask, localReduction|constants_t::LOCAL_COUNT);
```
You want to separate the storage of the reduction from the flags, I think.
```cpp
for (uint32_t i = 1; i <= glsl::gl_WorkGroupID().x; i++)
{
    const uint32_t prevID = glsl::gl_WorkGroupID().x-i;
```
Don't use `gl_WorkGroupID`; ask for the `virtualWorkgroupIndex` in the function call.
```cpp
// value = spirv::atomicLoad(scratchPrev.__get_spv_ptr(), spv::ScopeWorkgroup, spv::MemorySemanticsAcquireMask);
value = spirv::atomicIAdd(scratchPrev.__get_spv_ptr(), spv::ScopeWorkgroup, spv::MemorySemanticsAcquireMask, 0u);
```
You'll have multiple workgroups doing this, so you'll mess up the results; you want to accumulate locally here, in a register.
So while you're walking backwards, you only do `prefix += atomicLoad(scratchPrev.__get_spv_ptr(), spv::ScopeDevice, MakeVisible);`.
Also, this requires that you have two different scratch store locations:
- one for the local reduction
- one for the global reduction
If you keep the global and the local at the same address, you get nasty data races: because the status is a flag and not a mutex, you can overwrite a local result with a global one while another workgroup is reading it, and that workgroup thinks the value it read is a local result because the flag is not updated yet.
P.S. You probably don't want to be writing out the GLOBAL results and updating status flags here, even though you can, because the workgroups before you are obviously "just about" to write out their results anyway, and this way you'd just introduce more uncached memory traffic.
```cpp
if (lastInvocation) // don't make whole block work and do busy stuff
{
    // for (uint32_t prevID = glsl::gl_WorkGroupID().x-1; prevID >= 0u; prevID--) // won't run properly this way for some reason, results in device lost
    for (uint32_t i = 1; i <= glsl::gl_WorkGroupID().x; i++)
```
Actually, using the whole workgroup, or at least a subgroup (benchmark it), would be much faster here: each invocation checks one workgroup, and you can use `workgroup2::reduce` with a Max binop to find the closest preceding workgroup with a ready GLOBAL scan value.
You'd also be able to accumulate the prefix faster over the ones which have LOCAL ready.
Description
Implementation of Block Chain Scan
Testing
Example XXX
TODO list:
We'll let @kpentaris review.