Conversation
```cpp
namespace scan
{

template<class Config, class BinOp, bool ForwardProgressGuarantees, class device_capabilities=void>
```
We should make a fake device feature called `forwardProgressGuarantees` which is basically always false.
```cpp
template<typename T> // only uint32_t or uint64_t for now?
struct Constants
{
    NBL_CONSTEXPR_STATIC_INLINE T NOT_READY = 0;
    NBL_CONSTEXPR_STATIC_INLINE T LOCAL_COUNT = T(0x1u) << (sizeof(T)*8-2);
    NBL_CONSTEXPR_STATIC_INLINE T GLOBAL_COUNT = T(0x1u) << (sizeof(T)*8-1);
    NBL_CONSTEXPR_STATIC_INLINE T STATUS_MASK = LOCAL_COUNT | GLOBAL_COUNT;
};
```
Btw, you can use `enum class` if you update DXC, also with `: uint16_t` or `: uint64_t`; not everything needs to be a crazy template.
```cpp
scalar_t __call(NBL_REF_ARG(DataAccessor) dataAccessor, NBL_REF_ARG(ScratchAccessor) sharedMemScratchAccessor)
{
    const scalar_t localReduction = workgroup_reduce_t::__call<DataAccessor, ScratchAccessor>(dataAccessor, sharedMemScratchAccessor);
    bda::__ptr<T> scratch = dataAccessor.getScratchPtr(); // scratch data should be at least T[NumWorkgroups]
```
Ask for a separate accessor for the workgroup flags.
Also, 32-bit atomics are always present and probably the cheapest; there's no harm in packing 16 workgroup flags into the same uint32_t, but please benchmark whether it's a net gain (the word is more likely to sit in the GPU's L2 cache) or a loss (contention).
Aaah, that might not be compatible with using UMax though.
```cpp
template<class ReadOnlyDataAccessor, class ScratchAccessor NBL_FUNC_REQUIRES(workgroup2::ArithmeticReadOnlyDataAccessor<ReadOnlyDataAccessor,scalar_t> && workgroup2::ArithmeticSharedMemoryAccessor<ScratchAccessor,scalar_t>)
static scalar_t __call(NBL_REF_ARG(ReadOnlyDataAccessor) dataAccessor, NBL_REF_ARG(ScratchAccessor) sharedMemScratchAccessor)
```
You need a 3rd accessor which is an atomic accessor (both to accumulate the result and to figure out when everyone is done).
```cpp
// get last item from scratch
const uint32_t lastWorkgroup = glsl::gl_NumWorkGroups().x - 1;
bda::__ref<scalar_t> scratchLast = (scratch + lastWorkgroup).deref();
scalar_t value = constants_t::NOT_READY;
if (lastInvocation)
{
    // wait until last workgroup does reduction
    while (!(value & constants_t::GLOBAL_COUNT))
    {
        // value = spirv::atomicLoad(scratchLast.__get_spv_ptr(), spv::ScopeWorkgroup, spv::MemorySemanticsAcquireMask);
        value = spirv::atomicIAdd(scratchLast.__get_spv_ptr(), spv::ScopeWorkgroup, spv::MemorySemanticsAcquireMask, 0u);
    }
}
value = workgroup::Broadcast(value, sharedMemScratchAccessor, Config::WorkgroupSize-1);
return value & (~constants_t::STATUS_MASK);
```
This won't work even with forward progress guarantees; you just need to let the workgroup quit.
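One possible shape of "let the workgroup quit", sketched as untested pseudocode in the PR's own idiom (the sentinel-return convention is invented for illustration): poll once instead of spinning, and bail out so the caller can decide how to retry.

```cpp
// hypothetical: a single poll instead of a spin loop
scalar_t value = constants_t::NOT_READY;
if (lastInvocation)
    value = spirv::atomicIAdd(scratchLast.__get_spv_ptr(), spv::ScopeDevice, spv::MemorySemanticsAcquireMask, 0u);
value = workgroup::Broadcast(value, sharedMemScratchAccessor, Config::WorkgroupSize-1);
if (!(value & constants_t::GLOBAL_COUNT))
    return constants_t::NOT_READY; // not ready yet: quit, let the caller retry or re-dispatch
return value & (~constants_t::STATUS_MASK);
```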
```cpp
template<class DataAccessor, class ScratchAccessor>
scalar_t __call(NBL_REF_ARG(DataAccessor) dataAccessor, NBL_REF_ARG(ScratchAccessor) sharedMemScratchAccessor)
```
You have the read-only accessor to get your element, and you have the scratch memory accessor (for the workgroup scans/reductions), but you don't have:
- an accessor for the Device-Scope scratch
- an accessor for where to store the reduction result (a reduction is special compared to a scan: you can't get the result right away)
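A hypothetical shape of the signature with the missing accessors added (all the new template parameter and argument names are invented for illustration, not from the PR):

```cpp
// hypothetical signature, names illustrative
template<class ReadOnlyDataAccessor, class ScratchAccessor, class DeviceScratchAccessor, class OutputAccessor>
static scalar_t __call(
    NBL_REF_ARG(ReadOnlyDataAccessor) dataAccessor,           // per-element input
    NBL_REF_ARG(ScratchAccessor) sharedMemScratchAccessor,    // workgroup scan/reduce scratch
    NBL_REF_ARG(DeviceScratchAccessor) deviceScratchAccessor, // Device-Scope scratch + flags (atomic)
    NBL_REF_ARG(OutputAccessor) outputAccessor                // where the reduction result is stored
);
```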
```cpp
if (lastInvocation)
{
    bda::__ref<scalar_t> scratchId = (scratch + glsl::gl_WorkGroupID().x).deref();
    spirv::atomicUMax(scratchId.__get_spv_ptr(), spv::ScopeWorkgroup, spv::MemorySemanticsReleaseMask, localReduction|constants_t::LOCAL_COUNT);
```
You want to separate the storage of the reduction from the flags, I think.
```cpp
for (uint32_t i = 1; i <= glsl::gl_WorkGroupID().x; i++)
{
    const uint32_t prevID = glsl::gl_WorkGroupID().x-i;
```
Don't use `gl_WorkGroupID`; ask for the `virtualWorkgroupIndex` in the function call.
```cpp
// value = spirv::atomicLoad(scratchPrev.__get_spv_ptr(), spv::ScopeWorkgroup, spv::MemorySemanticsAcquireMask);
value = spirv::atomicIAdd(scratchPrev.__get_spv_ptr(), spv::ScopeWorkgroup, spv::MemorySemanticsAcquireMask, 0u);
```
You'll have multiple workgroups doing this, so you'll mess up the results; you want to accumulate locally here, in a register.
So while you're walking backwards, you only do `prefix += atomicLoad(scratchPrev.__get_spv_ptr(), spv::ScopeDevice, MakeVisible);`.
Also, this requires that you have two different scratch store locations:
- one for the local reduction
- one for the global reduction
If you keep the global and the local at the same address, you get nasty data races: because the status is a flag and not a mutex, you can overwrite a local result with a global one while another workgroup is reading it, and that workgroup thinks the value it read is a local result because the flag is not updated yet.
P.S. You probably don't want to be writing out the GLOBAL results and updating status flags here, even though you can, because the workgroups before you are obviously "just about" to write out their results anyway, and this way you'd just introduce more uncached memory traffic.
```cpp
if (lastInvocation) // don't make whole block work and do busy stuff
{
    // for (uint32_t prevID = glsl::gl_WorkGroupID().x-1; prevID >= 0u; prevID--) // won't run properly this way for some reason, results in device lost
    for (uint32_t i = 1; i <= glsl::gl_WorkGroupID().x; i++)
```
Actually, using the whole workgroup, or at least a subgroup (benchmark it), would be much faster here: each invocation checks one workgroup, and you can use `workgroup2::reduce` with a Max binop to find the closest preceding workgroup with a ready GLOBAL scan value.
You'd also be able to accumulate the prefix faster over the ones which have LOCAL ready.
Description
Implementation of Block Chain Scan
Testing
Example XXX
TODO list:
We'll let @kpentaris review.