Improve DLPack support for external tensor consumption #6261
JanuszL wants to merge 22 commits into NVIDIA:main from
Conversation
895265b to e432f38
!build
CI MESSAGE: [46421794]: BUILD STARTED
Greptile Summary
This PR improves DLPack support across DALI's dynamic mode by adding C++ bulk constructors (from_dlpack_list on TensorListGPU and TensorListCPU) together with native and DLPack fast paths in Batch.__init__.
Confidence Score: 3/5
Safe to merge after fixing the GPU DLPack dtype-check ordering; all other paths are correctly implemented. One confirmed P1 bug: GPU DLPack capsules are consumed by from_dlpack_list before the dtype compatibility check, so a dtype mismatch silently falls through to the slow path with exhausted capsules. This relies on PyTorch-specific behaviour and breaks for any strict DLPack producer. Score is 3 (below the P1 ceiling of 4) because the affected code path is exercised by the new test suite but the test itself relies on the same PyTorch leniency that hides the bug.
dali/python/nvidia/dali/experimental/dynamic/_batch.py — GPU DLPack fast path dtype-guard ordering (lines 534–553)
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["Batch.__init__(tensors=list)"] --> B{list empty?}
B -- yes --> SLOW
B -- no --> C{First elem is ndd.Tensor with _storage?}
C -- yes --> D["Native DALI fast path"]
D --> D1{Same type and device?}
D1 -- yes --> D2["fast_path_used=True"]
D1 -- no --> E
C -- no --> E{Has __dlpack_device__?}
E -- yes --> F{dev_type == GPU?}
F -- yes --> G["GPU DLPack: from_dlpack_list — capsules consumed"]
G --> G1{"dtype match? checked AFTER consume"}
G1 -- yes --> G2["fast_path_used=True"]
G1 -- no --> SLOW
F -- no --> H{dev_type == CPU?}
H -- yes --> I["CPU DLPack: from_dlpack_list"]
I --> I1{success?}
I1 -- yes --> I2["fast_path_used=True"]
I1 -- no --> SLOW
E -- no --> SLOW
H -- no --> SLOW
SLOW["Slow path: per-sample Tensor loop"]
SLOW --> J["Result: Batch"]
D2 --> J
G2 --> J
I2 --> J
e432f38 to 4b868b5
@greptileai - can you re-review?
d1c2614 to c9793e7
!build
CI MESSAGE: [46425910]: BUILD STARTED
c9793e7 to 50843aa
@greptileai - can you re-review?
CI MESSAGE: [46425910]: BUILD PASSED
61611fe to b7a90c8
!build
CI MESSAGE: [46436927]: BUILD STARTED
@greptileai - can you re-review?
CI MESSAGE: [46436927]: BUILD PASSED
!build
CI MESSAGE: [48976609]: BUILD STARTED
CI MESSAGE: [48976609]: BUILD FAILED
!build
CI MESSAGE: [49016406]: BUILD STARTED
CI MESSAGE: [49016406]: BUILD PASSED
- Adds C++ bulk DLPack constructor for TensorListGPU: accepts a Python
list of DLPack-compatible objects (e.g. PyTorch GPU tensors) and
builds the TensorList in a single pass, recording a CUDA event on
the provided stream. Passes `stream` as a keyword argument to
`__dlpack__()` for compatibility with NumPy ≥ 1.22 and JAX which
define `def __dlpack__(self, *, stream=None)`.
- Adds native DALI fast path in Batch.__init__: when given a list of
already-evaluated ndd.Tensor objects, pass their _storage objects
directly to TensorListGPU/CPU constructors, preserving all DALI
metadata (layout, enum types) without going through DLPack.
- Adds GPU DLPack fast path in Batch.__init__: when given a list of
external GPU tensors (e.g. PyTorch) that support DLPack, use the
new C++ bulk constructor to avoid per-tensor Python overhead.
- Adds DLPack fallback for CPU read-only arrays in Tensor.__init__:
catch BufferError from __dlpack__() and fall back to __array__
interface. This makes the following work:
arr = np.array([1, 2, 3])
arr.flags.writeable = False
ndd.as_tensor(arr) # previously raised BufferError
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
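As a minimal sketch of the read-only fallback described in the last bullet above: the helper name cpu_buffer_for_dali is hypothetical, the real logic lives in Tensor.__init__, and a later commit in this PR hardens this fallback to copy read-only buffers.

import numpy as np

def cpu_buffer_for_dali(data):
    # Hypothetical helper, not DALI's actual Tensor.__init__ code.
    try:
        return data.__dlpack__()        # preferred zero-copy DLPack handshake
    except BufferError:
        # Read-only arrays (arr.flags.writeable = False) refuse DLPack;
        # fall back to the __array__ interface so construction still succeeds.
        return np.asarray(data)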
- Rename TensorListGPU DLPack list constructor to from_dlpack_list static method to avoid ambiguity with the list-of-TensorGPU constructor
- Add TensorListFromListOfDLPackObjectsCPU and TensorListCPU.from_dlpack_list for CPU DLPack tensors; extend _batch.py fast path to kDLCPU
- Fix stream_handle assignment to only fall back to raw stream object when it is an integer, avoiding forwarding non-integer stream wrappers to __dlpack__
- Add GPU DLPack fallback to __cuda_array_interface__ in _tensor.py when __dlpack__ raises BufferError or TypeError
- Drop underscore prefix from all fast-path locals; rename _backend_type to BackendType and _stream to cuda_stream
- Remove overly conservative dtype is None guard from both fast-path outer conditions; check dtype per-tensor in the native loop, post-call in DLPack
- Add per-tensor GPU device-ID consistency check inside the native fast-path loop using first_dev_id; break early on device mismatch
- Replace _native_valid boolean flag with for-else idiom
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
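The last two bullets combine into the pattern sketched below; this is a hedged illustration only, with tensors, expected_dtype, device_id and the _storage access modelled on the description above rather than copied from _batch.py.

def try_native_fast_path(tensors, expected_dtype):
    # Hedged sketch of the for-else idiom; not the actual _batch.py code.
    storages = []
    first_dev_id = tensors[0].device_id
    for t in tensors:
        if t.dtype != expected_dtype or t.device_id != first_dev_id:
            break                       # any mismatch abandons the fast path
        storages.append(t._storage)
    else:
        return storages                 # loop finished: fast path is valid
    return None                         # caller falls back to the slow path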
Catch ValueError (mixed-device C++ error from from_dlpack_list) and BufferError (failed __dlpack__ handshake) in addition to TypeError, so both GPU and CPU DLPack fast paths fall back to per-sample Tensor construction instead of propagating errors to the caller.
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
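Schematically, the batch-level fallback now behaves like the hedged sketch below; from_dlpack_list stands in for TensorListGPU.from_dlpack_list / TensorListCPU.from_dlpack_list (whose exact argument lists may differ) and build_slow_path for the per-sample Tensor loop.

def build_batch_fast(from_dlpack_list, objs, build_slow_path):
    # Hedged sketch of the control flow only, not the actual Batch.__init__ code.
    try:
        return from_dlpack_list(objs), True     # fast_path_used = True
    except (TypeError, ValueError, BufferError):
        # TypeError: producer lacks (usable) DLPack support;
        # ValueError: mixed devices reported by the C++ constructor;
        # BufferError: failed __dlpack__ handshake.
        # A later commit in this PR removes BufferError from the GPU clause.
        return build_slow_path(objs), False     # fall back to the slow path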
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
- Native fast path: matching dtype keeps fast path; mismatched dtype converts via slow path
- GPU DLPack BufferError in batch-level from_dlpack_list falls back to slow path using __cuda_array_interface__ per tensor
- GPU Tensor construction falls back to __cuda_array_interface__ when __dlpack__ raises BufferError
- Mixed-GPU DLPack batch: ValueError from from_dlpack_list falls back to slow path (tagged multi_gpu, skipped when < 2 GPUs)
- Native fast path mixed-GPU: device-ID mismatch check triggers slow path fallback (tagged multi_gpu, skipped when < 2 GPUs)
Multi-GPU tests are tagged @attr("multi_gpu") / @attr("pytorch","multi_gpu") so they are picked up by TL0_multigpu/test_body.sh without shell changes.
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
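For reference, the tagging mechanism mentioned above looks roughly like the sketch below; the test name and body are illustrative placeholders, not the actual tests.

from nose.plugins.attrib import attr

@attr("pytorch", "multi_gpu")
def test_mixed_gpu_dlpack_falls_back_to_slow_path():
    # Illustrative placeholder: the real tests live in the extended
    # test_tensor.py and skip themselves when fewer than 2 GPUs are present.
    pass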
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Stream handshake (blocking): Derive the DLPack consumer-stream handle from copy_order instead of inspecting the Python stream object's .handle attribute. copy_order is computed via AccessOrderFromPythonStreamObj() which handles all known stream wrapper types; reusing it for the __dlpack__(stream=...) argument guarantees the producer always synchronises with the exact stream DALI will use for the copy.
Read-only CPU buffer (blocking): When __dlpack__() raises BufferError on a CPU tensor (e.g. read-only NumPy array), the fallback now copies the data via np.array(..., copy=True) instead of zero-copying. This prevents DALI from holding a mutable alias of memory the producer explicitly marked as immutable. TypeError (older DLPack, unsupported dtype) still falls back without copying.
GPU DLPack CAI fallback (medium): The GPU Tensor constructor no longer catches BufferError before trying __cuda_array_interface__. A BufferError on a GPU DLPack call may signal a synchronisation or ownership constraint; silently switching to CAI (which has no consumer-stream handshake) could cause stale reads. Only TypeError (protocol mismatch, safe to fall back) is caught.
Update tests: _BrokenDlpack -> _TypeErrorDlpack (raises TypeError) to remain consistent with the tightened GPU except clause.
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
…allback
- Add set_order(copy_order) to both non-contiguous and contiguous return paths of TensorListFromListOfDLPackObjects so that AccessOrder metadata is propagated identically to the single-tensor path
- Retry __dlpack__() without stream kwarg before falling back to CAI; a TypeError from the stream-aware call means the producer refused the handoff, not that DLPack is unavailable entirely
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
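The retry from the second bullet can be sketched as follows; gpu_dlpack_capsule is a hypothetical helper, not the actual _tensor.py code.

def gpu_dlpack_capsule(data, stream_handle):
    # BufferError is intentionally not caught on the GPU path: it may signal a
    # synchronisation or ownership constraint that the CAI fallback (which has
    # no consumer-stream handshake) could not honour.
    try:
        return data.__dlpack__(stream=stream_handle)
    except TypeError:
        # The producer refused the stream-aware handoff (e.g. no stream kwarg);
        # retry plain __dlpack__() before the caller resorts to
        # __cuda_array_interface__.
        return data.__dlpack__()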
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
… and device check
- Add a preflight loop that calls __dlpack_device__() for every object before any __dlpack__() call: validates device homogeneity without consuming the capsule, and resolves the UserStream for the batch device (via UserStream::Get()->GetStream(device_id)) so stream_handle is correct for all tensors including the first one
- When __dlpack_device__ is absent, fall back to the per-tensor UserStream after i==0 (existing behavior) and recompute stream_handle so tensors 1..N still receive the correct consumer-stream handle
- Remove the redundant per-element __dlpack__ device check (now covered by the preflight); retain the post-FillTensorFromDlPack check as a safety net for objects that lack __dlpack_device__
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
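Below is a Python-level sketch of the ordering this preflight enforces (device check before any capsule is consumed); it is illustrative only and does not reproduce the actual backend_impl.cc code or the UserStream resolution.

def preflight_device_check(objs):
    # Query __dlpack_device__() for every element first; unlike __dlpack__(),
    # this does not consume the single-use capsule.
    first = None
    for o in objs:
        dev = o.__dlpack_device__()          # (device_type, device_id)
        if first is None:
            first = dev
        elif dev != first:
            raise ValueError("all tensors must reside on the same device")
    # first is then used to resolve the consumer stream for the whole batch,
    # so even tensor 0 receives the correct stream handle in __dlpack__().
    return first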
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
TensorList::VerifySampleShareCompatibility enforced that every sample shared into a TensorList must have the same is_pinned() value as the batch. For CPU this is correct: host pinned and non-pinned allocations live in different memory spaces and must not be mixed.
For GPU the constraint is unnecessarily strict: SetSample is only ever called in non-contiguous mode (it calls MakeNoncontiguous() first), so each sample owns its own independent allocation. A GPU tensor backed by CUDA pinned host memory (is_pinned()==true) and a regular device tensor (is_pinned()==false) can coexist in the same non-contiguous TensorList because downstream per-sample access respects each sample's own is_pinned flag.
The strict check caused a regression in the dynamic-API cross-GPU batch slow path: copying a tensor from GPU N to GPU 0 via DALI's copy operator routes through a pinned staging buffer, yielding a TensorGPU with is_pinned()==true. Assembling this together with non-pinned tensors from GPU 0 into a non-contiguous TensorListGPU then failed the pinned uniformity check with "Sample must have the same pinned status as target batch".
Fix: guard the pinned check with if constexpr (CPUBackend only), leaving the device_id check intact for both backends.
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Two P2 issues found in code review:
1. backend_impl.cc (TensorListFromListOfDLPackObjects): When no input object implements __dlpack_device__, the preflight loop left copy_order == AccessOrder::host() and stream_handle == None. The main loop then called __dlpack__(stream=None) for tensor 0, meaning the DLPack producer was not informed of the DALI consumer stream and could race against DALI's copy if it had in-flight GPU work. Fix: resolve copy_order via cudaGetDevice() after the preflight loop so all __dlpack__() calls, including tensor 0, receive the same concrete stream handle. Remove the now-dead per-tensor fallback that only fired after tensor 0 was already consumed.
2. _tensor.py (GPU DLPack path in Tensor.__init__): The outer try/except TypeError wrapped both data.__dlpack__(**args) and _backend.TensorGPU(dlpack_capsule, ...). If __dlpack__(**args) succeeded but TensorGPU raised TypeError (e.g. unsupported dtype), the capsule was already consumed; the except block then called data.__dlpack__() a second time. DLPack capsules are single-use, so the second call is undefined behaviour. Fix: separate capsule acquisition from TensorGPU construction. Only the __dlpack__() call itself is inside the outer try; once a capsule is obtained, TensorGPU construction is outside any retry block.
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
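A hedged sketch of the corrected ordering from point 2; construct and construct_from_cai are placeholders for _backend.TensorGPU construction and the __cuda_array_interface__ path, and the earlier retry without the stream kwarg is omitted for brevity.

def acquire_and_construct(data, stream_handle, construct, construct_from_cai):
    # Only the __dlpack__() call sits inside the retry block.
    try:
        capsule = data.__dlpack__(stream=stream_handle)
    except TypeError:
        # No usable DLPack handshake at all: use __cuda_array_interface__.
        return construct_from_cai(data.__cuda_array_interface__)
    # The capsule is single-use; from here on, construction errors must
    # propagate rather than trigger a second __dlpack__() call.
    return construct(capsule)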
VerifySampleShareCompatibility no longer enforces is_pinned() uniformity for GPUBackend: pinned (host-accessible) and regular device tensors can coexist in a non-contiguous TensorListGPU because each sample owns its own independent allocation.
SetRequiredSetters now conditionally includes the "pinned" entry only for CPUBackend, matching the enforcement in VerifySampleShareCompatibility. GPU TensorList permutation iterations no longer expect an exception when the "pinned" setter is excluded.
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Copy(non_contiguous, copy_order) queues GPU work on copy_order, but set_order(copy_order) was never called on the returned TensorList. This left it with AccessOrder::host(), causing downstream consumers to use a host barrier instead of the correct GPU stream, allowing races with the in-flight copy. Mirrors the existing set_order calls in TensorListFromListOfDLPackObjects. Both the contiguous and non-contiguous GPU paths are fixed.
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
wait_order was always AccessOrder::host() and never updated, making both copy_order.wait(wait_order) calls inert. Remove the variable and the two dead wait calls, matching the DLPack path which omits them entirely.
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Relax FillTensorFromDlPack<CPUBackend> to accept kDLCUDAHost in addition to kDLCPU - the function body already handled the pinned-memory case (is_pinned flag, device_id) but the DALI_ENFORCE blocked it from being reached. Extend the DLPack fast path in _batch.py to match, so pinned CPU tensors (e.g. PyTorch tensor.pin_memory()) use the fast path instead of falling through to the slow path.
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
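A usage illustration of the pinned fast path, assuming PyTorch is available, the dynamic API is imported as ndd (as in the snippets above), and as_batch is the entry point referenced elsewhere in this PR.

import torch
import nvidia.dali.experimental.dynamic as ndd

# Pinned host tensors report a host-accessible DLPack device; with this change
# they are handed to TensorListCPU.from_dlpack_list instead of the slow path.
pinned = [torch.arange(4, dtype=torch.float32).pin_memory() for _ in range(8)]
batch = ndd.as_batch(pinned)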
Remove BufferError from the GPU DLPack fast-path except clause in _batch.py — consistent with _tensor.py's documented intent that BufferError must not be caught on GPU, and prevents a 'capsule already consumed' error if the slow path is entered after partial consumption.
Add a preflight loop to TensorListFromListOfDLPackObjectsCPU mirroring the GPU counterpart: checks __dlpack_device__() on every element before any __dlpack__() call, throwing py::value_error (caught by the Python fast-path except clause) for non-CPU device types instead of leaking a RuntimeError from DALI_ENFORCE after capsules are partially consumed.
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
720086b to c553469
Category:
Bug fix (non-breaking change which fixes an issue)
Description:
- Adds C++ bulk DLPack constructor for TensorListGPU: accepts a Python list of DLPack-compatible objects (e.g. PyTorch GPU tensors) and builds the TensorList in a single pass, recording a CUDA event on the provided stream. Passes `stream` as a keyword argument to `__dlpack__()` for compatibility with NumPy ≥ 1.22 and JAX, which define `def __dlpack__(self, *, stream=None)`.
- Adds native DALI fast path in Batch.__init__: when given a list of already-evaluated ndd.Tensor objects, pass their _storage objects directly to TensorListGPU/CPU constructors, preserving all DALI metadata (layout, enum types) without going through DLPack.
- Adds GPU DLPack fast path in Batch.__init__: when given a list of external GPU tensors (e.g. PyTorch) that support DLPack, use the new C++ bulk constructor to avoid per-tensor Python overhead.
- Adds DLPack fallback for CPU read-only arrays in Tensor.__init__: catch BufferError from `__dlpack__()` and fall back to the `__array__` interface. This makes the following work:
  arr = np.array([1, 2, 3])
  arr.flags.writeable = False
  ndd.as_tensor(arr)  # previously raised BufferError
- Fixes pinned-memory mismatch in the cross-GPU batch slow path: TensorList::VerifySampleShareCompatibility no longer enforces is_pinned() uniformity for GPUBackend (only CPUBackend). DALI's copy operator routes cross-GPU transfers through a pinned staging buffer, producing a TensorGPU with is_pinned()==true. Combining such a tensor with non-pinned samples in a non-contiguous TensorListGPU previously raised "Sample must have the same pinned status as target batch", causing as_batch() to fail for mixed-GPU inputs. For GPU, each sample in a non-contiguous TensorList owns its own allocation, so mixed pinned/non-pinned is safe.
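To make the intended use concrete, a hedged end-to-end example of the GPU bulk fast path; it assumes PyTorch, the dynamic API imported as ndd, and that as_batch routes a list of external GPU tensors through the Batch.__init__ fast path described above.

import torch
import nvidia.dali.experimental.dynamic as ndd

tensors = [torch.randn(3, 224, 224, device="cuda") for _ in range(16)]
batch = ndd.as_batch(tensors)
# All 16 tensors are handed over in a single from_dlpack_list call on the C++
# side, with one CUDA event recorded on the consumer stream, instead of 16
# per-tensor Python round-trips.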
Additional information:
Affected modules and functionalities:
Key points relevant for the review:
Tests:
- extended test_tensor.py
Checklist
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: DALI-4580