
Improve DLPack support for external tensor consumption#6261

Open
JanuszL wants to merge 22 commits into NVIDIA:main from JanuszL:dynamic_as_batch_fast_path

Conversation

@JanuszL
Contributor

@JanuszL JanuszL commented Mar 18, 2026


Category:

Bug fix (non-breaking change which fixes an issue)

Description:

  • Adds C++ bulk DLPack constructor for TensorListGPU: accepts a Python
    list of DLPack-compatible objects (e.g. PyTorch GPU tensors) and
    builds the TensorList in a single pass, recording a CUDA event on
    the provided stream. Passes stream as a keyword argument to
    __dlpack__() for compatibility with NumPy ≥ 1.22 and JAX which
    define def __dlpack__(self, *, stream=None).

  • Adds native DALI fast path in Batch.__init__: when given a list of
    already-evaluated ndd.Tensor objects, pass their _storage objects
    directly to TensorListGPU/CPU constructors, preserving all DALI
    metadata (layout, enum types) without going through DLPack.

  • Adds GPU DLPack fast path in Batch.__init__: when given a list of
    external GPU tensors (e.g. PyTorch) that support DLPack, use the
    new C++ bulk constructor to avoid per-tensor Python overhead.

  • Adds DLPack fallback for CPU read-only arrays in Tensor.__init__:
    catch BufferError from __dlpack__() and fall back to the __array__
    interface. This makes the following work:

    arr = np.array([1, 2, 3])
    arr.flags.writeable = False
    ndd.as_tensor(arr) # previously raised BufferError

  • Fixes pinned-memory mismatch in the cross-GPU batch slow path:
    TensorList::VerifySampleShareCompatibility no longer enforces
    is_pinned() uniformity for GPUBackend (only CPUBackend). DALI's
    copy operator routes cross-GPU transfers through a pinned staging
    buffer, producing a TensorGPU with is_pinned()==true. Combining
    such a tensor with non-pinned samples in a non-contiguous
    TensorListGPU previously raised "Sample must have the same pinned
    status as target batch", causing as_batch() to fail for mixed-GPU
    inputs. For GPU, each sample in a non-contiguous TensorList owns
    its own allocation, so mixed pinned/non-pinned is safe.
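
Taken together, a minimal usage sketch of the new paths (illustrative only: it assumes ndd is nvidia.dali.experimental.dynamic and that as_batch/as_tensor accept plain lists and arrays as shown; this is not copied from the actual tests):

    import numpy as np
    import torch
    import nvidia.dali.experimental.dynamic as ndd  # assumed import path for the dynamic API

    # GPU DLPack fast path: a list of external GPU tensors goes through the
    # bulk C++ constructor instead of a per-tensor Python loop.
    gpu_batch = ndd.as_batch([torch.randn(3, 4, device="cuda") for _ in range(8)])

    # Native DALI fast path: already-evaluated ndd.Tensor objects keep their
    # DALI metadata (layout, enum types) without a DLPack round trip.
    native_batch = ndd.as_batch([ndd.as_tensor(np.zeros((2, 2))) for _ in range(4)])

    # Read-only CPU array: __dlpack__() raises BufferError, so as_tensor
    # falls back to the array interface instead of failing.
    arr = np.array([1, 2, 3])
    arr.flags.writeable = False
    t = ndd.as_tensor(arr)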

Additional information:

Affected modules and functionalities:

Key points relevant for the review:

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
      • extended test_batch.py
      • extended test_tensor.py
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-4580

@JanuszL JanuszL force-pushed the dynamic_as_batch_fast_path branch from 895265b to e432f38 on March 18, 2026 11:01
@JanuszL
Contributor Author

JanuszL commented Mar 18, 2026

!build

@dali-automaton
Collaborator

CI MESSAGE: [46421794]: BUILD STARTED

@greptile-apps
Contributor

greptile-apps Bot commented Mar 18, 2026

Greptile Summary

This PR improves DLPack support across DALI's dynamic mode by adding C++ bulk constructors (TensorListGPU/CPU.from_dlpack_list), three fast paths in Batch.__init__ (native DALI storage, GPU DLPack, CPU DLPack), a BufferError fallback for read-only CPU arrays in Tensor.__init__, and a pinned-memory check relaxation for GPUBackend in VerifySampleShareCompatibility.

  • P1 — The GPU DLPack fast path in _batch.py calls TensorListGPU.from_dlpack_list (consuming all capsules) before checking whether the resulting storage dtype matches the requested dtype. If there is a mismatch, fast_path_used stays False and the slow path re-calls __dlpack__() on already-consumed capsules, violating the DLPack single-use contract. The test at line 777 documents the intended behaviour ("fast path is skipped") but the guard is missing.

Confidence Score: 3/5

Safe to merge after fixing the GPU DLPack dtype-check ordering; all other paths are correctly implemented.

One confirmed P1 bug: GPU DLPack capsules are consumed by from_dlpack_list before the dtype compatibility check, so a dtype mismatch silently falls through to the slow path with exhausted capsules. This relies on PyTorch-specific behaviour and breaks for any strict DLPack producer. Score is 3 (below the P1 ceiling of 4) because the affected code path is exercised by the new test suite but the test itself relies on the same PyTorch leniency that hides the bug.

dali/python/nvidia/dali/experimental/dynamic/_batch.py — GPU DLPack fast path dtype-guard ordering (lines 534–553)

Important Files Changed

  • dali/pipeline/data/tensor_list.cc: Removes the pinned-memory uniformity check for GPUBackend in VerifySampleShareCompatibility; the check is retained for CPUBackend. Correct fix for mixed-pinned cross-GPU batches.
  • dali/python/backend_impl.cc: Adds TensorListFromListOfDLPackObjects (GPU) and TensorListFromListOfDLPackObjectsCPU; fixes set_order in TensorListFromListOfTensors; adds a device-ID consistency check; set_order is now correctly called on both the contiguous and non-contiguous GPU output paths.
  • dali/python/nvidia/dali/experimental/dynamic/_batch.py: Adds three fast paths (native DALI, GPU DLPack, CPU DLPack) to Batch.__init__; the GPU DLPack fast path consumes capsules before checking the dtype match, risking use-after-consume in the slow-path fallback.
  • dali/python/nvidia/dali/experimental/dynamic/_tensor.py: Adds a BufferError fallback for read-only CPU arrays; adds a TypeError retry without the stream keyword for older GPU DLPack producers; correctly separates capsule acquisition from TensorGPU construction to avoid double-consume.
  • dali/util/user_stream.h: GetStream(size_t dev) moved earlier in the file for forward-declaration purposes; no behavior change.
  • dali/test/python/experimental_mode/test_batch.py: Adds 10 new test cases covering the native DALI, GPU DLPack, and CPU DLPack fast paths, dtype fallback, and multi-GPU error paths.
  • dali/test/python/experimental_mode/test_tensor.py: Adds tests for read-only numpy scalars/arrays and the GPU __cuda_array_interface__ fallback when __dlpack__ raises TypeError.
  • qa/TL0_multigpu/test_body.sh: Adds multi-GPU + cupy test execution to the TL0 QA script.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Batch.__init__(tensors=list)"] --> B{list empty?}
    B -- yes --> SLOW
    B -- no --> C{First elem is ndd.Tensor with _storage?}
    C -- yes --> D["Native DALI fast path"]
    D --> D1{Same type and device?}
    D1 -- yes --> D2["fast_path_used=True"]
    D1 -- no --> E
    C -- no --> E{Has __dlpack_device__?}
    E -- yes --> F{dev_type == GPU?}
    F -- yes --> G["GPU DLPack: from_dlpack_list — capsules consumed"]
    G --> G1{"dtype match? checked AFTER consume"}
    G1 -- yes --> G2["fast_path_used=True"]
    G1 -- no --> SLOW
    F -- no --> H{dev_type == CPU?}
    H -- yes --> I["CPU DLPack: from_dlpack_list"]
    I --> I1{success?}
    I1 -- yes --> I2["fast_path_used=True"]
    I1 -- no --> SLOW
    E -- no --> SLOW
    H -- no --> SLOW
    SLOW["Slow path: per-sample Tensor loop"]
    SLOW --> J["Result: Batch"]
    D2 --> J
    G2 --> J
    I2 --> J

Comments Outside Diff (1)

  1. dali/python/nvidia/dali/experimental/dynamic/_batch.py, lines 534-553

    GPU DLPack capsules consumed before dtype mismatch is detected

    TensorListGPU.from_dlpack_list iterates through all elements and consumes every DLPack capsule before returning. The dtype compatibility check runs after that C++ call returns (line 544). If the storage dtype doesn't match the requested dtype, fast_path_used stays False and the slow path re-calls __dlpack__() on every tensor — but the capsules are already consumed. This works for PyTorch (which creates a new capsule per call) but violates the DLPack single-use guarantee and will silently fail or crash for any producer that enforces it.

    The test comment at line 777 even says "When dtype is specified the fast path is skipped", which is the intent — but the code doesn't actually skip it.

    The simplest fix is to gate the DLPack fast path on dtype is None, since type conversion requires the slow path anyway:

    if not fast_path_used and (
        dtype is None  # skip if conversion needed — can't peek dtype without consuming capsule
        and len(tensors_list) > 0
        and not isinstance(tensors_list[0], Tensor)
        and hasattr(tensors_list[0], "__dlpack_device__")
    ):



@JanuszL JanuszL marked this pull request as draft March 18, 2026 11:52
@JanuszL JanuszL force-pushed the dynamic_as_batch_fast_path branch from e432f38 to 4b868b5 on March 18, 2026 11:56
@JanuszL
Contributor Author

JanuszL commented Mar 18, 2026

@greptileai - can you re-review?

@JanuszL JanuszL force-pushed the dynamic_as_batch_fast_path branch 2 times, most recently from d1c2614 to c9793e7 on March 18, 2026 12:09
@JanuszL JanuszL marked this pull request as ready for review March 18, 2026 12:11
@JanuszL
Contributor Author

JanuszL commented Mar 18, 2026

!build

@dali-automaton
Collaborator

CI MESSAGE: [46425910]: BUILD STARTED

@JanuszL JanuszL force-pushed the dynamic_as_batch_fast_path branch from c9793e7 to 50843aa on March 18, 2026 12:34
@JanuszL
Contributor Author

JanuszL commented Mar 18, 2026

@greptileai - can you re-review?

@dali-automaton
Collaborator

CI MESSAGE: [46425910]: BUILD PASSED

@JanuszL JanuszL force-pushed the dynamic_as_batch_fast_path branch 3 times, most recently from 61611fe to b7a90c8 on March 18, 2026 15:13
@JanuszL
Contributor Author

JanuszL commented Mar 18, 2026

!build

@dali-automaton
Collaborator

CI MESSAGE: [46436927]: BUILD STARTED

@JanuszL
Contributor Author

JanuszL commented Mar 18, 2026

@greptileai - can you re-review?

@dali-automaton
Collaborator

CI MESSAGE: [46436927]: BUILD PASSED

@JanuszL
Contributor Author

JanuszL commented Apr 20, 2026

!build

@dali-automaton
Collaborator

CI MESSAGE: [48976609]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [48976609]: BUILD FAILED

@JanuszL
Contributor Author

JanuszL commented Apr 20, 2026

!build

@dali-automaton
Collaborator

CI MESSAGE: [49016406]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [49016406]: BUILD PASSED

JanuszL added 22 commits April 27, 2026 12:36
- Adds C++ bulk DLPack constructor for TensorListGPU: accepts a Python
  list of DLPack-compatible objects (e.g. PyTorch GPU tensors) and
  builds the TensorList in a single pass, recording a CUDA event on
  the provided stream. Passes `stream` as a keyword argument to
  `__dlpack__()` for compatibility with NumPy ≥ 1.22 and JAX which
  define `def __dlpack__(self, *, stream=None)`.
- Adds native DALI fast path in Batch.__init__: when given a list of
  already-evaluated ndd.Tensor objects, pass their _storage objects
  directly to TensorListGPU/CPU constructors, preserving all DALI
  metadata (layout, enum types) without going through DLPack.
- Adds GPU DLPack fast path in Batch.__init__: when given a list of
  external GPU tensors (e.g. PyTorch) that support DLPack, use the
  new C++ bulk constructor to avoid per-tensor Python overhead.
- Adds DLPack fallback for CPU read-only arrays in Tensor.__init__:
  catch BufferError from __dlpack__() and fall back to __array__
  interface. This makes the following work:

    arr = np.array([1, 2, 3])
    arr.flags.writeable = False
    ndd.as_tensor(arr)  # previously raised BufferError

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
- Rename TensorListGPU DLPack list constructor to from_dlpack_list static
  method to avoid ambiguity with the list-of-TensorGPU constructor
- Add TensorListFromListOfDLPackObjectsCPU and TensorListCPU.from_dlpack_list
  for CPU DLPack tensors; extend _batch.py fast path to kDLCPU
- Fix stream_handle assignment to only fall back to raw stream object when
  it is an integer, avoiding forwarding non-integer stream wrappers to __dlpack__
- Add GPU DLPack fallback to __cuda_array_interface__ in _tensor.py when
  __dlpack__ raises BufferError or TypeError
- Drop underscore prefix from all fast-path locals; rename _backend_type to
  BackendType and _stream to cuda_stream
- Remove overly conservative dtype is None guard from both fast-path outer
  conditions; check dtype per-tensor in the native loop, post-call in DLPack
- Add per-tensor GPU device-ID consistency check inside the native fast-path
  loop using first_dev_id; break early on device mismatch
- Replace _native_valid boolean flag with for-else idiom

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Catch ValueError (mixed-device C++ error from from_dlpack_list) and
BufferError (failed __dlpack__ handshake) in addition to TypeError,
so both GPU and CPU DLPack fast paths fall back to per-sample Tensor
construction instead of propagating errors to the caller.

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
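
A self-contained toy of the fallback shape described in the commit above (bulk_import and per_sample_import are placeholders, not DALI code; note that a later commit in this series drops BufferError from the GPU clause again):

    def bulk_import(samples):
        # Stand-in for from_dlpack_list; simulate a failed __dlpack__ handshake.
        raise BufferError("producer refused the handshake")

    def per_sample_import(sample):
        # Stand-in for per-sample Tensor construction in the slow path.
        return sample

    def build_storage(samples):
        try:
            return bulk_import(samples)                      # fast path
        except (TypeError, ValueError, BufferError):
            return [per_sample_import(s) for s in samples]   # slow path

    print(build_storage([1, 2, 3]))  # falls back to the slow path: [1, 2, 3]
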
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
- Native fast path: matching dtype keeps fast path; mismatched dtype
  converts via slow path
- GPU DLPack BufferError in batch-level from_dlpack_list falls back to
  slow path using __cuda_array_interface__ per tensor
- GPU Tensor construction falls back to __cuda_array_interface__ when
  __dlpack__ raises BufferError
- Mixed-GPU DLPack batch: ValueError from from_dlpack_list falls back to
  slow path (tagged multi_gpu, skipped when < 2 GPUs)
- Native fast path mixed-GPU: device-ID mismatch check triggers slow
  path fallback (tagged multi_gpu, skipped when < 2 GPUs)

Multi-GPU tests are tagged @attr("multi_gpu") / @attr("pytorch","multi_gpu")
so they are picked up by TL0_multigpu/test_body.sh without shell changes.

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Stream handshake (blocking):
  Derive the DLPack consumer-stream handle from copy_order instead of
  inspecting the Python stream object's .handle attribute.  copy_order is
  computed via AccessOrderFromPythonStreamObj() which handles all known
  stream wrapper types; reusing it for the __dlpack__(stream=...) argument
  guarantees the producer always synchronises with the exact stream DALI
  will use for the copy.

Read-only CPU buffer (blocking):
  When __dlpack__() raises BufferError on a CPU tensor (e.g. read-only
  NumPy array), the fallback now copies the data via np.array(..., copy=True)
  instead of zero-copying.  This prevents DALI from holding a mutable alias
  of memory the producer explicitly marked as immutable.  TypeError (older
  DLPack, unsupported dtype) still falls back without copying.

GPU DLPack CAI fallback (medium):
  The GPU Tensor constructor no longer catches BufferError before trying
  __cuda_array_interface__.  A BufferError on a GPU DLPack call may signal
  a synchronisation or ownership constraint; silently switching to CAI
  (which has no consumer-stream handshake) could cause stale reads.
  Only TypeError (protocol mismatch, safe to fall back) is caught.

  Update tests: _BrokenDlpack -> _TypeErrorDlpack (raises TypeError)
  to remain consistent with the tightened GPU except clause.

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
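
A minimal sketch of the read-only CPU fallback described in the commit above (illustrative; it assumes a NumPy version that refuses to export read-only arrays, and the real code lives in Tensor.__init__ and imports the copy through the __array__ interface rather than DLPack):

    import numpy as np

    arr = np.array([1, 2, 3])
    arr.flags.writeable = False
    try:
        capsule = arr.__dlpack__()      # zero-copy when the producer allows it
    except BufferError:
        # Read-only source: take an owned copy so DALI never holds a mutable
        # alias of memory the producer explicitly marked immutable.
        data = np.array(arr, copy=True)
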
…allback

- Add set_order(copy_order) to both non-contiguous and contiguous return
  paths of TensorListFromListOfDLPackObjects so that AccessOrder metadata
  is propagated identically to the single-tensor path
- Retry __dlpack__() without stream kwarg before falling back to CAI;
  a TypeError from the stream-aware call means the producer refused the
  handoff, not that DLPack is unavailable entirely

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
… and device check

- Add a preflight loop that calls __dlpack_device__() for every object before
  any __dlpack__() call: validates device homogeneity without consuming the
  capsule, and resolves the UserStream for the batch device (via
  UserStream::Get()->GetStream(device_id)) so stream_handle is correct for
  all tensors including the first one
- When __dlpack_device__ is absent, fall back to the per-tensor UserStream
  after i==0 (existing behavior) and recompute stream_handle so tensors
  1..N still receive the correct consumer-stream handle
- Remove the redundant per-element __dlpack__ device check (now covered by
  the preflight); retain the post-FillTensorFromDlPack check as a safety net
  for objects that lack __dlpack_device__

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
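
A hedged Python rendering of the preflight idea from the commit above (the real implementation is C++ in backend_impl.cc; resolve_consumer_stream is a placeholder for UserStream::Get()->GetStream(device_id)):

    kDLCPU = 1  # DLPack device-type code for host memory

    def resolve_consumer_stream(device_id):
        return 1  # placeholder: 1 denotes the legacy default CUDA stream

    def bulk_import(objs):
        # Preflight: query every device without consuming a single capsule.
        devices = [obj.__dlpack_device__() for obj in objs]
        if len(set(devices)) != 1:
            raise ValueError("all samples must live on the same device")
        dev_type, dev_id = devices[0]
        # CPU producers expect stream=None; GPU producers receive the consumer
        # stream so they can synchronize before handing the memory over.
        stream = None if dev_type == kDLCPU else resolve_consumer_stream(dev_id)
        return [obj.__dlpack__(stream=stream) for obj in objs]

    import numpy as np
    capsules = bulk_import([np.arange(4), np.arange(4)])  # CPU example
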
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
TensorList::VerifySampleShareCompatibility enforced that every sample
shared into a TensorList must have the same is_pinned() value as the
batch.  For CPU this is correct: host pinned and non-pinned allocations
live in different memory spaces and must not be mixed.  For GPU the
constraint is unnecessarily strict: SetSample is only ever called in
non-contiguous mode (it calls MakeNoncontiguous() first), so each sample
owns its own independent allocation.  A GPU tensor backed by CUDA pinned
host memory (is_pinned()==true) and a regular device tensor
(is_pinned()==false) can coexist in the same non-contiguous TensorList
because downstream per-sample access respects each sample's own is_pinned
flag.

The strict check caused a regression in the dynamic-API cross-GPU batch
slow path: copying a tensor from GPU N to GPU 0 via DALI's copy operator
routes through a pinned staging buffer, yielding a TensorGPU with
is_pinned()==true.  Assembling this together with non-pinned tensors from
GPU 0 into a non-contiguous TensorListGPU then failed the pinned
uniformity check with "Sample must have the same pinned status as target
batch".

Fix: guard the pinned check with if constexpr (CPUBackend only), leaving
the device_id check intact for both backends.

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
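
An illustration of the scenario this relaxation enables (hedged: it requires at least two GPUs, and the as_batch call shape is assumed from the PR description rather than verified against the API):

    import torch
    import nvidia.dali.experimental.dynamic as ndd

    a = torch.randn(2, 2, device="cuda:1")  # lives on GPU 1
    b = torch.randn(2, 2, device="cuda:0")  # lives on GPU 0
    # The slow path copies `a` to GPU 0 through a pinned staging buffer, so its
    # sample reports is_pinned()==True next to b's non-pinned sample; both can
    # now share one non-contiguous TensorListGPU.
    batch = ndd.as_batch([a, b])
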
Two P2 issues found in code review:

1. backend_impl.cc (TensorListFromListOfDLPackObjects):
   When no input object implements __dlpack_device__, the preflight
   loop left copy_order == AccessOrder::host() and stream_handle ==
   None.  The main loop then called __dlpack__(stream=None) for
   tensor 0, meaning the DLPack producer was not informed of the DALI
   consumer stream and could race against DALI's copy if it had
   in-flight GPU work.  Fix: resolve copy_order via cudaGetDevice()
   after the preflight loop so all __dlpack__() calls — including
   tensor 0 — receive the same concrete stream handle.  Remove the
   now-dead per-tensor fallback that only fired after tensor 0 was
   already consumed.

2. _tensor.py (GPU DLPack path in Tensor.__init__):
   The outer try/except TypeError wrapped both data.__dlpack__(**args)
   and _backend.TensorGPU(dlpack_capsule, ...).  If __dlpack__(**args)
   succeeded but TensorGPU raised TypeError (e.g. unsupported dtype),
   the capsule was already consumed; the except block then called
   data.__dlpack__() a second time.  DLPack capsules are single-use,
   so the second call is undefined behaviour.  Fix: separate capsule
   acquisition from TensorGPU construction.  Only the __dlpack__()
   call itself is inside the outer try; once a capsule is obtained,
   TensorGPU construction is outside any retry block.

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
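
A runnable sketch of the single-use-capsule pattern from item 2 above (consume_capsule stands in for _backend.TensorGPU construction and is not a DALI symbol):

    import numpy as np

    def consume_capsule(capsule):
        return capsule  # placeholder: the real code wraps the capsule in a TensorGPU

    def import_tensor(data, stream):
        # Only capsule acquisition sits in the retry block; once a capsule exists
        # it must never be requested again, because DLPack capsules are single-use.
        try:
            capsule = data.__dlpack__(stream=stream)
        except TypeError:
            capsule = data.__dlpack__()  # producer without the stream keyword
        return consume_capsule(capsule)

    t = import_tensor(np.arange(3), stream=None)
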
VerifySampleShareCompatibility no longer enforces is_pinned() uniformity
for GPUBackend: pinned (host-accessible) and regular device tensors can
coexist in a non-contiguous TensorListGPU because each sample owns its
own independent allocation.

SetRequiredSetters now conditionally includes the "pinned" entry only for
CPUBackend, matching the enforcement in VerifySampleShareCompatibility.
GPU TensorList permutation iterations no longer expect an exception when
the "pinned" setter is excluded.

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Copy(non_contiguous, copy_order) queues GPU work on copy_order, but
set_order(copy_order) was never called on the returned TensorList.
This left it with AccessOrder::host(), causing downstream consumers
to use a host barrier instead of the correct GPU stream, allowing
races with the in-flight copy.

Mirrors the existing set_order calls in TensorListFromListOfDLPackObjects.
Both the contiguous and non-contiguous GPU paths are fixed.

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
wait_order was always AccessOrder::host() and never updated, making
both copy_order.wait(wait_order) calls inert. Remove the variable
and the two dead wait calls, matching the DLPack path which omits
them entirely.

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Relax FillTensorFromDlPack<CPUBackend> to accept kDLCUDAHost in addition
to kDLCPU - the function body already handled the pinned-memory case
(is_pinned flag, device_id) but the DALI_ENFORCE blocked it from being
reached. Extend the DLPack fast path in _batch.py to match, so pinned
CPU tensors (e.g. PyTorch tensor.pin_memory()) use the fast path instead
of falling through to the slow path.

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
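
A hedged example of the pinned-CPU case described above (it assumes, per the commit, that a PyTorch pin_memory() tensor reports kDLCUDAHost through __dlpack_device__ and is therefore taken by the CPU fast path):

    import torch
    import nvidia.dali.experimental.dynamic as ndd

    pinned = [torch.arange(6).reshape(2, 3).pin_memory() for _ in range(4)]
    batch = ndd.as_batch(pinned)  # handled by the CPU DLPack fast path
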
Remove BufferError from the GPU DLPack fast-path except clause in
_batch.py — consistent with _tensor.py's documented intent that
BufferError must not be caught on GPU, and prevents a 'capsule already
consumed' error if the slow path is entered after partial consumption.

Add a preflight loop to TensorListFromListOfDLPackObjectsCPU mirroring
the GPU counterpart: checks __dlpack_device__() on every element before
any __dlpack__() call, throwing py::value_error (caught by the Python
fast-path except clause) for non-CPU device types instead of leaking a
RuntimeError from DALI_ENFORCE after capsules are partially consumed.

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL JanuszL force-pushed the dynamic_as_batch_fast_path branch from 720086b to c553469 on April 27, 2026 10:42