Silent data corruption (dropped tail of an async copy/kernel) under multi-process contention on multi-tile (PVC) devices
Summary
On Intel Data Center GPU Max (PVC) under the COMPOSITE device hierarchy, running
many processes that share one card intermittently produces silent data corruption:
the tail (last element / last work-item) of an asynchronously-submitted operation is
dropped, even though a synchronize happens before the result is read on the host. It has
never been observed in a single process (200k+ clean round-trips) and appears only under
heavy multi-process oversubscription of a single tile/card.
Root cause (established experimentally, see below): the default device is the COMPOSITE
root device, which spans both tiles of the card, so operations use implicit cross-tile
(multi-stack) scaling. Under contention, the portion of the op handled by the second
tile fails to retire, dropping the tail. Pinning to a single tile eliminates it.
Environment
- Intel Data Center GPU Max 1550 (PVC), 6 cards × 2 tiles, Aurora
- ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
- Julia 1.12, oneAPI.jl 2.6.x, NEO (LTS) 25.x
- Failure requires ~12–24 processes contending on one card + allocation/GC churn
Symptoms
A: dropped tail of a host→device copy. A round-trip Array → oneArray → Array of a
10-element Int32 vector returns with the last 2 elements zeroed (32 of 40 bytes
delivered; 0 is the fresh device buffer's initial content).
B: dropped tail of a kernel (no host memory involved). B = A + J (with J a
UniformScaling) runs as similar / copyto! / diagonal kernel; the result is missing +J on
exactly B[n,n], i.e. the kernel's last work-item never applied its read-modify-write.
B[n,n] reads back as the copied value, so the buffer was intact; the work was dropped.
In both cases the dropped data is always the last element/work-item.
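To tell a dropped tail apart from scattered corruption, the host-side comparison can be made explicit. The following is a minimal sketch that only inspects plain Arrays after the device→host copy; `dropped_tail` is a hypothetical helper name, not part of oneAPI.jl.

```julia
# Hypothetical helper (not part of oneAPI.jl): classify a mismatch between the
# data sent to the device and the data read back.
# Returns 0 when the arrays agree, the number of trailing elements that differ
# when every mismatch lies in one contiguous run ending at the last element,
# and `nothing` when the mismatch is not confined to the tail.
function dropped_tail(expected::AbstractVector, got::AbstractVector)
    length(expected) == length(got) || return nothing
    bad = findall(expected .!= got)
    isempty(bad) && return 0
    first_bad = first(bad)
    bad == collect(first_bad:length(expected)) || return nothing
    return length(expected) - first_bad + 1
end

# Symptom A as described above: 10 Int32 values, last 2 read back as zero.
sent = Int32.(1:10)
received = copy(sent); received[9:10] .= 0
n = dropped_tail(sent, received)
delivered = (length(sent) - n) * sizeof(Int32)
println("dropped $n elements; $delivered of $(sizeof(sent)) bytes delivered")
```

For the example values this reports 2 dropped elements and 32 of 40 bytes delivered, matching the description of symptom A. A tail value that happens to equal the buffer's initial content would go unnoticed, so the helper can under-count.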
Minimal reproducers
# Symptom A: host<->device round-trip under pool churn
using oneAPI, Random
function churn(; iters=200)
    fails = 0; Random.seed!(1); keep = Vector{Any}(undef, 8)
    for i in 1:iters
        for (j, n) in enumerate((3, 7, 16, 64, 100, 256, 9, 33))
            keep[j] = oneArray(rand(Int32, n))   # churn the pool
        end
        Bc = rand(Int32, 10)
        B = oneArray(Bc)                         # H2D (async)
        Array(B) != Bc && (fails += 1)           # D2H (synchronizes)
    end
    println("pid=$(getpid()) fails=$fails/$iters")
end
churn()
# Symptom B: A + UniformScaling
using oneAPI, Random, LinearAlgebra
function churn(; iters=200)
    eltypes = (ComplexF32, Float32)
    wrappers = (identity, UnitLowerTriangular, UnitUpperTriangular,
                LowerTriangular, UpperTriangular, Hermitian, Symmetric)
    Random.seed!(1); keep = Vector{Any}(undef, 8); fails = 0
    for i in 1:iters
        for (j, n) in enumerate((3, 7, 16, 64, 100, 256, 9, 33))
            keep[j] = oneArray(rand(Float32, n))   # churn the pool
        end
        for T1 in eltypes, T2 in eltypes, f in wrappers
            x = ones(T1, 5, 5); y = oneArray(x)
            J = one(T2) * I
            host = oneAPI.@allowscalar collect(f(x) + J)   # CPU reference
            gpu = oneAPI.@allowscalar collect(f(y) + J)    # GPU result
            !(gpu ≈ host) && (fails += 1)
        end
    end
    # fails counts mismatching (T1, T2, wrapper) combinations, so it can exceed iters
    println("pid=$(getpid()) fails=$fails/$iters")
end
churn()
Run many processes on one card (no affinity mask → default COMPOSITE root device):
for i in $(seq 1 24); do julia --project mwe.jl & done; wait
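The same contention run can also be driven from Julia, which makes it easy to switch between the default root device and a single tile. This is only a sketch: the function names, the script name `mwe.jl`, and the mask value `"0.0"` (card 0, tile 0) are illustrative assumptions.

```julia
# Sketch of a launcher: start `nprocs` copies of a reproducer script, optionally
# pinned to one tile through ZE_AFFINITY_MASK. Names and values are illustrative.
function reproducer_cmd(script::AbstractString; mask::Union{Nothing,String}=nothing)
    cmd = `$(Base.julia_cmd()) --project $script`
    # addenv keeps the inherited environment and adds/overrides one variable.
    return mask === nothing ? cmd : addenv(cmd, "ZE_AFFINITY_MASK" => mask)
end

function contend(script::AbstractString; nprocs::Integer=24, mask=nothing)
    cmd = reproducer_cmd(script; mask)
    # wait=false starts each process asynchronously; pipeline routes the
    # children's output to this terminal instead of discarding it.
    procs = [run(pipeline(cmd; stdout=stdout, stderr=stderr); wait=false)
             for _ in 1:nprocs]
    foreach(wait, procs)
    return count(!success, procs)   # processes that exited with an error
end

# contend("mwe.jl")               # default: root device, both tiles
# contend("mwe.jl"; mask="0.0")   # single tile, i.e. ZE_AFFINITY_MASK=<dev>.0
```

The return value only counts processes that exited abnormally; the mismatch counts still come from the `fails=` lines each reproducer prints.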
Evidence / root cause
24 processes, 40 iterations each, on one card:
| Config | Symptom A | Symptom B |
|---|---|---|
| default (root device, both tiles, implicit scaling) | 872 mismatches | 6307 mismatches |
| ZE_AFFINITY_MASK=<dev>.0 (single tile) | 0 | 0 |
| synchronize after every execute! (per-submission completion) | 0 | 0 |
- Single-process never fails → trigger is multi-process oversubscription, not a
logic/eltype bug.
- Pinning to a single tile fixes both → the trigger is implicit cross-tile scaling on
the root device.
- A per-submission synchronize also fixes both, but serializes all GPU work
(~3.3× slower in the B reproducer: 46 s → 153 s), so it is not a good global default.
- Disproven: command-list lifetime (retaining all ZeCommandLists fixes nothing); a
free-from-finalizer race (the victim buffer is provably alive: B reads back the copied
value). BLOCKING_FREE blocks per spec.
This looks like a NEO / Level-Zero implicit-scaling completion issue under single-CCS
multi-process oversubscription (a whole-queue zeCommandQueueSynchronize does not reliably
cover an earlier separately-submitted list's second-tile tail). Worth a parallel report to
intel/compute-runtime.
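As an interim user-side workaround, the single-tile pinning from the table can be applied per rank before the GPU stack is loaded. The sketch below is not oneAPI.jl API: `tile_mask` and the `LOCAL_RANK` variable are hypothetical, and it assumes the COMPOSITE hierarchy (where a `<card>.<tile>` mask selects one sub-device) and that the mask is set before Level Zero initializes.

```julia
# Sketch, not oneAPI.jl API: map a node-local rank to one tile and export the
# corresponding ZE_AFFINITY_MASK. Defaults match the 6 cards x 2 tiles node
# described in the Environment section.
function tile_mask(local_rank::Integer; cards::Integer=6, tiles_per_card::Integer=2)
    slot = mod(local_rank, cards * tiles_per_card)   # wrap when oversubscribed
    card, tile = divrem(slot, tiles_per_card)
    return "$card.$tile"
end

# LOCAL_RANK is a hypothetical variable; substitute whatever the job launcher sets.
local_rank = parse(Int, get(ENV, "LOCAL_RANK", "0"))
ENV["ZE_AFFINITY_MASK"] = tile_mask(local_rank)
# using oneAPI   # load only after the mask is in place
```

Ranks 0 and 1 land on the two tiles of card 0, ranks 2 and 3 on card 1, and so on; with more than 12 ranks per node the mapping wraps around, so tiles are shared but no process spans two tiles.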
Proposed fix (oneAPI.jl side)
Default device() to a single sub-device (tile) rather than the COMPOSITE root device,
so the common path never uses implicit cross-tile scaling (which is also the canonical
one-rank-per-tile usage and has zero throughput cost). Spanning a whole card stays opt-in
by selecting a root device explicitly. Additionally, synchronize after the pageable
host→device copy (mirroring the existing device→host path).
I have this implemented and validated (both reproducers → 0 under the same 24-process
contention). Happy to open a PR. Does defaulting to sub-devices sound acceptable, or would
you prefer a documented opt-in / warning instead?