Silent data corruption (dropped tail of an async copy/kernel) under multi-process contention on multi-tile (PVC) devices #576

@michel2323

Summary

On Intel Data Center GPU Max (PVC) under the COMPOSITE device hierarchy, running
many processes that share one card intermittently produces silent data corruption:
the tail (last element / last work-item) of an asynchronously submitted operation is
dropped, even though a synchronize happens before the result is read on the host. It has
never been observed in a single-process run (200k+ clean round-trips) and appears only
under heavy multi-process oversubscription of a single tile/card.

Root cause (established experimentally, see below): the default device is the COMPOSITE
root device, which spans both tiles of the card, so operations use implicit cross-tile
(multi-stack) scaling. Under contention, the portion of the op handled by the second
tile fails to retire, dropping the tail. Pinning to a single tile eliminates it.

Environment

  • Intel Data Center GPU Max 1550 (PVC), 6 cards × 2 tiles, Aurora
  • ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
  • Julia 1.12, oneAPI.jl 2.6.x, NEO (LTS) 25.x
  • Failure requires ~12–24 processes contending on one card + allocation/GC churn

Symptoms

A — dropped tail of a host→device copy. A round-trip Array → oneArray → Array of a
10-element Int32 vector returns with the last 2 elements zeroed (32 of 40 bytes
delivered; 0 is the fresh device buffer's initial content).

B — dropped tail of a kernel (no host memory involved). B = A + J (a
UniformScaling) runs as similar, then copyto!, then a diagonal kernel; the result is
missing +J on exactly B[n,n]: the kernel's last work-item never applied its
read-modify-write. B[n,n] reads back as the copied value, so the buffer was intact and
only the work was dropped.

In both cases the dropped data is always the last element/work-item.
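
To check that a given mismatch really fits this "last element only" pattern, a small host-side helper can classify it. This is a minimal sketch in plain Julia; `dropped_tail` is a hypothetical name and not part of oneAPI.jl, and it compares exactly rather than with a tolerance.

```julia
# Hypothetical diagnostic helper (not part of oneAPI.jl): classify a mismatch
# between the expected host data and what was read back from the device.
# Returns 0 for a clean result, k if exactly the last k linear indices differ,
# and `nothing` if the mismatches are not a contiguous tail.
function dropped_tail(expected::AbstractArray, got::AbstractArray)
    size(expected) == size(got) || throw(DimensionMismatch("shapes differ"))
    bad = findall(vec(expected) .!= vec(got))   # linear indices that differ
    isempty(bad) && return 0
    n = length(expected)
    return bad == (n - length(bad) + 1):n ? length(bad) : nothing
end

# Simulated Symptom A pattern: last 2 of 10 Int32 values come back as zero.
expected = Int32.(1:10)
got = copy(expected); got[9:10] .= 0
k = dropped_tail(expected, got)
delivered = (length(expected) - k) * sizeof(Int32)
println("dropped $k elements; $delivered of $(sizeof(expected)) bytes delivered")
```

Applied to Symptom B, the same helper would report a single dropped element when only B[n,n] differs, since B[n,n] is the last linear index of the matrix.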

Minimal reproducers

# Symptom A — host<->device round-trip under pool churn
using oneAPI, Random
function churn(; iters=200)
    fails = 0; Random.seed!(1); keep = Vector{Any}(undef, 8)
    for i in 1:iters
        for (j, n) in enumerate((3, 7, 16, 64, 100, 256, 9, 33))
            keep[j] = oneArray(rand(Int32, n))          # churn the pool
        end
        Bc = rand(Int32, 10)
        B  = oneArray(Bc)                                # H2D (async)
        Array(B) != Bc && (fails += 1)                  # D2H (synchronizes)
    end
    println("pid=$(getpid()) fails=$fails/$iters")
end
churn()

# Symptom B — A + UniformScaling
using oneAPI, Random, LinearAlgebra
function churn(; iters=200)
    eltypes = (ComplexF32, Float32)
    wrappers = (identity, UnitLowerTriangular, UnitUpperTriangular,
                LowerTriangular, UpperTriangular, Hermitian, Symmetric)
    Random.seed!(1); keep = Vector{Any}(undef, 8); fails = 0
    for i in 1:iters
        for (j, n) in enumerate((3, 7, 16, 64, 100, 256, 9, 33))
            keep[j] = oneArray(rand(Float32, n))
        end
        for T1 in eltypes, T2 in eltypes, f in wrappers
            x = ones(T1, 5, 5); y = oneArray(x)
            J = one(T2) * I
            host = oneAPI.@allowscalar collect(f(x) + J)
            gpu  = oneAPI.@allowscalar collect(f(y) + J)
            !(gpu ≈ host) && (fails += 1)
        end
    end
    println("pid=$(getpid()) fails=$fails/$iters")
end
churn()

Run many processes on one card (no affinity mask → default COMPOSITE root device):

for i in $(seq 1 24); do julia --project mwe.jl & done; wait
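
The same launch can also be scripted from Julia, which makes it easy to switch between the unmasked run and a single-tile run. The sketch below only builds the commands; `contention_cmds` is a hypothetical helper, and the mask value "0.0" (card 0, tile 0) is an illustrative assumption that depends on the node layout.

```julia
# Build (but do not run) n launch commands for the reproducer script.
# `script` and the mask value are illustrative; adjust them to the system.
function contention_cmds(n::Integer; script="mwe.jl", mask=nothing)
    base = `$(Base.julia_cmd()) --project $script`
    # With no mask, each process sees the default COMPOSITE root device.
    cmd = mask === nothing ? base : addenv(base, "ZE_AFFINITY_MASK" => mask)
    return [cmd for _ in 1:n]
end

pinned = contention_cmds(24; mask="0.0")
println(length(pinned), " commands; mask set: ",
        "ZE_AFFINITY_MASK=0.0" in pinned[1].env)

# To actually launch them concurrently and wait for all to finish:
# procs = [run(c; wait=false) for c in pinned]
# foreach(wait, procs)
```

Calling `contention_cmds(24)` without a mask gives the equivalent of the shell loop above.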

Evidence / root cause

24 processes, 40 iterations each, on one card:

Config                                                         Symptom A        Symptom B
default (root device, both tiles, implicit scaling)            872 mismatches   6307 mismatches
ZE_AFFINITY_MASK=<dev>.0 (single tile)                         0                0
synchronize after every execute! (per-submission completion)   0                0

  • Single-process never fails → trigger is multi-process oversubscription, not a
    logic/eltype bug.
  • Pinning to a single tile fixes both → the trigger is implicit cross-tile scaling on
    the root device.
  • A per-submission synchronize also fixes both, but serializes all GPU work
    (~3.3× slower in the B reproducer: 46 s → 153 s) — so it is not a good global default.
  • Disproven: command-list lifetime (retaining all ZeCommandLists fixes nothing); a
    free-from-finalizer race (the victim buffer is provably alive — B reads back the copied
    value). BLOCKING_FREE blocks per spec.
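
For scale, the counts can be turned into rough rates. This is a back-of-the-envelope sketch: it assumes the mismatch totals are the per-process `fails` counters summed over all 24 processes, and that the Symptom B reproducer performs 2 × 2 × 7 comparisons per iteration, as its loops suggest.

```julia
# Figures are taken from the table and bullets above. Summing `fails` across
# processes is an assumption about how the totals were obtained.
procs, iters = 24, 40
checks_A = procs * iters                 # one round-trip check per iteration
checks_B = procs * iters * 2 * 2 * 7     # T1 x T2 x wrappers per iteration
rate_A = 872 / checks_A
rate_B = 6307 / checks_B
slowdown = 153 / 46                      # cost of per-submission synchronize
println("A: ", round(100 * rate_A; digits=1), "%  B: ",
        round(100 * rate_B; digits=1), "%  slowdown: ",
        round(slowdown; digits=1), "x")
```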

This looks like a NEO / Level-Zero implicit-scaling completion issue under single-CCS
multi-process oversubscription (a whole-queue zeCommandQueueSynchronize does not reliably
cover an earlier separately-submitted list's second-tile tail). Worth a parallel report to
intel/compute-runtime.

Proposed fix (oneAPI.jl side)

Default device() to a single sub-device (tile) rather than the COMPOSITE root device,
so the common path never uses implicit cross-tile scaling (which is also the canonical
one-rank-per-tile usage and has zero throughput cost). Spanning a whole card stays opt-in
by selecting a root device explicitly. Additionally, synchronize after the pageable
host→device copy (mirroring the existing device→host path).

I have this implemented and validated (both reproducers → 0 under the same 24-process
contention). Happy to open a PR. Does defaulting to sub-devices sound acceptable, or would
you prefer a documented opt-in / warning instead?
