# Silent data corruption (dropped tail of an async copy/kernel) under multi-process contention on multi-tile (PVC) devices

## Summary

On Intel Data Center GPU Max (PVC) under the **COMPOSITE** device hierarchy, running
**many processes that share one card** intermittently produces **silent data corruption**:
the **tail** (last element / last work-item) of an asynchronously-submitted operation is
dropped, even though a `synchronize` happens before the result is read on the host. It has
**never** been observed in a single process (200k+ clean round-trips) and appears only under
heavy multi-process oversubscription of a single tile/card.

Root cause (established experimentally, see below): the **default device is the COMPOSITE
root device, which spans both tiles** of the card, so operations use **implicit cross-tile
(multi-stack) scaling**. Under contention, the portion of the op handled by the second
tile fails to retire, dropping the tail. **Pinning to a single tile eliminates it.**

## Environment

- Intel Data Center GPU Max 1550 (PVC), 6 cards × 2 tiles, Aurora
- `ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE`
- Julia 1.12, oneAPI.jl 2.6.x, NEO (LTS) 25.x
- Failure requires ~12–24 processes contending on one card + allocation/GC churn

## Symptoms

**A — dropped tail of a host→device copy.** A round-trip `Array → oneArray → Array` of a
10-element `Int32` vector returns with the **last 2 elements zeroed** (32 of 40 bytes
delivered; 0 is the fresh device buffer's initial content).

**B — dropped tail of a kernel (no host memory involved).** `B = A + J` (a
`UniformScaling`) is `similar`/`copyto!`/diagonal-kernel; the result is missing `+J` on
**exactly `B[n,n]`** — the kernel's last work-item never applied its read-modify-write.
`B[n,n]` reads back as the *copied* value, so the buffer was intact; the work was dropped.

In both cases the dropped data is always at the **tail**: the final elements of the copy or
the final work-item of the kernel.
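
To check that a mismatch really is a dropped tail rather than scattered corruption, a small host-side classifier can help. The following is a minimal sketch in plain Julia; `tail_report` is a hypothetical helper (not part of oneAPI.jl), and the example data only mimics Symptom A (last 2 of 10 `Int32` elements zeroed).

```julia
# Hypothetical helper, not part of oneAPI.jl: compare expected host data with
# what was read back and report whether the damage is confined to the tail.
function tail_report(expected::AbstractVector, got::AbstractVector)
    length(expected) == length(got) || throw(DimensionMismatch("lengths differ"))
    bad = findall(expected .!= got)          # indices of mismatching elements
    if isempty(bad)
        return (ok = true, first_bad = nothing, tail_only = false,
                bytes_ok = sizeof(expected))
    end
    n = length(expected)
    # "Tail only": the mismatches form one contiguous run ending at the last element.
    tail_only = bad == collect(first(bad):n)
    # Bytes that match before the first mismatching element.
    bytes_ok = (first(bad) - 1) * sizeof(eltype(expected))
    return (ok = false, first_bad = first(bad), tail_only = tail_only,
            bytes_ok = bytes_ok)
end

# Synthetic data shaped like Symptom A (assumption: illustrative values only).
expected = Int32.(1:10)
got = copy(expected); got[9:10] .= 0
r = tail_report(expected, got)
println(r)
```

In reproducer A, calling such a helper on `(Bc, Array(B))` whenever they differ would record where each failure starts, which makes it easy to confirm the "32 of 40 bytes" pattern across processes.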

## Minimal reproducers

```julia
# Symptom A — host<->device round-trip under pool churn
using oneAPI, Random
function churn(; iters=200)
    fails = 0; Random.seed!(1); keep = Vector{Any}(undef, 8)
    for i in 1:iters
        for (j, n) in enumerate((3, 7, 16, 64, 100, 256, 9, 33))
            keep[j] = oneArray(rand(Int32, n))          # churn the pool
        end
        Bc = rand(Int32, 10)
        B  = oneArray(Bc)                                # H2D (async)
        Array(B) != Bc && (fails += 1)                  # D2H (synchronizes)
    end
    println("pid=$(getpid()) fails=$fails/$iters")
end
churn()
```

```julia
# Symptom B — A + UniformScaling
using oneAPI, Random, LinearAlgebra
function churn(; iters=200)
    eltypes = (ComplexF32, Float32)
    wrappers = (identity, UnitLowerTriangular, UnitUpperTriangular,
                LowerTriangular, UpperTriangular, Hermitian, Symmetric)
    Random.seed!(1); keep = Vector{Any}(undef, 8); fails = 0
    for i in 1:iters
        for (j, n) in enumerate((3, 7, 16, 64, 100, 256, 9, 33))
            keep[j] = oneArray(rand(Float32, n))
        end
        for T1 in eltypes, T2 in eltypes, f in wrappers
            x = ones(T1, 5, 5); y = oneArray(x)
            J = one(T2) * I
            host = oneAPI.@allowscalar collect(f(x) + J)
            gpu  = oneAPI.@allowscalar collect(f(y) + J)
            !(gpu ≈ host) && (fails += 1)
        end
    end
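    # fails is counted per (T1, T2, wrapper) combination, so report it against the total number of checks
    println("pid=$(getpid()) fails=$fails/$(iters * length(eltypes)^2 * length(wrappers))")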
end
churn()
```

Run many processes on one card (no affinity mask → default COMPOSITE root device):

```bash
for i in $(seq 1 24); do julia --project mwe.jl & done; wait
```

## Evidence / root cause

24 processes, 40 iterations each, on one card:

| Config | Symptom A | Symptom B |
|---|---|---|
| **default** (root device, both tiles, implicit scaling) | **872** mismatches | **6307** mismatches |
| `ZE_AFFINITY_MASK=<dev>.0` (single tile) | **0** | **0** |
| `synchronize` after every `execute!` (per-submission completion) | **0** | **0** |

- **Single-process never fails** → trigger is multi-process oversubscription, not a
  logic/eltype bug.
- **Pinning to a single tile fixes both** → the trigger is implicit cross-tile scaling on
  the root device.
- **A per-submission `synchronize` also fixes both**, but serializes all GPU work
  (~3.3× slower in the B reproducer: 46 s → 153 s) — so it is *not* a good global default.
- Ruled out: command-list lifetime (retaining all `ZeCommandList`s fixes nothing) and a
  free-from-finalizer race (the victim buffer is provably alive, since B reads back the
  copied value). `BLOCKING_FREE` blocks per spec.
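
The single-tile configuration from the table can be driven from Julia as well as from the shell. The sketch below is one way to do it, under stated assumptions: `pinned_cmd` and `run_contended` are hypothetical helper names, `mwe.jl` is the reproducer script, and device index `0` stands in for `<dev>`. It relies only on `Base` (`addenv`, `pipeline`, `run`) and inherits the rest of the environment, including `ZE_FLAT_DEVICE_HIERARCHY`.

```julia
# Hypothetical launcher: build a command that runs a reproducer pinned to one tile.
function pinned_cmd(script::AbstractString; device::Integer = 0, tile::Integer = 0)
    base = `$(Base.julia_cmd()) --project $script`
    # ZE_AFFINITY_MASK="<dev>.<tile>" exposes only that sub-device to the process.
    return addenv(base, "ZE_AFFINITY_MASK" => "$device.$tile")
end

# Start `nprocs` copies concurrently and wait for all of them.
function run_contended(script::AbstractString; nprocs::Integer = 24, kwargs...)
    cmd = pinned_cmd(script; kwargs...)
    # With wait=false, child output goes to devnull unless redirected explicitly.
    procs = [run(pipeline(cmd; stdout = stdout, stderr = stderr); wait = false)
             for _ in 1:nprocs]
    foreach(wait, procs)
    return count(success, procs)   # number of workers that exited with status 0
end

cmd = pinned_cmd("mwe.jl")
println(cmd)
```

Calling `run_contended("mwe.jl")` would then launch 24 pinned workers; dropping the `addenv` call gives the unpinned (root device) baseline for comparison.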

This looks like a NEO / Level-Zero implicit-scaling completion issue under single-CCS
multi-process oversubscription (a whole-queue `zeCommandQueueSynchronize` does not reliably
cover an earlier separately-submitted list's second-tile tail). Worth a parallel report to
[intel/compute-runtime](https://github.com/intel/compute-runtime).

## Proposed fix (oneAPI.jl side)

Default `device()` to a **single sub-device (tile)** rather than the COMPOSITE root device,
so the common path never uses implicit cross-tile scaling. A single-tile default also
matches the canonical one-rank-per-tile usage and has zero throughput cost. Spanning a
whole card stays opt-in by selecting a root device explicitly. Additionally, call
`synchronize` after the pageable host→device copy (mirroring the existing device→host path).

I have this implemented and validated (both reproducers → 0 under the same 24-process
contention). Happy to open a PR. Does defaulting to sub-devices sound acceptable, or would
you prefer a documented opt-in / warning instead?
