Migrate concurrent root LP to OMP#1252
Conversation
📝 WalkthroughWalkthroughAdds a manual CUDA-graph wrapper and migrates ping-pong/PDHG/feasibility/weighted-average/adaptive-step-size to use its run(...) API; refactors ping-pong parity cache; replaces std::atomic halt flags with omp_atomic_t and rewrites concurrent dispatcher to OpenMP tasks; adds MIP barrier thread gate and re-enables swath1 tests. ChangesCUDA-graph wrapper and PDLP concurrency changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1632-1647: The barrier raft::handle_t is only warming up cuDSS but
not eagerly initializing cuBLAS/cuSPARSE, so after constructing
barrier_handle_ptr (both in the multi-GPU branch and the single-GPU branch) call
the same init_handler(...) used elsewhere (pass barrier_handle_ptr.get()) to
force cuBLAS/cuSPARSE creation before warmup_cudss() and before dispatching the
barrier task; ensure init_handler is invoked on the newly created
barrier_handle_ptr in both branches.
In `@python/cuopt/cuopt/tests/linear_programming/test_incumbent_callbacks.py`:
- Around line 107-109: The re-enabled "/mip/swath1.mps" entry in the test
parameter list is high-risk for flakiness; gate it instead of enabling
unconditionally by either marking the test parameter as xfail (pytest.mark.xfail
with a clear issue reference) or skipping it behind a deterministic flag (e.g.,
an env var or pytest config) so CI stability is preserved; update the parameter
list(s) that include "/mip/swath1.mps" (the two occurrences at the shown diff
and the lines noted 118-120) to wrap that case with pytest.param(...,
marks=pytest.mark.xfail(reason="flaky: graph-capture/concurrency; see
ISSUE-XXXX")) or use pytest.skipif(os.getenv("RUN_FLAKY_TESTS")!="1",
reason=...), and add a one-line comment referencing the tracking issue.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 42fd537a-0b6c-4362-8b81-4d508cca40ba
📒 Files selected for processing (3)
cpp/src/mip_heuristics/mip_constants.hppcpp/src/pdlp/solve.cupython/cuopt/cuopt/tests/linear_programming/test_incumbent_callbacks.py
nguidotti
left a comment
There was a problem hiding this comment.
Looks good to me! Thanks for converting the remaining parts of the solver to OpenMP, Akif!
| std::exception_ptr dual_simplex_exception; | ||
| // TODO: concurrent_halt is std::atomic<int>* across ~25 sites in dual_simplex / barrier / | ||
| // branch_and_bound / pdlp. Could be converted to int* + `#pragma omp atomic` (all-or-nothing, | ||
| // mixing OMP atomic with non-atomic on the same var is UB). Functionally equivalent today. |
There was a problem hiding this comment.
We can use omp_atomic_t (a wrapper over the pragma omp atomic located at utilities/omp_helpers.hpp) instead std::atomic. I follow the style of the std::atomic, so they should be a drop-in replacement.
There was a problem hiding this comment.
I wanted to keep it less invasive as it touches many files. The TODO is for not touching 25 files now. I can give it a try to see how invasive it is.
There was a problem hiding this comment.
We have to be careful with changes to concurrent_halt. As it is also used inside cuDSS. So whatever changes we make need to be compatible with cuDSS.
There was a problem hiding this comment.
I think it should be equivalent to what we are doing with std::atomicversion. But I am not sure if OMP runtime does late submission of memory requests when the other read is not atomic. https://github.com/NVIDIA/cuopt/pull/1252/changes#diff-ff65045b18ca97b752289f52050884f6a91f1cd7b2dec84e2d30eaf539787f1dR290
What do you think @nguidotti ?
There was a problem hiding this comment.
For my understanding, the pragma omp atomic just signal for the compiler to use atomic instructions when reading/writing/updating that memory location. By default, OpenMP enforce a strong memory ordering (seq_cst). If you mix an atomic with a non-atomic operation and they happen at the same time, then you have a race condition.
There was a problem hiding this comment.
In theory, pragma omp atomic and std::atomic should emit similar atomic instructions, but I would avoid to mix them up. My recommendation is to use whatever cuDSS is using for concurrent_halt to avoid potential incompatibilities.
# Conflicts: # python/cuopt/cuopt/tests/linear_programming/test_incumbent_callbacks.py
The cuDSS warmup and eager barrier raft::handle_t construction on the main thread were added to keep cuBLAS / cuSPARSE / cuSolverDn / cuDSS first-use synchronous init from invalidating PDLP's CUDA graph capture. manual_cuda_graph_t (from cuda_graph_side_capture) now recovers from cudaErrorStreamCaptureInvalidated by re-running the captured work eagerly, so the preflight is no longer needed. Move barrier handle construction back inside the barrier OMP task body, matching the pre-warmup pattern. Destruction stays at parent scope (post-taskgroup join) so cublasDestroy → cudaDeviceSynchronize doesn't fire during a capture.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cpp/src/pdlp/solve.cu (2)
1530-1535:⚠️ Potential issue | 🟠 Major | ⚡ Quick winCap stand-alone LP OMP team to available thread budget.
For stand-alone LP,
enable_barrieris always true, so the code always requests 3 workers vianum_workers = 1 + 1 + 1. Whenomp_get_max_threads()is 1 or 2, the#pragma omp parallel num_threads(3)directive at line 1650 violates the intended thread budget.Apply the fix to respect the configured thread limit:
Fix
- const int num_workers = 1 + (settings.inside_mip ? 0 : 1) + (enable_barrier ? 1 : 0); + const int requested_workers = + 1 + (settings.inside_mip ? 0 : 1) + (enable_barrier ? 1 : 0); + const int num_workers = std::min(available_threads, requested_workers);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/pdlp/solve.cu` around lines 1530 - 1535, The current logic always requests 3 threads for stand-alone LP because enable_barrier evaluates true, which can exceed the OpenMP thread budget; update the code that computes num_workers (the value passed to `#pragma omp parallel num_threads(...)`) to cap it by available_threads so it never requests more threads than available (e.g., set num_workers = min(needed_workers, available_threads) where needed_workers is 3 when enable_barrier is true and 1+1 when not), using the existing symbols available_threads, enable_barrier and num_workers so the parallel region no longer asks for num_threads > available_threads.
1591-1605:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winMove the device-0 stream synchronization before creating the barrier task.
Line 1602 synchronizes
problem.handle_ptrfrom inside the spawned barrier task. This causes an unintended serialization: when the dispatcher runsrun_pdlp()immediately after spawning the task, it queues work to that stream. If the barrier task starts later, itssync_stream()call blocks waiting for PDLP to complete, defeating concurrent execution on the multi-GPU path. Moving the sync to the dispatcher thread before task creation ensures GPU work is ordered correctly without blocking the barrier's independent stream.🔧 Minimal fix
- if (enable_barrier) { + if (enable_barrier) { + if (settings.num_gpus > 1) { problem.handle_ptr->sync_stream(); } `#pragma` omp task default(shared) { try { auto call_barrier_thread = [&]() { rmm::cuda_stream_view barrier_stream = rmm::cuda_stream_per_thread; barrier_handle_ptr = std::make_unique<raft::handle_t>(barrier_stream); auto barrier_problem = dual_simplex_problem; barrier_problem.handle_ptr = barrier_handle_ptr.get(); run_barrier_thread<i_t, f_t>(barrier_problem, settings_pdlp, sol_barrier_ptr, timer); }; if (settings.num_gpus > 1) { - problem.handle_ptr->sync_stream(); raft::device_setter device_setter(1); // Scoped variable CUOPT_LOG_DEBUG("Barrier device: %d", device_setter.get_current_device()); call_barrier_thread(); } else {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/pdlp/solve.cu` around lines 1591 - 1605, The spawned barrier task currently calls problem.handle_ptr->sync_stream() inside the task, which can serialize GPU work; move that sync out of the task and perform problem.handle_ptr->sync_stream() in the dispatcher thread before creating the OpenMP task (i.e., before the if (settings.num_gpus > 1) block that constructs device_setter and calls call_barrier_thread()). Keep the rest of the task body intact (creation of barrier_handle_ptr, barrier_problem.handle_ptr assignment, run_barrier_thread<i_t,f_t>(), and use of raft::device_setter inside the task) so the barrier uses its own stream without blocking PDLP work.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/src/pdlp/pdhg.cu`:
- Line 1060: The ping-pong graph cache currently keys only on should_major but
must also account for buffer parity, so graph_all.run (and ping_pong_graph_t)
can launch against stale primal/dual buffers; update the cache strategy to use
four distinct slots (major/non-major × even/odd) or switch to separate major and
minor ping-pong caches keyed by total_pdlp_iterations (or another monotonic
iteration counter) plus should_major to distinguish buffer parity, and modify
graph_all.run and the ping_pong_graph_t lookup/insert logic to include the
parity bit (e.g., iteration % 2 or total_pdlp_iterations parity) when selecting
the cached executable so each unique (should_major, parity) pair maps to its own
cached executable.
---
Outside diff comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1530-1535: The current logic always requests 3 threads for
stand-alone LP because enable_barrier evaluates true, which can exceed the
OpenMP thread budget; update the code that computes num_workers (the value
passed to `#pragma omp parallel num_threads(...)`) to cap it by
available_threads so it never requests more threads than available (e.g., set
num_workers = min(needed_workers, available_threads) where needed_workers is 3
when enable_barrier is true and 1+1 when not), using the existing symbols
available_threads, enable_barrier and num_workers so the parallel region no
longer asks for num_threads > available_threads.
- Around line 1591-1605: The spawned barrier task currently calls
problem.handle_ptr->sync_stream() inside the task, which can serialize GPU work;
move that sync out of the task and perform problem.handle_ptr->sync_stream() in
the dispatcher thread before creating the OpenMP task (i.e., before the if
(settings.num_gpus > 1) block that constructs device_setter and calls
call_barrier_thread()). Keep the rest of the task body intact (creation of
barrier_handle_ptr, barrier_problem.handle_ptr assignment,
run_barrier_thread<i_t,f_t>(), and use of raft::device_setter inside the task)
so the barrier uses its own stream without blocking PDLP work.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 0873e5b9-5498-4249-baf0-88019e9ff442
📒 Files selected for processing (10)
cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cucpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cuhcpp/src/pdlp/pdhg.cucpp/src/pdlp/restart_strategy/weighted_average_solution.cucpp/src/pdlp/solve.cucpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.cucpp/src/pdlp/utilities/ping_pong_graph.cucpp/src/pdlp/utilities/ping_pong_graph.cuhcpp/src/utilities/manual_cuda_graph.cuhcpp/tests/mip/incumbent_callback_test.cu
💤 Files with no reviewable changes (1)
- cpp/src/pdlp/utilities/ping_pong_graph.cu
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/include/cuopt/linear_programming/pdlp/solver_settings.hpp`:
- Around line 24-31: pdlp.cu dereferences settings_.concurrent_halt->load() but
only has a forward declaration of template omp_atomic_t from
solver_settings.hpp; add an explicit include of the header that defines
omp_atomic_t so the member functions are available. In practice, add `#include`
"utilities/omp_helpers.hpp" at the top of cpp/src/pdlp/pdlp.cu (before any use
of settings_.concurrent_halt or inclusion of pdlp.cuh) so omp_atomic_t<int> is
fully defined when calling settings_.concurrent_halt->load().
In `@cpp/src/pdlp/solve.cu`:
- Line 331: global_concurrent_halt is a namespace-scope shared flag that causes
interference between simultaneous run_concurrent() calls; change it to a
per-invocation flag created inside run_concurrent() (e.g., a local
omp_atomic_t<int> or an allocated instance) and pass that instance to any helper
functions that currently read or write global_concurrent_halt (replace
references to global_concurrent_halt in run_concurrent() and the other affected
sites, including the occurrences noted around the later uses), then remove the
namespace-scope global to ensure each concurrent solve has its own stop flag.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8a34403d-41bc-41c5-875a-71e6a035cfc5
📒 Files selected for processing (11)
cpp/include/cuopt/linear_programming/pdlp/solver_settings.hppcpp/src/barrier/sparse_cholesky.cuhcpp/src/branch_and_bound/branch_and_bound.hppcpp/src/branch_and_bound/pseudo_costs.cppcpp/src/dual_simplex/simplex_solver_settings.hppcpp/src/mip_heuristics/diversity/diversity_manager.cuhcpp/src/mip_heuristics/relaxed_lp/relaxed_lp.cuhcpp/src/pdlp/pdhg.cucpp/src/pdlp/pdhg.hppcpp/src/pdlp/pdlp.cucpp/src/pdlp/solve.cu
|
@akifcorduk is this needed for 26.06? |
|
@rg20 to fully obey thread count, it would be very beneficial. It was a small bug before as we don't obey the low thread count. Performance wise, functionality wise, it is not a big deal i think. |
|
Replaced by #1291, which only includes the code changes related with the OpenMP migration. |
This PR cleans #1252, so only the changes related to the OpenMP are present. ## Original description: `run_concurrent` in `cpp/src/pdlp/solve.cu` now dispatches the barrier and dual simplex workers as `#pragma omp task` inside a `#pragma omp taskgroup` instead of raw `std::thread`. PDLP still runs synchronously on the dispatching thread. MIP path (`omp_in_parallel()`): reuses the upstream `solve_mip` OMP team. Barrier and dual simplex now consume slots from the configured `num_cpu_threads` budget instead of spawning extra OS threads outside it. Stand-alone LP path: stands up a local `#pragma omp parallel + single` with the right number of workers. This PR also removes the confuscated `std::future` logic on the barrier on other PR. Authors: - Nicolas L. Guidotti (https://github.com/nguidotti) - Akif ÇÖRDÜK (https://github.com/akifcorduk) Approvers: - Alice Boucher (https://github.com/aliceb-nv) URL: #1291
run_concurrent in
cpp/src/pdlp/solve.cunow dispatches the barrier and dual simplex workers as #pragma omp task inside a #pragma omp taskgroup instead of rawstd::thread. PDLP still runs synchronously on the dispatching thread.MIP path (omp_in_parallel()): reuses the upstream solve_mip OMP team. Barrier and dual simplex now consume slots from the configured
num_cpu_threadsbudget instead of spawning extra OS threads outside it.Stand-alone LP path: stands up a local
#pragma omp parallel + singlewith the right number of workers.Builds on #1250 , merge after that. This PR also removes the confuscated std::future logic on the barrier on other PR.