Skip to content

Migrate concurrent root LP to OMP#1252

Closed
akifcorduk wants to merge 16 commits into
NVIDIA:release/26.06from
akifcorduk:omp_migrate_concurrent_root
Closed

Migrate concurrent root LP to OMP#1252
akifcorduk wants to merge 16 commits into
NVIDIA:release/26.06from
akifcorduk:omp_migrate_concurrent_root

Conversation

@akifcorduk
Copy link
Copy Markdown
Contributor

@akifcorduk akifcorduk commented May 20, 2026

run_concurrent in cpp/src/pdlp/solve.cu now dispatches the barrier and dual simplex workers as #pragma omp task inside a #pragma omp taskgroup instead of raw std::thread. PDLP still runs synchronously on the dispatching thread.

MIP path (omp_in_parallel()): reuses the upstream solve_mip OMP team. Barrier and dual simplex now consume slots from the configured num_cpu_threads budget instead of spawning extra OS threads outside it.
Stand-alone LP path: stands up a local #pragma omp parallel + single with the right number of workers.

Builds on #1250 , merge after that. This PR also removes the confuscated std::future logic on the barrier on other PR.

@akifcorduk akifcorduk added this to the 26.06 milestone May 20, 2026
@akifcorduk akifcorduk requested review from a team as code owners May 20, 2026 01:01
@akifcorduk akifcorduk added the non-breaking Introduces a non-breaking change label May 20, 2026
@akifcorduk akifcorduk requested a review from Iroy30 May 20, 2026 01:01
@akifcorduk akifcorduk added improvement Improves an existing functionality P0 labels May 20, 2026
@akifcorduk akifcorduk requested review from Bubullzz and Kh4ster May 20, 2026 01:01
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 20, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

Adds a manual CUDA-graph wrapper and migrates ping-pong/PDHG/feasibility/weighted-average/adaptive-step-size to use its run(...) API; refactors ping-pong parity cache; replaces std::atomic halt flags with omp_atomic_t and rewrites concurrent dispatcher to OpenMP tasks; adds MIP barrier thread gate and re-enables swath1 tests.

Changes

CUDA-graph wrapper and PDLP concurrency changes

Layer / File(s) Summary
manual CUDA-graph wrapper
cpp/src/utilities/manual_cuda_graph.cuh
Add cuopt::manual_cuda_graph_t with run(stream, work), lazy capture/instantiation, capture-invalidation recovery, reset(), is_initialized(), and RAII capture guard.
Ping-pong parity-run API
cpp/src/pdlp/utilities/ping_pong_graph.cuh
Refactor ping_pong_graph_t to use manual_cuda_graph_t even_graph_/odd_graph_ and provide run(total_pdlp_iterations, work) parity dispatch (eager in debug/legacy modes).
Remove old ping-pong CU implementations
cpp/src/pdlp/utilities/ping_pong_graph.cu
Remove destructor and manual capture/launch implementations and delete now-unused includes.
Feasibility-jump: use step_graph_
cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cuh, .../feasibility_jump.cu
Replace raw cudaGraphExec_t/graph_created with manual_cuda_graph_t step_graph_; encapsulate step body in lambda and call step_graph_.run(stream, step_body) or run eagerly.
Weighted-average & adaptive step-size
cpp/src/pdlp/restart_strategy/weighted_average_solution.cu, cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.cu
Wrap graphed device work in graph.run(total_pdlp_iterations, [&]{ ... }) and remove manual capture/launch branching.
PDHG graph-run wrappers & non-major cache
cpp/src/pdlp/pdhg.hpp, cpp/src/pdlp/pdhg.cu
Add graph_all_non_major and get_graph_all_non_major(); move graphed compute regions into graph_all.run(...) / graph_all_non_major.run(...) lambdas, removing explicit capture lifecycle code.
PDLP graph-cache reset
cpp/src/pdlp/pdlp.cu
Reset both ping-pong graph caches (graph_all and graph_all_non_major) when batch dims change.
PDLP concurrent dispatcher → OpenMP tasks
cpp/src/pdlp/solve.cu
Replace std::thread concurrent implementation with OpenMP taskgroup/task dispatcher, switch global_concurrent_halt to omp_atomic_t<int>, derive available threads and compute enable_barrier, capture/propagate subsolver exceptions, and produce sentinel barrier result when skipped.
Atomic -> omp_atomic_t migration
cpp/include/cuopt/linear_programming/pdlp/solver_settings.hpp, cpp/src/dual_simplex/simplex_solver_settings.hpp, cpp/src/branch_and_bound/*, cpp/src/branch_and_bound/pseudo_costs.cpp, cpp/src/mip_heuristics/{diversity,relaxed_lp}/*.cuh, cpp/src/barrier/sparse_cholesky.cuh
Add forward-declared cuopt::omp_atomic_t<T> and migrate concurrent_halt/global halt types and parameter signatures from std::atomic<int> to cuopt::omp_atomic_t<int>; update includes/usages and pass raw underlying int* where required.
MIP concurrent gate and incumbent test re-enable
cpp/src/mip_heuristics/mip_constants.hpp, python/cuopt/cuopt/tests/linear_programming/test_incumbent_callbacks.py, cpp/tests/mip/incumbent_callback_test.cu
Add CUOPT_CONCURRENT_LP_BARRIER_REQUIRED_THREAD_COUNT (3) and remove skips so /mip/swath1.mps runs in both Python and C++ incumbent-callback tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • NVIDIA/cuopt#1208: Related to re-enabling vs disabling mip/swath1.mps incumbent-callback tests (opposite test parametrization changes).
  • NVIDIA/cuopt#1206: Overlaps PDLP concurrent dispatcher and exception/worker-thread handling changes in cpp/src/pdlp/solve.cu.
  • NVIDIA/cuopt#602: Touches branch-and-bound concurrent halt coordination and root LP concurrency used by MIP.

Suggested labels

pdlp, mip

Suggested reviewers

  • Kh4ster
  • Iroy30
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Migrate concurrent root LP to OMP' clearly and specifically describes the main change: moving the concurrent root LP implementation from std::thread to OpenMP pragmas.
Description check ✅ Passed The description provides meaningful context about the changes, explaining the new OMP-based dispatch for barrier and dual simplex workers, the different behavior for MIP vs stand-alone LP paths, and noting dependencies on PR #1250.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1632-1647: The barrier raft::handle_t is only warming up cuDSS but
not eagerly initializing cuBLAS/cuSPARSE, so after constructing
barrier_handle_ptr (both in the multi-GPU branch and the single-GPU branch) call
the same init_handler(...) used elsewhere (pass barrier_handle_ptr.get()) to
force cuBLAS/cuSPARSE creation before warmup_cudss() and before dispatching the
barrier task; ensure init_handler is invoked on the newly created
barrier_handle_ptr in both branches.

In `@python/cuopt/cuopt/tests/linear_programming/test_incumbent_callbacks.py`:
- Around line 107-109: The re-enabled "/mip/swath1.mps" entry in the test
parameter list is high-risk for flakiness; gate it instead of enabling
unconditionally by either marking the test parameter as xfail (pytest.mark.xfail
with a clear issue reference) or skipping it behind a deterministic flag (e.g.,
an env var or pytest config) so CI stability is preserved; update the parameter
list(s) that include "/mip/swath1.mps" (the two occurrences at the shown diff
and the lines noted 118-120) to wrap that case with pytest.param(...,
marks=pytest.mark.xfail(reason="flaky: graph-capture/concurrency; see
ISSUE-XXXX")) or use pytest.skipif(os.getenv("RUN_FLAKY_TESTS")!="1",
reason=...), and add a one-line comment referencing the tracking issue.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 42fd537a-0b6c-4362-8b81-4d508cca40ba

📥 Commits

Reviewing files that changed from the base of the PR and between 415b12d and a74e830.

📒 Files selected for processing (3)
  • cpp/src/mip_heuristics/mip_constants.hpp
  • cpp/src/pdlp/solve.cu
  • python/cuopt/cuopt/tests/linear_programming/test_incumbent_callbacks.py

Comment thread cpp/src/pdlp/solve.cu Outdated
Copy link
Copy Markdown
Contributor

@nguidotti nguidotti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Thanks for converting the remaining parts of the solver to OpenMP, Akif!

Comment thread cpp/src/pdlp/solve.cu Outdated
std::exception_ptr dual_simplex_exception;
// TODO: concurrent_halt is std::atomic<int>* across ~25 sites in dual_simplex / barrier /
// branch_and_bound / pdlp. Could be converted to int* + `#pragma omp atomic` (all-or-nothing,
// mixing OMP atomic with non-atomic on the same var is UB). Functionally equivalent today.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use omp_atomic_t (a wrapper over the pragma omp atomic located at utilities/omp_helpers.hpp) instead std::atomic. I follow the style of the std::atomic, so they should be a drop-in replacement.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to keep it less invasive as it touches many files. The TODO is for not touching 25 files now. I can give it a try to see how invasive it is.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to be careful with changes to concurrent_halt. As it is also used inside cuDSS. So whatever changes we make need to be compatible with cuDSS.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be equivalent to what we are doing with std::atomicversion. But I am not sure if OMP runtime does late submission of memory requests when the other read is not atomic. https://github.com/NVIDIA/cuopt/pull/1252/changes#diff-ff65045b18ca97b752289f52050884f6a91f1cd7b2dec84e2d30eaf539787f1dR290

What do you think @nguidotti ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding, the pragma omp atomic just signal for the compiler to use atomic instructions when reading/writing/updating that memory location. By default, OpenMP enforce a strong memory ordering (seq_cst). If you mix an atomic with a non-atomic operation and they happen at the same time, then you have a race condition.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory, pragma omp atomic and std::atomic should emit similar atomic instructions, but I would avoid to mix them up. My recommendation is to use whatever cuDSS is using for concurrent_halt to avoid potential incompatibilities.

# Conflicts:
#	python/cuopt/cuopt/tests/linear_programming/test_incumbent_callbacks.py
The cuDSS warmup and eager barrier raft::handle_t construction on the
main thread were added to keep cuBLAS / cuSPARSE / cuSolverDn / cuDSS
first-use synchronous init from invalidating PDLP's CUDA graph capture.

manual_cuda_graph_t (from cuda_graph_side_capture) now recovers from
cudaErrorStreamCaptureInvalidated by re-running the captured work
eagerly, so the preflight is no longer needed. Move barrier handle
construction back inside the barrier OMP task body, matching the
pre-warmup pattern. Destruction stays at parent scope (post-taskgroup
join) so cublasDestroy → cudaDeviceSynchronize doesn't fire during a
capture.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cpp/src/pdlp/solve.cu (2)

1530-1535: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Cap stand-alone LP OMP team to available thread budget.

For stand-alone LP, enable_barrier is always true, so the code always requests 3 workers via num_workers = 1 + 1 + 1. When omp_get_max_threads() is 1 or 2, the #pragma omp parallel num_threads(3) directive at line 1650 violates the intended thread budget.

Apply the fix to respect the configured thread limit:

Fix
-    const int num_workers = 1 + (settings.inside_mip ? 0 : 1) + (enable_barrier ? 1 : 0);
+    const int requested_workers =
+      1 + (settings.inside_mip ? 0 : 1) + (enable_barrier ? 1 : 0);
+    const int num_workers = std::min(available_threads, requested_workers);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/pdlp/solve.cu` around lines 1530 - 1535, The current logic always
requests 3 threads for stand-alone LP because enable_barrier evaluates true,
which can exceed the OpenMP thread budget; update the code that computes
num_workers (the value passed to `#pragma omp parallel num_threads(...)`) to cap
it by available_threads so it never requests more threads than available (e.g.,
set num_workers = min(needed_workers, available_threads) where needed_workers is
3 when enable_barrier is true and 1+1 when not), using the existing symbols
available_threads, enable_barrier and num_workers so the parallel region no
longer asks for num_threads > available_threads.

1591-1605: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Move the device-0 stream synchronization before creating the barrier task.

Line 1602 synchronizes problem.handle_ptr from inside the spawned barrier task. This causes an unintended serialization: when the dispatcher runs run_pdlp() immediately after spawning the task, it queues work to that stream. If the barrier task starts later, its sync_stream() call blocks waiting for PDLP to complete, defeating concurrent execution on the multi-GPU path. Moving the sync to the dispatcher thread before task creation ensures GPU work is ordered correctly without blocking the barrier's independent stream.

🔧 Minimal fix
-      if (enable_barrier) {
+      if (enable_barrier) {
+        if (settings.num_gpus > 1) { problem.handle_ptr->sync_stream(); }
 `#pragma` omp task default(shared)
         {
           try {
             auto call_barrier_thread = [&]() {
               rmm::cuda_stream_view barrier_stream = rmm::cuda_stream_per_thread;
               barrier_handle_ptr         = std::make_unique<raft::handle_t>(barrier_stream);
               auto barrier_problem       = dual_simplex_problem;
               barrier_problem.handle_ptr = barrier_handle_ptr.get();
               run_barrier_thread<i_t, f_t>(barrier_problem, settings_pdlp, sol_barrier_ptr, timer);
             };
             if (settings.num_gpus > 1) {
-              problem.handle_ptr->sync_stream();
               raft::device_setter device_setter(1);  // Scoped variable
               CUOPT_LOG_DEBUG("Barrier device: %d", device_setter.get_current_device());
               call_barrier_thread();
             } else {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/pdlp/solve.cu` around lines 1591 - 1605, The spawned barrier task
currently calls problem.handle_ptr->sync_stream() inside the task, which can
serialize GPU work; move that sync out of the task and perform
problem.handle_ptr->sync_stream() in the dispatcher thread before creating the
OpenMP task (i.e., before the if (settings.num_gpus > 1) block that constructs
device_setter and calls call_barrier_thread()). Keep the rest of the task body
intact (creation of barrier_handle_ptr, barrier_problem.handle_ptr assignment,
run_barrier_thread<i_t,f_t>(), and use of raft::device_setter inside the task)
so the barrier uses its own stream without blocking PDLP work.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/src/pdlp/pdhg.cu`:
- Line 1060: The ping-pong graph cache currently keys only on should_major but
must also account for buffer parity, so graph_all.run (and ping_pong_graph_t)
can launch against stale primal/dual buffers; update the cache strategy to use
four distinct slots (major/non-major × even/odd) or switch to separate major and
minor ping-pong caches keyed by total_pdlp_iterations (or another monotonic
iteration counter) plus should_major to distinguish buffer parity, and modify
graph_all.run and the ping_pong_graph_t lookup/insert logic to include the
parity bit (e.g., iteration % 2 or total_pdlp_iterations parity) when selecting
the cached executable so each unique (should_major, parity) pair maps to its own
cached executable.

---

Outside diff comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1530-1535: The current logic always requests 3 threads for
stand-alone LP because enable_barrier evaluates true, which can exceed the
OpenMP thread budget; update the code that computes num_workers (the value
passed to `#pragma omp parallel num_threads(...)`) to cap it by
available_threads so it never requests more threads than available (e.g., set
num_workers = min(needed_workers, available_threads) where needed_workers is 3
when enable_barrier is true and 1+1 when not), using the existing symbols
available_threads, enable_barrier and num_workers so the parallel region no
longer asks for num_threads > available_threads.
- Around line 1591-1605: The spawned barrier task currently calls
problem.handle_ptr->sync_stream() inside the task, which can serialize GPU work;
move that sync out of the task and perform problem.handle_ptr->sync_stream() in
the dispatcher thread before creating the OpenMP task (i.e., before the if
(settings.num_gpus > 1) block that constructs device_setter and calls
call_barrier_thread()). Keep the rest of the task body intact (creation of
barrier_handle_ptr, barrier_problem.handle_ptr assignment,
run_barrier_thread<i_t,f_t>(), and use of raft::device_setter inside the task)
so the barrier uses its own stream without blocking PDLP work.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0873e5b9-5498-4249-baf0-88019e9ff442

📥 Commits

Reviewing files that changed from the base of the PR and between d5b7140 and 1a2a3ca.

📒 Files selected for processing (10)
  • cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu
  • cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cuh
  • cpp/src/pdlp/pdhg.cu
  • cpp/src/pdlp/restart_strategy/weighted_average_solution.cu
  • cpp/src/pdlp/solve.cu
  • cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.cu
  • cpp/src/pdlp/utilities/ping_pong_graph.cu
  • cpp/src/pdlp/utilities/ping_pong_graph.cuh
  • cpp/src/utilities/manual_cuda_graph.cuh
  • cpp/tests/mip/incumbent_callback_test.cu
💤 Files with no reviewable changes (1)
  • cpp/src/pdlp/utilities/ping_pong_graph.cu

Comment thread cpp/src/pdlp/pdhg.cu Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/include/cuopt/linear_programming/pdlp/solver_settings.hpp`:
- Around line 24-31: pdlp.cu dereferences settings_.concurrent_halt->load() but
only has a forward declaration of template omp_atomic_t from
solver_settings.hpp; add an explicit include of the header that defines
omp_atomic_t so the member functions are available. In practice, add `#include`
"utilities/omp_helpers.hpp" at the top of cpp/src/pdlp/pdlp.cu (before any use
of settings_.concurrent_halt or inclusion of pdlp.cuh) so omp_atomic_t<int> is
fully defined when calling settings_.concurrent_halt->load().

In `@cpp/src/pdlp/solve.cu`:
- Line 331: global_concurrent_halt is a namespace-scope shared flag that causes
interference between simultaneous run_concurrent() calls; change it to a
per-invocation flag created inside run_concurrent() (e.g., a local
omp_atomic_t<int> or an allocated instance) and pass that instance to any helper
functions that currently read or write global_concurrent_halt (replace
references to global_concurrent_halt in run_concurrent() and the other affected
sites, including the occurrences noted around the later uses), then remove the
namespace-scope global to ensure each concurrent solve has its own stop flag.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8a34403d-41bc-41c5-875a-71e6a035cfc5

📥 Commits

Reviewing files that changed from the base of the PR and between 1a2a3ca and 5882409.

📒 Files selected for processing (11)
  • cpp/include/cuopt/linear_programming/pdlp/solver_settings.hpp
  • cpp/src/barrier/sparse_cholesky.cuh
  • cpp/src/branch_and_bound/branch_and_bound.hpp
  • cpp/src/branch_and_bound/pseudo_costs.cpp
  • cpp/src/dual_simplex/simplex_solver_settings.hpp
  • cpp/src/mip_heuristics/diversity/diversity_manager.cuh
  • cpp/src/mip_heuristics/relaxed_lp/relaxed_lp.cuh
  • cpp/src/pdlp/pdhg.cu
  • cpp/src/pdlp/pdhg.hpp
  • cpp/src/pdlp/pdlp.cu
  • cpp/src/pdlp/solve.cu

Comment thread cpp/include/cuopt/linear_programming/pdlp/solver_settings.hpp
Comment thread cpp/src/pdlp/solve.cu
@rgsl888prabhu rgsl888prabhu changed the base branch from main to release/26.06 May 20, 2026 17:27
@rg20
Copy link
Copy Markdown
Contributor

rg20 commented May 20, 2026

@akifcorduk is this needed for 26.06?

@akifcorduk
Copy link
Copy Markdown
Contributor Author

@rg20 to fully obey thread count, it would be very beneficial. It was a small bug before as we don't obey the low thread count. Performance wise, functionality wise, it is not a big deal i think.

@nguidotti
Copy link
Copy Markdown
Contributor

nguidotti commented May 24, 2026

Replaced by #1291, which only includes the code changes related with the OpenMP migration.

@nguidotti nguidotti closed this May 24, 2026
rapids-bot Bot pushed a commit that referenced this pull request May 25, 2026
This PR cleans  #1252, so only the changes related to the OpenMP are present.

## Original description:

`run_concurrent` in `cpp/src/pdlp/solve.cu` now dispatches the barrier and dual simplex workers as `#pragma omp task` inside a `#pragma omp taskgroup` instead of raw `std::thread`. PDLP still runs synchronously on the dispatching thread.

MIP path (`omp_in_parallel()`): reuses the upstream `solve_mip` OMP team. Barrier and dual simplex now consume slots from the configured `num_cpu_threads` budget instead of spawning extra OS threads outside it.
Stand-alone LP path: stands up a local `#pragma omp parallel + single` with the right number of workers.

This PR also removes the confuscated `std::future` logic on the barrier on other PR.

Authors:
  - Nicolas L. Guidotti (https://github.com/nguidotti)
  - Akif ÇÖRDÜK (https://github.com/akifcorduk)

Approvers:
  - Alice Boucher (https://github.com/aliceb-nv)

URL: #1291
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

improvement Improves an existing functionality non-breaking Introduces a non-breaking change P0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants