Skip to content

Route DebuggerController operations through a per-controller worker queue#1090

Merged
xusheng6 merged 1 commit into
devfrom
test_refactor-worker-queue
Jun 16, 2026
Merged

Route DebuggerController operations through a per-controller worker queue#1090
xusheng6 merged 1 commit into
devfrom
test_refactor-worker-queue

Conversation

@xusheng6

@xusheng6 xusheng6 commented May 20, 2026

Copy link
Copy Markdown
Member

Replaces the "spawn one detached std::thread per operation" model in DebuggerController with a single per-controller worker thread + FIFO queue, and makes the controller own its state directly instead of bouncing it through the event dispatcher.

Implements #1088 and #1089. Supersedes the auto-closed #1087.

Why

The per-op detach pattern (17 sites) captured this and detached, so a controller could be destroyed while a worker was still running — the recurring use-after-free family in Sentry (#1083, BINARYNINJA-1A/3X/5Z/60/7M). The queue makes lifetime structural: the worker is owned by the controller and joined in the destructor, so no adapter work can outlive it.

What changed

  • Worker queue. All ops route through Submit / SubmitAndWait. One worker thread per controller, joined on destruction. Submit runs inline if called from the worker itself (so RestartQuit + Launch doesn't self-deadlock).
  • Timeout. Every XxxAndWait gains an optional std::chrono::milliseconds timeout (default = wait forever). Existing callers unchanged.
  • Pause / Restart / Quit / Detach interrupt a running target out-of-band: they call BreakInto() on the caller's thread, then queue the actual op. Without this they'd sit behind the in-flight resume forever.
  • Adapter-stop channel. AdapterStoppedEvent is now an internal worker signal (condvar), not a public dispatcher event. Conditional-breakpoint silent-resume moved onto the worker; the dispatcher's stop machinery, m_lastAdapterStopEventConsumed, and m_suppressResumeEvent are gone. Adapters are unchanged.
  • State on the worker. EventHandler (the controller subscribing to its own broadcast) is replaced by ApplyOwnStateForEvent, called inline in PostDebuggerEvent before the event is broadcast. The dispatcher is now pure fan-out to external consumers (UI/plugins). This removes the worker-vs-dispatcher race that broke Restart.
  • Mutexes removed. m_adapterMutex / m_adapterMutex2 guarded concurrent adapter access that can no longer happen; replaced with an assert(on worker thread).
  • Stop is always reported. The worker op calls NotifyStopped even on a user-requested break (the old Pause path used to do it; the new out-of-band Pause doesn't), fixing "target halts but UI still shows running".

API compatibility

Public method signatures unchanged (timeout is an appended optional arg). Callers in core/ffi.cpp and ui/uinotification.cpp need no changes.

Out of scope (follow-ups)

  • Adapter-internal threads (EngineLoop, LLDB listener, RSP loops) — need adapter dtors as join barriers.
  • m_targetControlMutex is now largely redundant; kept to limit churn.
  • Refcounted DebugAdapter was tried and reverted — the adapter has no independent lifecycle (freed only in ~DebuggerState), so it solved no real UAF.
  • UI repaint throttling for fast stepping — UI widgets: mark-dirty-on-change, repaint at most every 100ms #1093.

Test plan

  • Builds clean (ninja — core + ui)
  • Launch / step / breakpoint / pause while running / detach mid-step / close tab mid-op
  • Restart on a running target
  • si typed directly in the LLDB REPL while idle (spontaneous stop)
  • Conditional breakpoint with a false condition (silent resume)
  • IL step surfaces one stop per user action
  • test/debugger_test.py

Comment thread core/debuggercontroller.cpp Outdated
// pair (which guarded against concurrent adapter access from multiple spawned threads)
// is no longer needed. The new Pause path bypasses this method entirely and calls
// m_adapter->BreakInto() out-of-band; everything else funnels through here on the worker.
assert(std::this_thread::get_id() == m_workerThreadId);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend using BN_ASSERT or BN_RELEASE_ASSERT, depending on whether we want to validate this only in debug builds or also in our release builds. I'd lean towards the latter given that this check is cheap.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed!

Comment thread core/debuggercontroller.cpp Outdated
DebuggerController::~DebuggerController()
{
{
std::lock_guard<std::mutex> lock(m_workQueueMutex);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's an issue with the way mutexes and condition variables are used here that can lead to missed wakeups. Specifically, DebuggerController::WaitForAdapterStop uses m_adapterStopMutex with its condition variable, while this store occurs with m_workQueueMutex held. This means there's a race that can happen where m_workerShouldExit is set here but DebuggerController::WaitForAdapterStop does not see it.

This is related to the note at https://en.cppreference.com/cpp/thread/condition_variable that says:

Even if the shared variable is atomic, it must be modified while owning the mutex to correctly publish the modification to the waiting thread.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to

	{
		std::scoped_lock shutdownLock(m_workQueueMutex, m_adapterStopMutex);
		m_workerShouldExit = true;
	}

which I believe should fix the issue

Comment thread core/debuggercontroller.cpp Outdated
// (GoAndWaitOnWorker etc.) is the sole notifier; it calls NotifyStopped on every
// genuine stop regardless of m_userRequestedBreak. m_userRequestedBreak only
// suppresses conditional-breakpoint auto-resume in ShouldSilentResumeAfterStop.
m_userRequestedBreak = true;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it safe to modify m_userRequestedBreak without taking any locks? It appears to potentially be read / written from different threads, so either needs to always be accessed with a lock held or to be std::atomic<bool>.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this boolean is a leftover from the old code. It is no longer needed in the new design, I will delete it

Comment thread core/debuggercontroller.h

private:
// m_adapter is the active debug adapter for this controller. Written exclusively
// from the worker thread (in CreateDebugAdapter); read both from the worker

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this field is accessed by multiple threads, don't accesses need to be guarded by a mutex to avoid being data races?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah data races are not the focus of this PR, they will need a separate one to address

Comment thread core/debuggercontroller.cpp Outdated

void DebuggerController::WorkerThreadMain()
{
m_workerThreadId = std::this_thread::get_id();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is set on the worker thread but read from other threads without any synchronization, which is a data race. This needs some form of synchronization.

Another option would be to skip storing a thread ID, and instead use a thread-local variable that points to the DebuggerController instance when on the worker thread and is nullptr otherwise. To check whether code is running on the worker thread you'd then instead test t_controllerOnWorker == this. Since it is a thread-local variable there is no need for synchronization.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I adopted the suggested method

Comment thread core/debuggercontroller.cpp Outdated
// caches updated via UpdateCaches) are mutated under the worker's serialization
// for ops the worker generates (TargetStopped, LaunchFailure, etc.) and under
// the adapter event thread for spontaneous events (TargetExited, RegisterChanged,
// ActiveThreadChanged). Concurrent mutation is unlikely in practice because the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned by this reasoning. It seems to say that this code contains data races, but it's fine because they probably don't happen very often.

A data race is when two or more threads access the same memory location, with at least one of those accesses being a write, and there is no synchronization between the accesses (synchronization here is defined in terms of the happens-before relationship).

In C++, if a data race occurs, the behavior of the program is undefined. The compiler is free to assume that data races do not happen and optimize the code on that basis. This makes it impossible to reason about what the code is actually doing.

Comment thread core/debuggercontroller.h
// in-flight op to settle before returning). A timeout of milliseconds::max() means
// "wait forever" and bypasses wait_for entirely (avoids overflow inside the stdlib).
template<typename F>
auto SubmitAndWait(F&& f, std::chrono::milliseconds timeout)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What behavior does this have if run on the worker thread? Submit has an explicit case for it. Does this need one? If it's never expected to be called from the worker thread, an assert would serve to document that and catch any potential future misuse.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking at the code closer, I have decided to make both of them NOT re-entrant. There is also no place where they are called in a reentrant way.

Just for the record -- Submit can be made re-entrant safe, and SubmitAndWait, I believe, cannot

Comment thread core/debuggercontroller.h Outdated
DebugStopReason RestartAndWaitOnWorker();
void DetachAndWaitOnWorker();
void QuitAndWaitOnWorker();
DebugStopReason PauseAndWaitOnWorker();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PauseAndWaitOnWorker appears to not be referenced anywhere.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@xusheng6

Copy link
Copy Markdown
Member Author

Hi @bdash thanks for the review! I forgot to mention that this PR mainly means to be clean up the lifetime + control-flow issues, so that I can deal with the data races in a follow-up PR. I did a bad job in communicating this, and I guess you could not have known that because the last commit I added did appear to trying to handle the data races. I know that one is not sufficient yet -- I was just testing the general visibility of the idea, and data races will be addressed in a future PR

I will go ahead and address the new issues you mention that are introduced by the new code, and have you look at it again

// Thread-safety: this body touches m_state's connection/execution status, plus
// m_lastIP / m_currentIP / m_exitCode on the controller -- all plain (non-atomic,
// unlocked) fields read from other threads (UI render layer, file accessor) without
// synchronization. Those are pre-existing data races reported in #1091; this PR

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The remainder of this comment is in terms of "this PR" and "this commit", which won't make much sense once the change is merged.

I'd probably just end this with "Fixing these data races is tracked in #1091".

@xusheng6 xusheng6 force-pushed the test_refactor-worker-queue branch from 4d3a9ae to 4ff2922 Compare June 16, 2026 21:05
@xusheng6 xusheng6 merged commit 723b403 into dev Jun 16, 2026
1 check passed
@xusheng6 xusheng6 deleted the test_refactor-worker-queue branch June 16, 2026 21:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants