
Worker teardown: AB-BA deadlock between main and worker V8 isolate locks (constrained-core/CI) #397

Description

@NathanWalker

Summary

The runtime unit-test suite intermittently deadlocks during the TNS Workers "no crash during or after runtime teardown on iOS" worker stress spec. It does not reproduce on many-core dev machines, but reproduces reliably on the constrained CI runner. The root cause is a genuine AB-BA deadlock between two V8 isolate locks (the main isolate and a worker isolate): a real, if rare, latent runtime issue, not just a test artifact.

The spec is quarantined at the harness level (a specFilter entry in TestRunner/app/Infrastructure/Jasmine/jasmine-2.0.1/boot.js) so CI is green; this issue tracks the proper runtime fix and re-enabling the spec.

The deadlock (proven from native stacks)

Captured by sampling the hung TestRunner app process on CI (10 snapshots, identical cycle in each):

Main thread — holds the main isolate lock (running JS), is in the spec's 1ms setTimeout loop posting 'send-to-worker':

__NSFireTimer → JS produceMessageInLoop
  → -[NSNotificationCenter postNotificationName:object:userInfo:]
    → __CFNOTIFICATIONCENTER_IS_CALLING_OUT_TO_AN_OBSERVER__   (worker's queue:nil observer block)
      → tns::ArgConverter::MethodCallback → v8::Locker::Locker(workerIsolate)
        → __psynch_mutexwait        // BLOCKED waiting for the worker isolate lock

Worker thread — holds its own isolate lock (loading TeardownCrashWorker.js), triggers a main-extended class's +initialize:

WorkerWrapper::BackgroundLooper → Runtime::RunModule → ... → ObjC msgSend
  → initializeNonMetaClass → CALLING_SOME_+initialize_METHOD
    → block in ClassBuilder::RegisterNativeTypeScriptExtendsFunction (NativeScript/runtime/ClassBuilder.mm:266)
      → v8::Locker::Locker(mainIsolate)
        → __psynch_mutexwait        // BLOCKED waiting for the main isolate lock

(A third thread is a secondary victim, blocked on the ObjC class-initialization os_unfair_lock held by the stuck worker.)

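So: main holds main-lock / wants worker-lock; worker holds worker-lock / wants main-lock. Neither can proceed. It only manifests when the two windows overlap, i.e. the main thread is inside the synchronous observer callout at the same moment the worker is inside the main-defined class's +initialize. That overlap is missed on many-core dev machines and hit on the constrained CI runner.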
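The cycle above can be modelled outside the runtime. The following is a minimal C++ sketch, not runtime code: `mainLock` and `workerLock` are hypothetical stand-ins for the two isolate locks, and `try_lock_for` replaces the indefinite wait of a real `v8::Locker` so the demo reports the cycle instead of hanging. The two rendezvous points force the overlap that the constrained CI runner produces by chance.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Outcome of each side's attempt to take the other side's lock.
struct CrossLockResult {
    bool mainGotWorkerLock;
    bool workerGotMainLock;
};

// Spin until `n` threads have arrived at this rendezvous point.
inline void arriveAndWait(std::atomic<int>& arrived, int n) {
    arrived.fetch_add(1);
    while (arrived.load() < n) std::this_thread::yield();
}

// Each thread takes its own lock, then tries the other's with a timeout.
// Hypothetical model: a real v8::Locker would block forever here.
inline CrossLockResult simulateCrossIsolateLocking(std::chrono::milliseconds timeout) {
    assert(timeout.count() > 0);
    std::timed_mutex mainLock, workerLock;
    std::atomic<int> bothHold{0}, bothTried{0};
    CrossLockResult result{false, false};

    std::thread mainThread([&] {
        std::lock_guard<std::timed_mutex> own(mainLock);  // main is "running JS"
        arriveAndWait(bothHold, 2);                       // both sides now hold their own lock
        result.mainGotWorkerLock = workerLock.try_lock_for(timeout);
        arriveAndWait(bothTried, 2);                      // keep own lock until both attempts end
        if (result.mainGotWorkerLock) workerLock.unlock();
    });
    std::thread workerThread([&] {
        std::lock_guard<std::timed_mutex> own(workerLock);  // worker is "loading its module"
        arriveAndWait(bothHold, 2);
        result.workerGotMainLock = mainLock.try_lock_for(timeout);
        arriveAndWait(bothTried, 2);
        if (result.workerGotMainLock) mainLock.unlock();
    });

    mainThread.join();
    workerThread.join();
    return result;
}
```

Calling `simulateCrossIsolateLocking(std::chrono::milliseconds(50))` should report that neither side obtained the other's lock, which is the condition under which the untimed version never returns.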

Why each side crosses isolates

  • main → worker: the worker registered an NSNotificationCenter observer with a nil queue, so the block runs synchronously on the posting (main) thread; NativeScript marshals it into the worker isolate, taking the worker's v8::Locker.
  • worker → main: ClassBuilder::RegisterNativeTypeScriptExtendsFunction installs a +initialize IMP that captures the defining isolate and does v8::Locker locker(isolate) (ClassBuilder.mm:262-296). +initialize fires lazily on whichever thread first messages the class — here a worker thread first-touches a main-defined extended class, so the worker runs +initialize and locks the main isolate.
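The worker → main crossing hinges on lazy initialization running on whichever thread touches the class first. As a rough analogy only, the C++ sketch below uses `std::call_once` in place of `+initialize`; `LazyClass` and `workerRanInitializer` are hypothetical names, and nothing here reflects the actual ClassBuilder code. It shows that the initializer lands on the worker thread unless the defining thread touched the class beforehand.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical analogue of an extended class with a lazy, one-time initializer.
class LazyClass {
public:
    // Stands in for "messaging the class": the first caller runs the initializer.
    void touch() {
        std::call_once(once_, [this] {
            initializedOn_ = std::this_thread::get_id();
        });
    }
    std::thread::id initializedOn() const { return initializedOn_; }

private:
    std::once_flag once_;
    std::thread::id initializedOn_;
};

// Returns true when the initializer ran on the worker thread rather than on
// the thread that defined the class.
inline bool workerRanInitializer(bool seedOnDefiningThread) {
    LazyClass cls;
    const std::thread::id definingThread = std::this_thread::get_id();
    if (seedOnDefiningThread) cls.touch();  // eager first touch by the definer

    std::thread::id workerId;
    std::thread worker([&] {
        workerId = std::this_thread::get_id();
        cls.touch();  // first touch only if the definer did not seed
    });
    worker.join();

    assert(cls.initializedOn() == definingThread || cls.initializedOn() == workerId);
    return cls.initializedOn() == workerId;
}
```

In this model, `workerRanInitializer(false)` corresponds to the situation in the captured stacks, where the worker ends up running an initializer that belongs to the main isolate.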

Reproduction conditions

  • Constrained cores (CI runner; not reproducible on many-core dev machines).
  • A worker observing a main-thread notification with a nil queue, while the worker is still initializing/first-touching a main-isolate-extended class.
  • This pattern is unusual but possible in production, so the fix matters beyond CI.

Candidate fix directions (both deep; need a real repro harness + review)

  1. worker → main side — avoid running a foreign isolate's +initialize from a worker thread (e.g. seed/force initialization on the defining isolate's thread at registration time, or make the IMP defer to the defining isolate's runloop). Risk: class-extension / ObjCExposedMethods timing.
  2. main → worker side — invoke a worker-owned callback (e.g. a nil-queue notification block) on the worker's runloop instead of synchronously on the posting thread. Risk: changes synchronous-callback semantics app-wide.
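Direction 2 amounts to handing the callback to the owning thread instead of taking the owner's lock on the posting thread. The sketch below illustrates the idea in plain C++ under stated assumptions: `OwnerLoop` is a hypothetical task queue standing in for a worker's runloop, not a NativeScript or Foundation API, and the poster does not wait for the callback, which is exactly the change in synchronous-callback semantics flagged as a risk above.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <mutex>
#include <thread>

// Hypothetical single-consumer task queue standing in for a worker's runloop.
class OwnerLoop {
public:
    OwnerLoop() : thread_([this] { run(); }) {}
    ~OwnerLoop() {
        post(nullptr);  // an empty task is the stop sentinel
        thread_.join();
    }
    // Callable from any thread; only the short-lived queue mutex is taken,
    // never a lock the owner holds while running callbacks.
    void post(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> guard(queueMutex_);
            tasks_.push_back(std::move(task));
        }
        ready_.notify_one();
    }
    std::thread::id threadId() const { return thread_.get_id(); }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(queueMutex_);
                ready_.wait(lock, [this] { return !tasks_.empty(); });
                task = std::move(tasks_.front());
                tasks_.pop_front();
            }
            if (!task) return;  // stop sentinel
            task();             // runs on the owner's thread, outside the queue mutex
        }
    }
    std::mutex queueMutex_;
    std::condition_variable ready_;
    std::deque<std::function<void()>> tasks_;
    std::thread thread_;  // declared last so the queue members exist before it starts
};

// Posts one callback and reports whether it ran on the owner's thread.
inline bool callbackRunsOnOwnerThread() {
    std::promise<std::thread::id> ranOn;  // declared first so it outlives the loop thread
    OwnerLoop worker;
    worker.post([&ranOn] { ranOn.set_value(std::this_thread::get_id()); });
    const std::thread::id id = ranOn.get_future().get();
    assert(id != std::this_thread::get_id());
    return id == worker.threadId();
}
```

Because the posting thread never waits on a lock held by the owner, the main → worker half of the cycle disappears in this model. Whether deferred delivery is acceptable for existing observers is the open question for the real fix.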

Diagnostics

Native-stack capture on hang is now wired into CI (.github/scripts/sample-hung-app.sh, uploaded as the test-diagnostics artifact), and the in-app Jasmine progress beacon reports the last suite to the XCTest harness — so any future suite hang is self-diagnosing.
