Summary
The runtime unit-test suite intermittently deadlocks during the TNS Workers "no crash during or after runtime teardown on iOS" worker stress spec. It does not reproduce on many-core dev machines, but reproduces reliably on the constrained CI runner. The root cause is a genuine AB-BA deadlock between two V8 isolate locks (the main isolate and a worker isolate): a real, if rare, latent runtime issue, not just a test artifact.
The spec is quarantined at the harness level (a specFilter entry in TestRunner/app/Infrastructure/Jasmine/jasmine-2.0.1/boot.js) so CI is green; this issue tracks the proper runtime fix and re-enabling the spec.
The deadlock (proven from native stacks)
Captured by sampling the hung TestRunner app process on CI (10 snapshots, identical cycle in each):
Main thread — holds the main isolate lock (running JS), is in the spec's 1ms setTimeout loop posting 'send-to-worker':
__NSFireTimer → JS produceMessageInLoop
→ -[NSNotificationCenter postNotificationName:object:userInfo:]
→ __CFNOTIFICATIONCENTER_IS_CALLING_OUT_TO_AN_OBSERVER__ (worker's queue:nil observer block)
→ tns::ArgConverter::MethodCallback → v8::Locker::Locker(workerIsolate)
→ __psynch_mutexwait // BLOCKED waiting for the worker isolate lock
Worker thread — holds its own isolate lock (loading TeardownCrashWorker.js), triggers a main-extended class's +initialize:
WorkerWrapper::BackgroundLooper → Runtime::RunModule → ... → ObjC msgSend
→ initializeNonMetaClass → CALLING_SOME_+initialize_METHOD
→ block in ClassBuilder::RegisterNativeTypeScriptExtendsFunction (NativeScript/runtime/ClassBuilder.mm:266)
→ v8::Locker::Locker(mainIsolate)
→ __psynch_mutexwait // BLOCKED waiting for the main isolate lock
(A third thread is a secondary victim, blocked on the ObjC class-initialization os_unfair_lock held by the stuck worker.)
So: main holds main-lock / wants worker-lock; worker holds worker-lock / wants main-lock. Neither can proceed. It only manifests when those two windows overlap, which is why fast machines miss it and constrained CI hits it.
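The cycle can be modelled outside the runtime. The following is a minimal sketch, assuming two plain `std::timed_mutex` objects as stand-ins for the main and worker isolate locks; `simulateLockCycle` and `LockCycleResult` are hypothetical names, not NativeScript or V8 API. Each thread takes its own lock, waits until the other thread holds its lock too (the overlapping window), and then tries the foreign lock with a timeout so the sketch reports the cycle instead of hanging.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for the two isolate locks; not NativeScript or V8 API.
struct LockCycleResult {
    bool mainGotWorkerLock;
    bool workerGotMainLock;
};

// Each thread takes its own lock, waits until the other thread also holds its
// lock, then tries the foreign lock with a timeout instead of blocking forever.
inline LockCycleResult simulateLockCycle(std::chrono::milliseconds timeout) {
    std::timed_mutex mainLock;
    std::timed_mutex workerLock;
    std::atomic<int> holding{0};    // threads that hold their own lock
    std::atomic<int> attempted{0};  // threads that finished the foreign attempt
    LockCycleResult result{false, false};

    auto run = [&](std::timed_mutex& own, std::timed_mutex& foreign, bool& got) {
        std::lock_guard<std::timed_mutex> ownGuard(own);
        holding.fetch_add(1);
        while (holding.load() < 2) std::this_thread::yield();

        // The real runtime blocks here indefinitely (__psynch_mutexwait).
        got = foreign.try_lock_for(timeout);
        if (got) foreign.unlock();

        // Keep the own lock until both attempts are over, as a stuck thread would.
        attempted.fetch_add(1);
        while (attempted.load() < 2) std::this_thread::yield();
    };

    std::thread mainThread(run, std::ref(mainLock), std::ref(workerLock),
                           std::ref(result.mainGotWorkerLock));
    std::thread workerThread(run, std::ref(workerLock), std::ref(mainLock),
                             std::ref(result.workerGotMainLock));
    mainThread.join();
    workerThread.join();
    return result;
}
```

Calling `simulateLockCycle(std::chrono::milliseconds(20))` should leave both flags false: neither side can obtain the foreign lock while the other still holds it. With blocking acquisition, as in the captured stacks, that state never resolves.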
Why each side crosses isolates
- main → worker: the worker registered an NSNotificationCenter observer with a nil queue, so the block runs synchronously on the posting (main) thread; NativeScript marshals it into the worker isolate, taking the worker's v8::Locker.
- worker → main: ClassBuilder::RegisterNativeTypeScriptExtendsFunction installs a +initialize IMP that captures the defining isolate and does v8::Locker locker(isolate) (ClassBuilder.mm:262-296). +initialize fires lazily on whichever thread first messages the class. Here a worker thread first-touches a main-defined extended class, so the worker runs +initialize and locks the main isolate.
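The worker → main crossing is the less obvious of the two. As a rough analogy only, the sketch below uses `std::call_once` to model a lazy initializer that captured its definer's lock at registration time but runs on whichever thread touches the class first. `LazyClass`, `touch`, and `initializerRanOnWorker` are hypothetical names for illustration; the real mechanism is the Objective-C runtime's +initialize, not `std::call_once`.

```cpp
#include <mutex>
#include <thread>

// Hypothetical model of a class whose lazy initializer captured its defining
// thread's lock; none of these names come from NativeScript.
struct LazyClass {
    std::mutex* definingLock = nullptr;  // captured at registration time
    std::once_flag initFlag;
    std::thread::id initThread;          // thread that actually ran the initializer
    bool tookDefiningLock = false;

    // Runs the initializer once, on whichever thread touches the class first.
    void touch() {
        std::call_once(initFlag, [this] {
            // Takes the definer's lock from the calling thread.
            std::lock_guard<std::mutex> guard(*definingLock);
            initThread = std::this_thread::get_id();
            tookDefiningLock = true;
        });
    }
};

// Registers on the calling ("main") thread, first-touches on a worker thread.
// Returns true when the initializer ran on the worker and took the main lock.
inline bool initializerRanOnWorker() {
    std::mutex mainLock;  // free here; in the hang it is held by the main thread
    LazyClass cls;
    cls.definingLock = &mainLock;

    std::thread::id workerId;
    std::thread worker([&] {
        workerId = std::this_thread::get_id();
        cls.touch();
    });
    worker.join();

    return cls.initThread == workerId &&
           cls.initThread != std::this_thread::get_id() &&
           cls.tookDefiningLock;
}
```

In this sketch the main lock is free, so the initializer completes. If the defining thread held `mainLock` while it was itself blocked on something the worker owns, the initializer would wait forever, which is the worker half of the cycle.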
Reproduction conditions
- Constrained cores (CI runner; not reproducible on many-core dev machines).
- A worker observing a main-thread notification with a nil queue, while the worker is still initializing/first-touching a main-isolate-extended class.
- This pattern is unusual but possible in production, so the fix matters beyond CI.
Candidate fix directions (both deep; need a real repro harness + review)
- worker → main side: avoid running a foreign isolate's +initialize from a worker thread (e.g. seed/force initialization on the defining isolate's thread at registration time, or make the IMP defer to the defining isolate's runloop). Risk: class-extension / ObjCExposedMethods timing.
- main → worker side: invoke a worker-owned callback (e.g. a nil-queue notification block) on the worker's runloop instead of synchronously on the posting thread. Risk: changes synchronous-callback semantics app-wide.
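Both directions share one idea: hand the work to the owning thread instead of taking its lock from outside. The following is a minimal sketch of that idea, assuming a hypothetical per-isolate `Inbox` in place of a real runloop and plain `std::mutex` objects in place of isolate locks. It is not a proposed patch and does not address the timing and semantics risks listed above.

```cpp
#include <atomic>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical per-isolate inbox: foreign threads enqueue work instead of
// taking the owner's isolate lock. Not NativeScript or V8 API.
class Inbox {
public:
    void post(std::function<void()> task) {
        std::lock_guard<std::mutex> guard(queueLock_);
        tasks_.push(std::move(task));
    }

    // Called only by the owning thread, while it holds its own isolate lock.
    void drain() {
        for (;;) {
            std::function<void()> task;
            {
                std::lock_guard<std::mutex> guard(queueLock_);
                if (tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the short-lived queue lock
        }
    }

private:
    std::mutex queueLock_;
    std::queue<std::function<void()>> tasks_;
};

struct DeferredResult {
    int ranOnMain;
    int ranOnWorker;
};

// Same overlap as the deadlock: both threads hold their own lock at once,
// but cross-isolate calls are posted rather than executed under a foreign lock.
inline DeferredResult simulateDeferredCalls() {
    std::mutex mainLock, workerLock;
    Inbox mainInbox, workerInbox;
    std::atomic<int> holding{0};
    std::atomic<int> posted{0};
    DeferredResult result{0, 0};

    auto run = [&](std::mutex& own, Inbox& ownInbox, Inbox& foreignInbox,
                   int& foreignCounter) {
        std::lock_guard<std::mutex> guard(own);
        holding.fetch_add(1);
        while (holding.load() < 2) std::this_thread::yield();

        // Cross-isolate request: only the inbox's queue lock is touched.
        foreignInbox.post([&foreignCounter] { ++foreignCounter; });
        posted.fetch_add(1);
        while (posted.load() < 2) std::this_thread::yield();

        ownInbox.drain();  // the owner runs foreign requests itself
    };

    std::thread mainThread(run, std::ref(mainLock), std::ref(mainInbox),
                           std::ref(workerInbox), std::ref(result.ranOnWorker));
    std::thread workerThread(run, std::ref(workerLock), std::ref(workerInbox),
                             std::ref(mainInbox), std::ref(result.ranOnMain));
    mainThread.join();
    workerThread.join();
    return result;
}
```

`simulateDeferredCalls()` should return one executed task per side. The trade-off is visible in the sketch: the poster no longer observes the callback's effect synchronously, which is the semantics change flagged as a risk for the main → worker direction.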
Diagnostics
Native-stack capture on hang is now wired into CI (.github/scripts/sample-hung-app.sh, uploaded as the test-diagnostics artifact), and the in-app Jasmine progress beacon reports the last suite to the XCTest harness — so any future suite hang is self-diagnosing.