Skip to content

Fix: cancellation wiring#2772

Merged
tempusfrangit merged 14 commits intomainfrom
fix/cancellation-wiring
Feb 25, 2026
Merged

Fix: cancellation wiring#2772
tempusfrangit merged 14 commits intomainfrom
fix/cancellation-wiring

Conversation

@tempusfrangit
Copy link
Member

@tempusfrangit tempusfrangit commented Feb 24, 2026

Summary

Wire cancellation end-to-end from HTTP to worker subprocess, and make CancelationException a proper public API.

Cancel was accepted by the HTTP layer but never reached the worker. This wires the full path: HTTP cancel/connection-drop → supervisor → orchestrator → ControlRequest::Cancel → worker → PyThreadState_SetAsyncExc.

Key changes

Cancellation wiring (existing commits)

  • Orchestrator: add cancel_by_prediction_id (resolves pred ID → slot ID)
  • Supervisor: cancel() delegates to orchestrator via spawned task
  • SyncPredictionGuard: calls supervisor.cancel() on drop (not just token)
  • Sync HTTP handler: spawns prediction in background task so slot lifetime is not tied to the HTTP connection (fixes permit leak on disconnect)
  • worker_bridge: capture py_thread_id at prediction start, use it for cancel_sync_thread(); async cancel still uses future.cancel()
  • Fix sync cancel status: upgrade PyErr to Cancelled when slot is marked cancelled

CancelationException as BaseException + public SDK export

  • CancelationException now derives from BaseException (not Exception) so bare except Exception blocks cannot swallow cancellation — matching KeyboardInterrupt and asyncio.CancelledError
  • Defined as a static type via pyo3_stub_gen::create_exception! (replaces dynamic builtins.type() approach)
  • Exported via coglet._implcogletcog.exceptionscog.CancelationException
  • Auto-generated type stubs include proper class CancelationException(BaseException) definition

Dead code removal

  • Removed SIGUSR1 signal handler, CANCELABLE flag, CancelableGuard, enter_cancelable() — all dead since PyThreadState_SetAsyncExc replaced SIGUSR1
  • Removed _is_cancelable Python getter from Server
  • Updated is_cancelation_exception to use the static type instead of dynamic import from nonexistent cog.server.exceptions

Documentation

  • New Cancellation section in docs/python.md documenting CancelationException usage, import paths, BaseException semantics, and re-raise requirement
  • Updated docs/http.md cancel endpoint docs to use cog.CancelationException import path
  • Regenerated docs/llms.txt

Tests

  • 5 new Python cancel tests (explicit cancel sync/async, connection-drop)
  • 2 txtar integration tests

closes: #2748

Cancel was accepted by the HTTP layer but never reached the worker.
This wires the full path: HTTP cancel/connection-drop → supervisor →
orchestrator → ControlRequest::Cancel → worker → PyThreadState_SetAsyncExc.

Key changes:
- Orchestrator: add cancel_by_prediction_id (resolves pred ID → slot ID)
- Supervisor: cancel() delegates to orchestrator via spawned task
- SyncPredictionGuard: calls supervisor.cancel() on drop (not just token)
- Sync HTTP handler: spawns prediction in background task so slot lifetime
  is not tied to the HTTP connection (fixes permit leak on disconnect)
- cancel.rs: replace SIGUSR1 with PyThreadState_SetAsyncExc for sync
  cancel (works on any thread, not just main); fix fallback exception
  to create a proper class, not an instance
- worker_bridge: capture py_thread_id at prediction start, use it for
  cancel_sync_thread(); async cancel still uses future.cancel()
- Tests: 5 new Python cancel tests (explicit cancel sync/async,
  connection-drop), 2 txtar integration tests
# Conflicts:
#	crates/coglet/src/supervisor.rs
#	crates/coglet/src/transport/http/routes.rs
@tempusfrangit tempusfrangit requested a review from a team as a code owner February 24, 2026 01:02
@tempusfrangit tempusfrangit added this to the 0.17.0 Release milestone Feb 24, 2026
…d cancelled

PyThreadState_SetAsyncExc injects CancelationException into the Python
thread, but predict_worker catches it as a generic PyErr and returns
PredictionError::Failed. After predict_worker returns, check if the
slot was marked cancelled and upgrade to PredictionError::Cancelled.

Also applies the same fix to the sync train path.
…g SDK

CancelationException now derives from BaseException (not Exception) so
that bare `except Exception` blocks in user predict code cannot
swallow cancellation — matching KeyboardInterrupt and CancelledError.

Key changes:
- cancel.rs: use pyo3_stub_gen::create_exception! to define a static
  BaseException subclass, replacing the dynamic builtins.type() approach
  and the OnceLock<Py<PyAny>> storage
- lib.rs: register CancelationException on the coglet._impl module
- coglet/__init__.py: re-export CancelationException
- cog/exceptions.py: new module re-exporting from coglet with a
  pure-Python fallback for SDK-only environments
- cog/__init__.py: expose as cog.CancelationException
- stub_gen.rs: include CancelationException in public re-exports
- Regenerated stubs via mise run generate:stubs
PyThreadState_SetAsyncExc replaced SIGUSR1 as the sync cancel mechanism
but the old handler, CANCELABLE flag, CancelableGuard, and
enter_cancelable() were left behind. Nothing sends SIGUSR1 and nothing
reads the CANCELABLE flag on the cancel path.

Removed:
- _sigusr1_handler, install_signal_handler, CANCELABLE, CancelableGuard,
  enter_cancelable, is_cancelable from cancel.rs
- enter_cancelable() calls from worker_bridge.rs and predictor.rs
- install_signal_handler() call from lib.rs worker init
- _is_cancelable Python getter from Server

Also updated is_cancelation_exception in predictor.rs to use the static
cancel::CancelationException type instead of a dynamic import from the
nonexistent cog.server.exceptions module.
…glet

The pure-Python fallback CancelationException conflicted with pyright
when coglet was installed (two incompatible types with the same name).
Since coglet is always present at runtime, just re-export directly.
- python.md: add Cancellation section with CancelationException usage,
  import paths, BaseException semantics, and re-raise requirement
- http.md: update cancel endpoint docs to use cog.CancelationException
  import path (was cog.server.exceptions) and link to new SDK docs
- Regenerated llms.txt
Async predictors receive asyncio.CancelledError instead. Added a table
to python.md making the distinction clear, and updated http.md to
mention both exception types.
The typecheck in lint-python needs the coglet wheel to resolve
CancelationException (now exported from coglet). Without the local
wheel, nox falls back to PyPI which doesn't have unreleased changes.

Adds build-rust as an optional dependency and downloads the
CogletRustWheel artifact when the rust build succeeded, matching
the pattern already used by test-coglet-python.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements end-to-end cancellation wiring from the HTTP layer to the worker subprocess, addressing issue #2748. The implementation replaces the previous SIGUSR1 signal-based approach with PyThreadState_SetAsyncExc for sync predictions, and makes CancelationException a proper public SDK export as a BaseException subclass.

Changes:

  • Wires cancellation through supervisor → orchestrator → worker using cancel_by_prediction_id() and ControlRequest::Cancel
  • Implements CancelationException as a static BaseException type (not Exception) exported through the public SDK
  • Refactors sync HTTP prediction handling to spawn work in background tasks, using SyncPredictionGuard for connection-drop cancellation
  • Removes dead SIGUSR1 signal handler code, CANCELABLE flag, and CancelableGuard
  • Adds comprehensive documentation and integration/unit tests

Reviewed changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
python/cog/exceptions.py New public exceptions module exporting CancelationException
python/cog/__init__.py Adds CancelationException to public API exports
integration-tests/tests/cancel_sync_prediction.txtar Integration test for sync prediction cancellation via HTTP endpoint
integration-tests/tests/cancel_async_prediction.txtar Integration test for async prediction cancellation via HTTP endpoint
docs/python.md New Cancellation section documenting CancelationException usage and semantics
docs/llms.txt Regenerated documentation with cancellation updates
docs/http.md Updated cancel endpoint documentation with correct import paths
crates/coglet/src/transport/http/routes.rs Refactors sync predictions to spawn in background tasks with SyncPredictionGuard
crates/coglet/src/supervisor.rs Implements cancel() method that delegates to orchestrator
crates/coglet/src/service.rs Sets orchestrator on supervisor for cancel delegation
crates/coglet/src/orchestrator.rs Adds cancel_by_prediction_id() trait method and event loop cancel handler
crates/coglet-python/tests/test_coglet.py Adds Python unit tests for sync/async cancellation and connection drops
crates/coglet-python/src/worker_bridge.rs Captures py_thread_id at prediction start, implements cancel via PyThreadState_SetAsyncExc
crates/coglet-python/src/predictor.rs Updates is_cancelation_exception to use static type
crates/coglet-python/src/lib.rs Exports CancelationException and removes dead signal handler code
crates/coglet-python/src/cancel.rs Complete rewrite: implements PyThreadState_SetAsyncExc-based cancellation, removes SIGUSR1
crates/coglet-python/src/bin/stub_gen.rs Adds CancelationException to public re-exports
crates/coglet-python/coglet/_impl.pyi Generated stub with CancelationException class definition
crates/coglet-python/coglet/__init__.pyi Generated stub re-exporting CancelationException
crates/coglet-python/coglet/__init__.py Exports CancelationException from _impl
.github/workflows/ci.yaml Adds coglet wheel download to lint-python job with proper dependencies

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

stub_gen.rs now generates coglet/__init__.pyi directly instead of
having mypy stubgen overwrite it. Private submodules like _sdk use
relative imports (from . import) so type checkers resolve them via
the filesystem rather than through the native _impl module.
markphelps
markphelps previously approved these changes Feb 24, 2026
Verify that cancelling the same prediction multiple times doesn't
panic or break the server. Covers both busy-loop (immediate cancel at
bytecode boundaries) and time.sleep/nanosleep (cancel deferred until
the C-level block returns).

- Rust unit test: supervisor repeated_cancel_is_idempotent
- Python test: parametrized over busy_loop and nanosleep variants
- Integration test: repeated cancel with time.sleep predictor
The build:cog task was setting COG_VERSION to the clean Cargo.toml
version (e.g. 0.17.0), which caused isDev=false and skipped local
wheel auto-detection in dist/. Remove the override so goreleaser's
snapshot template produces a proper dev version (e.g. 0.17.1-dev+g<sha>).
@tempusfrangit tempusfrangit enabled auto-merge (squash) February 24, 2026 23:40
@tempusfrangit tempusfrangit enabled auto-merge (squash) February 25, 2026 00:00
@tempusfrangit tempusfrangit enabled auto-merge (squash) February 25, 2026 00:09
Copy link

@gandalfhz-cf gandalfhz-cf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tempusfrangit tempusfrangit merged commit 1999533 into main Feb 25, 2026
66 of 68 checks passed
@tempusfrangit tempusfrangit deleted the fix/cancellation-wiring branch February 25, 2026 00:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cancellation is not wired up in the new cog implementation

4 participants