extension/llm/server: worker-based OpenAI-compatible HTTP server#19994
Open
mergennachin wants to merge 12 commits into
[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19994
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 3 Unrelated Failures
As of commit 1703518 with merge base f0dff03:
NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
BROKEN TRUNK - The following job failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
mergennachin added a commit that referenced this pull request Jun 3, 2026
Wire the foundations into a FastAPI app: /v1/chat/completions (streaming and non-streaming), /v1/models, /health. Request validation rejects parameters the server can't honor (top_p != 1, seed, n > 1, frequency/presence penalties, top_k, logit_bias, logprobs, response_format other than text, non-positive max_tokens, tool_choice = required / specific function) instead of silently ignoring them; stop sequences are applied before tool parsing; client cancellation calls runner.stop(); usage is reported.

runner_pool admits physical sessions per the engine's serving_capacity() (single-slot on XNNPACK, with concurrent requests queueing on the resident session) and routes by prefix affinity.

Hermetic tests (FakeRunner via dependency injection) cover the contract, templating, sampling params, tool calls and the pool; conformance/ is a black-box suite runnable against any live OpenAI server. READMEs document the flags and scope.

Last of four stacked commits; depends on the bindings and serving foundations.

ghstack-source-id: acef8e6
ghstack-comment-id: 4617263008
Pull-Request: #19994
This was referenced Jun 9, 2026
Wire the foundations into a FastAPI app: /v1/chat/completions (streaming and
non-streaming), /v1/models, /health. Request validation rejects parameters the
server can't honor (top_p != 1, seed, n > 1, frequency/presence penalties,
top_k, logit_bias, logprobs, response_format other than text, non-positive
max_tokens, tool_choice = required / specific function) instead of silently
ignoring them; stop sequences are applied before tool parsing; usage is reported.
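To make the reject-rather-than-ignore policy concrete, here is a minimal sketch in plain Python. It is not the PR's validator: the function name `validate_chat_request`, the `UnsupportedParameter` exception, and the handling of explicit defaults (a zero penalty, `logprobs: false`, or a `null` value is treated as unset) are assumptions made for illustration.

```python
from typing import Any, Dict


class UnsupportedParameter(ValueError):
    """Hypothetical error type: the request asks for something we cannot honor."""


def validate_chat_request(body: Dict[str, Any]) -> None:
    """Raise instead of silently ignoring parameters the server cannot honor."""
    top_p = body.get("top_p")
    if top_p is not None and top_p != 1:
        raise UnsupportedParameter("top_p must be 1")

    n = body.get("n")
    if n is not None and n > 1:
        raise UnsupportedParameter("n > 1 is not supported")

    # Rejected whenever they carry a value (assumption: null means unset).
    for name in ("seed", "top_k", "logit_bias"):
        if body.get(name) is not None:
            raise UnsupportedParameter(f"{name} is not supported")

    # Assumption: falsy values (0, False, None) are equivalent to omitting these.
    for name in ("frequency_penalty", "presence_penalty", "logprobs"):
        if body.get(name):
            raise UnsupportedParameter(f"{name} is not supported")

    response_format = body.get("response_format")
    if response_format is not None and response_format.get("type") != "text":
        raise UnsupportedParameter("response_format must be text")

    max_tokens = body.get("max_tokens")
    if max_tokens is not None and max_tokens <= 0:
        raise UnsupportedParameter("max_tokens must be positive")

    # "required" and a specific-function object are both rejected.
    tool_choice = body.get("tool_choice")
    if tool_choice == "required" or isinstance(tool_choice, dict):
        raise UnsupportedParameter("tool_choice must be 'auto' or 'none'")
```

In an HTTP handler, an error like this would typically be mapped to a 4xx response so the client learns that the parameter was not applied.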
The Python process is control plane only: it loads no model and imports no
runtime pybind. Model execution runs in a separate C++ worker process
(cpp/text_llm_worker.cpp, over TextLLMEngine/TextLLMSession) that the control
plane spawns and drives over a small JSONL protocol (worker_client.py). The
protocol and the decode loop (reset, encode, context clamp, prefill, decode,
UTF-8 assembly, stop handling, stats, finish_reason) live in a shared header,
cpp/worker_loop.h, so model-specific workers reuse them; text_llm_worker only
constructs the engine/session and runs the loop.
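As an illustration of JSONL framing between the control plane and a worker, the sketch below writes one JSON object per line and folds a stream of events into a reply. The message shapes are hypothetical: the `type`, `text`, and `prompt` keys and the `token`/`done` event names are assumptions, not the schema defined in cpp/worker_loop.h; only `finish_reason` is a name taken from the description.

```python
import json
from typing import Any, BinaryIO, Dict, Iterable, Iterator, Tuple


def write_message(stream: BinaryIO, message: Dict[str, Any]) -> None:
    """Write one message as a single line of compact JSON."""
    # json.dumps escapes newlines inside strings, so a raw "\n" byte can only
    # ever mean "end of message".
    line = json.dumps(message, separators=(",", ":")).encode("utf-8") + b"\n"
    stream.write(line)
    stream.flush()


def read_messages(stream: BinaryIO) -> Iterator[Dict[str, Any]]:
    """Yield one decoded message per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def collect_reply(events: Iterable[Dict[str, Any]]) -> Tuple[str, str]:
    """Concatenate token events until the terminal event arrives."""
    pieces = []
    for event in events:
        if event["type"] == "token":  # hypothetical event name
            pieces.append(event["text"])
        elif event["type"] == "done":  # hypothetical event name
            return "".join(pieces), event["finish_reason"]
    raise RuntimeError("worker stream ended before a terminal event")
```

With a real worker, `stream` would be the child process's stdin or stdout pipe; an in-memory `io.BytesIO` is enough to exercise the framing in a hermetic test.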
The Python execution boundary is ServingChat -> SessionRuntime -> WorkerClient
-> C++ worker. ServingChat is a thin OpenAI adapter (protocol, templating, tool
parsing, streaming/SSE). SessionRuntime is the stateful runtime over a single
WorkerClient: it serializes the worker (one in-flight request) and bridges the
worker's blocking generate() into an async token stream. WorkerClient is raw
JSONL transport. There is no RunnerPool and no multi-worker scheduling/affinity
in this milestone; concurrent requests queue.
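The serialize-and-bridge behavior described for SessionRuntime can be sketched with an `asyncio.Lock` and a producer thread. This is an illustration, not the PR's class: `SerializedRuntime` and its `stream` method are hypothetical names, and `generate` stands in for any blocking call that yields text pieces. The sketch does not handle a consumer abandoning the stream early, in which case the lock would be released while the thread is still running.

```python
import asyncio
import threading
from typing import AsyncIterator, Callable, Iterator

_DONE = object()  # sentinel marking the end of one generation


class SerializedRuntime:
    """One in-flight request at a time; blocking generation exposed as an async stream."""

    def __init__(self, generate: Callable[[str], Iterator[str]]) -> None:
        self._generate = generate
        self._lock = asyncio.Lock()  # later requests queue here

    async def stream(self, prompt: str) -> AsyncIterator[str]:
        async with self._lock:
            loop = asyncio.get_running_loop()
            queue: asyncio.Queue = asyncio.Queue()

            def pump() -> None:
                # Runs in a worker thread; hands each item back to the event
                # loop, since asyncio.Queue is not thread-safe on its own.
                try:
                    for token in self._generate(prompt):
                        loop.call_soon_threadsafe(queue.put_nowait, token)
                except Exception as exc:
                    loop.call_soon_threadsafe(queue.put_nowait, exc)
                else:
                    loop.call_soon_threadsafe(queue.put_nowait, _DONE)

            threading.Thread(target=pump, daemon=True).start()
            while True:
                item = await queue.get()
                if item is _DONE:
                    break
                if isinstance(item, Exception):
                    raise item
                yield item
```

Creating the runtime from inside a running coroutine avoids the lock binding to a different event loop on older Python versions. Two concurrent `stream()` calls then run back to back: the second waits on the lock until the first has drained its queue.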
SessionRuntime is introduced here as the stable control-plane boundary for the
rest of the stack: its method/field surface (session_id routing, reset, warm-
resume stats on GenStats, token-ID prompt_segments on PromptInput/_WorkerRequest)
is defined once, but the behavior and tests that activate those features land in
their natural later commits -- named-session routing/admission (V2a), warm
append-only resume (V2b.1), and token-ID prompt segments (V2b.1.5). This keeps
the boundary stable for whole-stack review instead of re-shaping it every commit.
There is no prefix cache and no Python-side KV state; cancellation is
best-effort (the control plane stops consuming, the worker finishes the
in-flight request). Hermetic tests (a FakeRunner worker) cover the contract,
templating, sampling params, tool calls, the runtime, and the worker protocol;
conformance/ is a black-box suite runnable against any live OpenAI server.
READMEs document the flags and scope.
Depends on the serving foundations.
Part of #20001