Commit e0bb1fa

feat(plugins): add LlmResiliencePlugin with retries and fallbacks
Adds plugin export, unit tests, resilient sample, PR body updates, and contribution note with validation evidence.

Co-Authored-By: Warp <agent@warp.dev>
1 parent a216a41 commit e0bb1fa

6 files changed

Lines changed: 247 additions & 85 deletions

CONTRIBUTION_NOTE.txt

Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
+LlmResiliencePlugin Contribution Note
+=====================================
+
+What we implemented
+-------------------
+1) New plugin:
+   - Added src/google/adk/plugins/llm_resilience_plugin.py
+   - Provides retry + backoff + jitter + optional model fallbacks for LLM errors.
+
+2) Plugin export:
+   - Updated src/google/adk/plugins/__init__.py
+   - Exported LlmResiliencePlugin in __all__ for discoverability.
+
+3) Unit tests:
+   - Added tests/unittests/plugins/test_llm_resilience_plugin.py
+   - Covered:
+     - retry success on same model
+     - fallback model after retries
+     - non-transient errors bubbling correctly
+
+4) Usage sample:
+   - Added samples/resilient_agent.py
+   - Demonstrates plugin setup and recovery behavior.
+
+5) PR narrative and testing evidence:
+   - Updated PR_BODY.md to match repository PR template:
+     - issue/description
+     - testing plan
+     - manual E2E output
+     - checklist
+
+
+Why this contribution is meaningful
+-----------------------------------
+1) Solves a real reliability gap:
+   Production agents frequently face transient failures (timeouts, 429, 5xx).
+   This change centralizes resilience behavior and removes repeated ad-hoc retry code.
+
+2) Low-risk architecture:
+   The feature is plugin-based and opt-in.
+   Existing users are unaffected unless they configure the plugin.
+
+3) Practical for maintainers and users:
+   Includes tests and a runnable sample, reducing review friction and making adoption easier.
+
+4) Aligns with ADK extensibility:
+   Keeps resilience logic at the plugin layer without changing core runner/flow behavior.
+
+
+Key design reasons
+------------------
+1) on_model_error_callback hook:
+   Best fit for intercepting model failures and deciding retry/fallback behavior.
+
+2) Exponential backoff with jitter:
+   Reduces retry storms and aligns with standard distributed-system reliability practices.
+
+3) Model fallback support:
+   Improves chance of successful completion when a single provider/model is degraded.
+
+4) Robust provider response handling:
+   Supports async-generator and coroutine style returns to handle provider differences.
+
+5) Type-safety/cycle-safe update:
+   Added TYPE_CHECKING import pattern for InvocationContext to avoid runtime issues.
+
+
+Validation performed
+--------------------
+1) Formatting:
+   - isort applied to changed Python files
+   - pyink applied to changed Python files
+
+2) Unit tests:
+   - Command:
+     .venv/Scripts/python -m pytest tests/unittests/plugins/test_llm_resilience_plugin.py -v
+   - Result: 3 passed
+
+3) Manual E2E sample run:
+   - Command:
+     .venv/Scripts/python samples/resilient_agent.py
+   - Observed:
+     LLM retry attempt 1 failed: TimeoutError('Simulated transient failure')
+     Collected 1 events
+     MODEL: Recovered on retry!
+
+
+Scope and limitations
+---------------------
+- This PR focuses on LLM call resilience only.
+- Live bidirectional streaming paths are out of scope for this change.
+- Future enhancements can add per-exception policies and circuit-breaker style controls.
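
Reviewer sketch (not part of the commit): the "Exponential backoff with jitter" design point in the note can be illustrated with a few lines of standalone Python. This is not the plugin's code. The parameter names follow the options named in the previous PR_BODY.md text (`max_retries`, `backoff_initial`, `backoff_multiplier`, `max_backoff`, `jitter`), while the default values, the `rng` parameter, the function name `backoff_delays`, and the reading of `jitter` as a fraction of the delay are assumptions made for illustration.

```python
import random
from typing import List, Optional


def backoff_delays(
    max_retries: int = 3,
    backoff_initial: float = 1.0,
    backoff_multiplier: float = 2.0,
    max_backoff: float = 10.0,
    jitter: float = 0.2,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one sleep duration in seconds per retry attempt.

    Hypothetical helper: `jitter` is treated as a fraction of the capped
    delay, which is an assumption and may differ from the plugin.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        # Exponential growth, capped so late retries do not wait unboundedly.
        base = min(backoff_initial * backoff_multiplier**attempt, max_backoff)
        # Add up to `jitter * base` seconds so clients do not retry in lockstep.
        delays.append(base + rng.uniform(0.0, jitter * base))
    return delays
```

With `jitter=0.0` the schedule is deterministic (1, 2, 4, 8 seconds for four retries with the defaults above); a non-zero `jitter` spreads each delay over a small window above that base value.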

PR_BODY.md

Lines changed: 75 additions & 55 deletions
@@ -1,70 +1,90 @@
 # feat(plugins): LlmResiliencePlugin – configurable retries/backoff and model fallbacks

-## Motivation
-Production agents need first-class resilience to transient LLM/API failures (timeouts, 429/5xx). Today, retry/fallback logic is ad-hoc and duplicated across projects. This PR introduces a plugin-based, opt-in resilience layer for LLM calls that aligns with ADK's extensibility philosophy and addresses recurring requests:
-
-- #1214 Add built-in retry mechanism
-- #2561 Retry mechanism gaps for common network errors (httpx…)
-- Discussions: #2292, #3199 on fallbacks and max retries
-
-## Summary
-Adds a new plugin `LlmResiliencePlugin` which intercepts model errors and performs:
-- Configurable retries with exponential backoff + jitter
-- Transient error detection (HTTP 429/500/502/503/504, httpx timeouts/connect errors, asyncio timeouts)
-- Optional model fallbacks (try a sequence of models if primary continues to fail)
-- Works for standard `generate_content_async` flows; supports SSE streaming by consuming to final response
-
-No core runner changes; this is a pure plugin. Default behavior remains unchanged unless the plugin is configured.
-
-## Implementation Details
-- File: `src/google/adk/plugins/llm_resilience_plugin.py`
-- Hooks into `on_model_error_callback` to decide whether to handle an error
-- Retries use exponential backoff with jitter (configurable):
-  - `max_retries`, `backoff_initial`, `backoff_multiplier`, `max_backoff`, `jitter`
-- Fallbacks use `LLMRegistry.new_llm(model)` to instantiate alternative models on failure
-- Robust handling of provider return types:
-  - Async generator (iterates until final non-partial response)
-  - Coroutine (some providers may return a single `LlmResponse`)
-- Avoids circular imports using duck-typed access to InvocationContext (works with Context alias)
-- Maintains clean separation; no modification to runners or flows
-
-## Tests
-- `tests/unittests/plugins/test_llm_resilience_plugin.py`
-  - `test_retry_success_on_same_model`: transient error triggers retry → success
-  - `test_fallback_model_used_after_retries`: failing primary uses fallback model → success
-  - `test_non_transient_error_bubbles`: non-transient error is ignored by plugin (propagate)
-
-All tests in this module pass locally:
+### Link to Issue or Description of Change

+**1. Link to an existing issue (if applicable):**
+
+- Closes: N/A
+- Related: #1214
+- Related: #2561
+- Related discussions: #2292, #3199
+
+**2. Or, if no issue exists, describe the change:**
+
+**Problem:**
+Production agents need first-class resilience to transient LLM/API failures
+(timeouts, HTTP 429/5xx). Today, retry/fallback logic is often ad-hoc and
+duplicated across projects.
+
+**Solution:**
+Introduce an opt-in plugin, `LlmResiliencePlugin`, that handles transient LLM
+errors with configurable retries (exponential backoff + jitter) and optional
+model fallbacks, without modifying core runner/flow logic.
+
+### Summary
+
+- Added `src/google/adk/plugins/llm_resilience_plugin.py`.
+- Exported `LlmResiliencePlugin` in `src/google/adk/plugins/__init__.py`.
+- Added unit tests in
+  `tests/unittests/plugins/test_llm_resilience_plugin.py`:
+  - `test_retry_success_on_same_model`
+  - `test_fallback_model_used_after_retries`
+  - `test_non_transient_error_bubbles`
+- Added `samples/resilient_agent.py` demo.
+
+### Testing Plan
+
+**Unit Tests:**
+
+- [x] I have added or updated unit tests for my change.
+- [x] All unit tests pass locally.
+
+Command run:
+
+```shell
+.venv/Scripts/python -m pytest tests/unittests/plugins/test_llm_resilience_plugin.py -v
 ```
-PYTHONPATH=src pytest -q tests/unittests/plugins/test_llm_resilience_plugin.py
-# 3 passed
+
+Result summary:
+
+```text
+collected 3 items
+tests/unittests/plugins/test_llm_resilience_plugin.py::TestLlmResiliencePlugin::test_fallback_model_used_after_retries PASSED
+tests/unittests/plugins/test_llm_resilience_plugin.py::TestLlmResiliencePlugin::test_non_transient_error_bubbles PASSED
+tests/unittests/plugins/test_llm_resilience_plugin.py::TestLlmResiliencePlugin::test_retry_success_on_same_model PASSED
+3 passed
 ```

-## Sample
-- `samples/resilient_agent.py` demonstrates configuring the plugin with an in-memory runner and a demo model that fails once then succeeds.
+**Manual End-to-End (E2E) Tests:**

 Run sample:

+```shell
+.venv/Scripts/python samples/resilient_agent.py
 ```
-PYTHONPATH=$(pwd)/src python samples/resilient_agent.py
+
+Observed output:
+
+```text
+LLM retry attempt 1 failed: TimeoutError('Simulated transient failure')
+Collected 1 events
+MODEL: Recovered on retry!
 ```

-## Backwards Compatibility
-- Non-breaking: users opt-in by passing the plugin into `Runner(..., plugins=[LlmResiliencePlugin(...)])`
-- No changes to public APIs beyond exporting the plugin in `google.adk.plugins`
+### Checklist

-## Limitations & Future Work
-- Focused on LLM failures. Tool-level resilience is addressed by `ReflectAndRetryToolPlugin`.
-- Circuit-breaking and per-exception policies could be added in a follow-up (`dev_3` item).
-- Live bidi streaming not yet handled by this plugin; future work may extend to `BaseLlmConnection` flows.
+- [x] I have read the [CONTRIBUTING.md](https://github.com/google/adk-python/blob/main/CONTRIBUTING.md) document.
+- [x] I have performed a self-review of my own code.
+- [x] I have commented my code, particularly in hard-to-understand areas.
+- [x] I have added tests that prove my fix is effective or that my feature works.
+- [x] New and existing unit tests pass locally with my changes.
+- [x] I have manually tested my changes end-to-end.
+- [x] Any dependent changes have been merged and published in downstream modules. (N/A; no dependent changes)

-## Docs
-- Exported via `google.adk.plugins.__all__` to ease discovery
-- Included inline docstrings and sample; can be integrated into the docs site in a separate PR
+### Additional context

-## Checklist
-- [x] Unit tests for new behavior
-- [x] Sample demonstrating usage
-- [x] No changes to core runner/flow logic
-- [x] Code formatted and linted per repository standards
+- Non-breaking: users opt in via
+  `Runner(..., plugins=[LlmResiliencePlugin(...)])`.
+- Transient detection currently targets common HTTP/timeouts and can be extended
+  in follow-ups (e.g., per-exception policy, circuit breaking).
+- Live bidirectional streaming paths are out of scope for this PR.
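
Reviewer sketch (not part of the commit): the PR body says transient detection targets common HTTP statuses and timeouts, and the test `test_non_transient_error_bubbles` relies on everything else propagating. The following standalone Python shows one way such a classifier could look. It is an assumption-laden illustration rather than the plugin's logic: the function name `is_transient` and the `status_code` attribute lookup are hypothetical, and httpx-specific exception types are deliberately left out to keep the example standard-library only.

```python
import asyncio

# Status codes the PR body lists as transient.
TRANSIENT_STATUS_CODES = frozenset({429, 500, 502, 503, 504})


def is_transient(error: BaseException) -> bool:
    """Return True if retrying the failed call might plausibly succeed."""
    # Timeouts and dropped connections are treated as retryable.
    if isinstance(error, (TimeoutError, asyncio.TimeoutError, ConnectionError)):
        return True
    # Hypothetical convention: the exception carries a numeric `status_code`.
    status = getattr(error, "status_code", None)
    return isinstance(status, int) and status in TRANSIENT_STATUS_CODES
```

Anything the classifier rejects (for example a `ValueError` raised by request validation) would be left for the caller to handle, which matches the "non-transient errors bubble" behavior described in the tests.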

samples/resilient_agent.py

Lines changed: 13 additions & 6 deletions
@@ -12,15 +12,15 @@
 import asyncio

 from google.adk.agents.llm_agent import LlmAgent
+from google.adk.artifacts.in_memory_artifact_service import InMemoryArtifactService
+from google.adk.memory.in_memory_memory_service import InMemoryMemoryService
 from google.adk.models.base_llm import BaseLlm
 from google.adk.models.llm_request import LlmRequest
 from google.adk.models.llm_response import LlmResponse
 from google.adk.models.registry import LLMRegistry
 from google.adk.plugins.llm_resilience_plugin import LlmResiliencePlugin
 from google.adk.runners import Runner
 from google.adk.sessions.in_memory_session_service import InMemorySessionService
-from google.adk.artifacts.in_memory_artifact_service import InMemoryArtifactService
-from google.adk.memory.in_memory_memory_service import InMemoryMemoryService
 from google.genai import types


@@ -32,14 +32,17 @@ class DemoFailThenSucceedModel(BaseLlm):
   def supported_models(cls) -> list[str]:
     return ["demo-fail-succeed"]

-  async def generate_content_async(self, llm_request: LlmRequest, stream: bool = False):
+  async def generate_content_async(
+      self, llm_request: LlmRequest, stream: bool = False
+  ):
     # Fail for the first attempt, then succeed
     self.attempts += 1
     if self.attempts < 2:
       raise TimeoutError("Simulated transient failure")
     yield LlmResponse(
         content=types.Content(
-            role="model", parts=[types.Part.from_text(text="Recovered on retry!")]
+            role="model",
+            parts=[types.Part.from_text(text="Recovered on retry!")],
         ),
         partial=False,
     )
@@ -76,12 +79,16 @@ async def main():
   )

   # Create a session and run once
-  session = await session_service.create_session(app_name="resilience_demo", user_id="demo")
+  session = await session_service.create_session(
+      app_name="resilience_demo", user_id="demo"
+  )
   events = []
   async for ev in runner.run_async(
       user_id=session.user_id,
       session_id=session.id,
-      new_message=types.Content(role="user", parts=[types.Part.from_text(text="hello")]),
+      new_message=types.Content(
+          role="user", parts=[types.Part.from_text(text="hello")]
+      ),
   ):
     events.append(ev)

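Reviewer sketch (not part of the commit): the sample's model is an async generator that raises on the first attempt and yields a response on the second. The standalone Python below mirrors that shape without ADK, and also shows the "async generator or coroutine" handling that the contribution note mentions. All names here (`final_result`, `FlakyModel`, `call_with_retries`) are hypothetical stand-ins, plain strings replace `LlmResponse`, and taking the last yielded item is a simplification of "iterate until the final non-partial response".

```python
import asyncio
import inspect


async def final_result(call_result):
    """Normalize a provider return value to a single final item."""
    if inspect.isasyncgen(call_result):
        last = None
        # Drain the stream; the last item stands in for the final response.
        async for item in call_result:
            last = item
        return last
    if inspect.isawaitable(call_result):
        # Coroutine-style providers return one value when awaited.
        return await call_result
    return call_result


class FlakyModel:
    """Stand-in for the demo model: fails once, then succeeds."""

    def __init__(self):
        self.attempts = 0

    async def generate(self):
        self.attempts += 1
        if self.attempts < 2:
            raise TimeoutError("Simulated transient failure")
        yield "Recovered on retry!"


async def call_with_retries(model, max_retries=2):
    """Retry on TimeoutError; re-raise once the retry budget is spent."""
    for attempt in range(max_retries + 1):
        try:
            # The generator body runs (and may raise) only when iterated.
            return await final_result(model.generate())
        except TimeoutError:
            if attempt == max_retries:
                raise
```

Running `asyncio.run(call_with_retries(FlakyModel()))` returns the recovered message after one failed attempt, which is the same recovery path the sample demonstrates through the plugin. A real retry loop would also sleep between attempts; that is omitted here to keep the sketch short.
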
src/google/adk/plugins/__init__.py

Lines changed: 1 addition & 2 deletions
@@ -14,12 +14,11 @@

 from .base_plugin import BasePlugin
 from .debug_logging_plugin import DebugLoggingPlugin
+from .llm_resilience_plugin import LlmResiliencePlugin
 from .logging_plugin import LoggingPlugin
 from .plugin_manager import PluginManager
 from .reflect_retry_tool_plugin import ReflectAndRetryToolPlugin

-from .llm_resilience_plugin import LlmResiliencePlugin
-
 __all__ = [
     'BasePlugin',
     'DebugLoggingPlugin',
