|
1 | 1 | # feat(plugins): LlmResiliencePlugin – configurable retries/backoff and model fallbacks |
2 | 2 |
|
3 | | -## Motivation |
4 | | -Production agents need first-class resilience to transient LLM/API failures (timeouts, 429/5xx). Today, retry/fallback logic is ad-hoc and duplicated across projects. This PR introduces a plugin-based, opt-in resilience layer for LLM calls that aligns with ADK's extensibility philosophy and addresses recurring requests: |
5 | | - |
6 | | -- #1214 Add built-in retry mechanism |
7 | | -- #2561 Retry mechanism gaps for common network errors (httpx…) |
8 | | -- Discussions: #2292, #3199 on fallbacks and max retries |
9 | | - |
10 | | -## Summary |
11 | | -Adds a new plugin `LlmResiliencePlugin` which intercepts model errors and performs: |
12 | | -- Configurable retries with exponential backoff + jitter |
13 | | -- Transient error detection (HTTP 429/500/502/503/504, httpx timeouts/connect errors, asyncio timeouts) |
14 | | -- Optional model fallbacks (try a sequence of models if primary continues to fail) |
15 | | -- Works for standard `generate_content_async` flows; supports SSE streaming by consuming to final response |
16 | | - |
17 | | -No core runner changes; this is a pure plugin. Default behavior remains unchanged unless the plugin is configured. |
18 | | - |
19 | | -## Implementation Details |
20 | | -- File: `src/google/adk/plugins/llm_resilience_plugin.py` |
21 | | -- Hooks into `on_model_error_callback` to decide whether to handle an error |
22 | | -- Retries use exponential backoff with jitter (configurable): |
23 | | - - `max_retries`, `backoff_initial`, `backoff_multiplier`, `max_backoff`, `jitter` |
24 | | -- Fallbacks use `LLMRegistry.new_llm(model)` to instantiate alternative models on failure |
25 | | -- Robust handling of provider return types: |
26 | | - - Async generator (iterates until final non-partial response) |
27 | | - - Coroutine (some providers may return a single `LlmResponse`) |
28 | | -- Avoids circular imports using duck-typed access to InvocationContext (works with Context alias) |
29 | | -- Maintains clean separation; no modification to runners or flows |
30 | | - |
31 | | -## Tests |
32 | | -- `tests/unittests/plugins/test_llm_resilience_plugin.py` |
33 | | - - `test_retry_success_on_same_model`: transient error triggers retry → success |
34 | | - - `test_fallback_model_used_after_retries`: failing primary uses fallback model → success |
35 | | - - `test_non_transient_error_bubbles`: non-transient error is ignored by plugin (propagate) |
36 | | - |
37 | | -All tests in this module pass locally: |
| 3 | +### Link to Issue or Description of Change |
38 | 4 |
|
| 5 | +**1. Link to an existing issue (if applicable):** |
| 6 | + |
| 7 | +- Closes: N/A |
| 8 | +- Related: #1214 |
| 9 | +- Related: #2561 |
| 10 | +- Related discussions: #2292, #3199 |
| 11 | + |
| 12 | +**2. Or, if no issue exists, describe the change:** |
| 13 | + |
| 14 | +**Problem:** |
| 15 | +Production agents need first-class resilience to transient LLM/API failures |
| 16 | +(timeouts, HTTP 429/5xx). Today, retry/fallback logic is often ad-hoc and |
| 17 | +duplicated across projects. |
| 18 | + |
| 19 | +**Solution:** |
| 20 | +Introduce an opt-in plugin, `LlmResiliencePlugin`, that handles transient LLM |
| 21 | +errors with configurable retries (exponential backoff + jitter) and optional |
| 22 | +model fallbacks, without modifying core runner/flow logic. |
| 23 | + |
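The retry schedule described above — exponential backoff with jitter, driven by the plugin's `backoff_initial`, `backoff_multiplier`, `max_backoff`, and `jitter` settings — can be sketched as below. The exact formula is an assumption for illustration, not necessarily the plugin's implementation:

```python
import random

def backoff_delay(attempt: int,
                  backoff_initial: float = 1.0,
                  backoff_multiplier: float = 2.0,
                  max_backoff: float = 30.0,
                  jitter: float = 0.1) -> float:
    """Delay before retry `attempt` (0-based): exponential growth,
    capped at max_backoff, plus uniform jitter to de-synchronize retries."""
    delay = min(backoff_initial * (backoff_multiplier ** attempt), max_backoff)
    return delay + random.uniform(0.0, jitter)
```

With the defaults this yields roughly 1s, 2s, 4s, 8s, ... capped at 30s, each nudged by up to `jitter` seconds.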
| 24 | +### Summary |
| 25 | + |
| 26 | +- Added `src/google/adk/plugins/llm_resilience_plugin.py`. |
| 27 | +- Exported `LlmResiliencePlugin` in `src/google/adk/plugins/__init__.py`. |
| 28 | +- Added unit tests in |
| 29 | + `tests/unittests/plugins/test_llm_resilience_plugin.py`: |
| 30 | + - `test_retry_success_on_same_model` |
| 31 | + - `test_fallback_model_used_after_retries` |
| 32 | + - `test_non_transient_error_bubbles` |
| 33 | +- Added `samples/resilient_agent.py` demo. |
| 34 | + |
| 35 | +### Testing Plan |
| 36 | + |
| 37 | +**Unit Tests:** |
| 38 | + |
| 39 | +- [x] I have added or updated unit tests for my change. |
| 40 | +- [x] All unit tests pass locally. |
| 41 | + |
| 42 | +Command run: |
| 43 | + |
| 44 | +```shell |
| 45 | +.venv/Scripts/python -m pytest tests/unittests/plugins/test_llm_resilience_plugin.py -v |
39 | 46 | ``` |
40 | | -PYTHONPATH=src pytest -q tests/unittests/plugins/test_llm_resilience_plugin.py |
41 | | -# 3 passed |
| 47 | + |
| 48 | +Result summary: |
| 49 | + |
| 50 | +```text |
| 51 | +collected 3 items |
| 52 | +tests/unittests/plugins/test_llm_resilience_plugin.py::TestLlmResiliencePlugin::test_fallback_model_used_after_retries PASSED |
| 53 | +tests/unittests/plugins/test_llm_resilience_plugin.py::TestLlmResiliencePlugin::test_non_transient_error_bubbles PASSED |
| 54 | +tests/unittests/plugins/test_llm_resilience_plugin.py::TestLlmResiliencePlugin::test_retry_success_on_same_model PASSED |
| 55 | +3 passed |
42 | 56 | ``` |
43 | 57 |
|
44 | | -## Sample |
45 | | -- `samples/resilient_agent.py` demonstrates configuring the plugin with an in-memory runner and a demo model that fails once then succeeds. |
| 58 | +**Manual End-to-End (E2E) Tests:** |
46 | 59 |
|
47 | 60 | Run sample: |
48 | 61 |
|
| 62 | +```shell |
| 63 | +.venv/Scripts/python samples/resilient_agent.py |
49 | 64 | ``` |
50 | | -PYTHONPATH=$(pwd)/src python samples/resilient_agent.py |
| 65 | + |
| 66 | +Observed output: |
| 67 | + |
| 68 | +```text |
| 69 | +LLM retry attempt 1 failed: TimeoutError('Simulated transient failure') |
| 70 | +Collected 1 events |
| 71 | +MODEL: Recovered on retry! |
51 | 72 | ``` |
52 | 73 |
|
53 | | -## Backwards Compatibility |
54 | | -- Non-breaking: users opt-in by passing the plugin into `Runner(..., plugins=[LlmResiliencePlugin(...)])` |
55 | | -- No changes to public APIs beyond exporting the plugin in `google.adk.plugins` |
| 74 | +### Checklist |
56 | 75 |
|
57 | | -## Limitations & Future Work |
58 | | -- Focused on LLM failures. Tool-level resilience is addressed by `ReflectAndRetryToolPlugin`. |
59 | | -- Circuit-breaking and per-exception policies could be added in a follow-up (`dev_3` item). |
60 | | -- Live bidi streaming not yet handled by this plugin; future work may extend to `BaseLlmConnection` flows. |
| 76 | +- [x] I have read the [CONTRIBUTING.md](https://github.com/google/adk-python/blob/main/CONTRIBUTING.md) document. |
| 77 | +- [x] I have performed a self-review of my own code. |
| 78 | +- [x] I have commented my code, particularly in hard-to-understand areas. |
| 79 | +- [x] I have added tests that prove my fix is effective or that my feature works. |
| 80 | +- [x] New and existing unit tests pass locally with my changes. |
| 81 | +- [x] I have manually tested my changes end-to-end. |
| 82 | +- [x] Any dependent changes have been merged and published in downstream modules. (N/A; no dependent changes) |
61 | 83 |
|
62 | | -## Docs |
63 | | -- Exported via `google.adk.plugins.__all__` to ease discovery |
64 | | -- Included inline docstrings and sample; can be integrated into the docs site in a separate PR |
| 84 | +### Additional context |
65 | 85 |
|
66 | | -## Checklist |
67 | | -- [x] Unit tests for new behavior |
68 | | -- [x] Sample demonstrating usage |
69 | | -- [x] No changes to core runner/flow logic |
70 | | -- [x] Code formatted and linted per repository standards |
| 86 | +- Non-breaking: users opt in via |
| 87 | + `Runner(..., plugins=[LlmResiliencePlugin(...)])`. |
 | 88 | +- Transient detection currently covers common HTTP statuses (429/5xx) and
 | 89 | +  timeout errors; follow-ups can extend it (per-exception policy, circuit breaking).
| 90 | +- Live bidirectional streaming paths are out of scope for this PR. |
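The retry-then-fallback control flow described in this PR (exhaust retries on the primary model, then move through the fallback sequence) can be sketched as follows. `call_model` and the model names are placeholders, not ADK APIs, and backoff sleeps are omitted for brevity:

```python
def call_with_resilience(call_model, models, max_retries=2):
    """Try each model up to max_retries+1 times; advance to the next
    model only after the current one keeps failing transiently."""
    last_error = None
    for model in models:
        for attempt in range(max_retries + 1):
            try:
                return call_model(model)
            except TimeoutError as err:  # stand-in for "transient" errors
                last_error = err
    raise last_error  # every model exhausted its retry budget
```

In the plugin itself, the fallback models are instantiated via `LLMRegistry.new_llm(model)` as noted in the summary; the sketch above only shows the ordering of attempts.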