|
1 | 1 | # feat(plugins): LlmResiliencePlugin – configurable retries/backoff and model fallbacks |
2 | 2 |
|
3 | | -## Motivation |
4 | | -Production agents need first-class resilience to transient LLM/API failures (timeouts, 429/5xx). Today, retry/fallback logic is ad-hoc and duplicated across projects. This PR introduces a plugin-based, opt-in resilience layer for LLM calls that aligns with ADK's extensibility philosophy and addresses recurring requests: |
5 | | - |
6 | | -- #1214 Add built-in retry mechanism |
7 | | -- #2561 Retry mechanism gaps for common network errors (httpx…) |
8 | | -- Discussions: #2292, #3199 on fallbacks and max retries |
9 | | - |
10 | | -## Summary |
11 | | -Adds a new plugin `LlmResiliencePlugin` which intercepts model errors and performs: |
12 | | -- Configurable retries with exponential backoff + jitter |
13 | | -- Transient error detection (HTTP 429/500/502/503/504, httpx timeouts/connect errors, asyncio timeouts) |
14 | | -- Optional model fallbacks (try a sequence of models if primary continues to fail) |
15 | | -- Works for standard `generate_content_async` flows; supports SSE streaming by consuming to final response |
16 | | - |
17 | | -No core runner changes; this is a pure plugin. Default behavior remains unchanged unless the plugin is configured. |
18 | | - |
19 | | -## Implementation Details |
20 | | -- File: `src/google/adk/plugins/llm_resilience_plugin.py` |
21 | | -- Hooks into `on_model_error_callback` to decide whether to handle an error |
22 | | -- Retries use exponential backoff with jitter (configurable): |
23 | | - - `max_retries`, `backoff_initial`, `backoff_multiplier`, `max_backoff`, `jitter` |
24 | | -- Fallbacks use `LLMRegistry.new_llm(model)` to instantiate alternative models on failure |
25 | | -- Robust handling of provider return types: |
26 | | - - Async generator (iterates until final non-partial response) |
27 | | - - Coroutine (some providers may return a single `LlmResponse`) |
28 | | -- Avoids circular imports using duck-typed access to InvocationContext (works with Context alias) |
29 | | -- Maintains clean separation; no modification to runners or flows |
30 | | - |
31 | | -## Tests |
32 | | -- `tests/unittests/plugins/test_llm_resilience_plugin.py` |
33 | | - - `test_retry_success_on_same_model`: transient error triggers retry → success |
34 | | - - `test_fallback_model_used_after_retries`: failing primary uses fallback model → success |
35 | | - - `test_non_transient_error_bubbles`: non-transient error is ignored by plugin (propagate) |
36 | | - |
37 | | -All tests in this module pass locally: |
| 3 | +### Link to Issue or Description of Change |
38 | 4 |
|
| 5 | +**1. Link to an existing issue (if applicable):** |
| 6 | + |
| 7 | +- Closes: N/A |
| 8 | +- Related: #1214 |
| 9 | +- Related: #2561 |
| 10 | +- Related discussions: #2292, #3199 |
| 11 | + |
| 12 | +**2. Or, if no issue exists, describe the change:** |
| 13 | + |
| 14 | +**Problem:** |
| 15 | +Production agents need first-class resilience to transient LLM/API failures |
| 16 | +(timeouts, HTTP 429/5xx). Today, retry/fallback logic is often ad-hoc and |
| 17 | +duplicated across projects. |
| 18 | + |
| 19 | +**Solution:** |
| 20 | +Introduce an opt-in plugin, `LlmResiliencePlugin`, that handles transient LLM |
| 21 | +errors with configurable retries (exponential backoff + jitter) and optional |
| 22 | +model fallbacks, without modifying core runner/flow logic. |
| 23 | + |
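The retry schedule described above — exponential backoff with jitter, driven by the plugin's `backoff_initial`, `backoff_multiplier`, `max_backoff`, and `jitter` settings — can be sketched as below. The exact formula is an assumption for illustration, not necessarily the plugin's implementation:

```python
import random

def backoff_delay(attempt: int,
                  backoff_initial: float = 1.0,
                  backoff_multiplier: float = 2.0,
                  max_backoff: float = 30.0,
                  jitter: float = 0.1) -> float:
    """Delay before retry `attempt` (0-based): exponential growth,
    capped at max_backoff, plus uniform jitter to de-synchronize retries."""
    delay = min(backoff_initial * (backoff_multiplier ** attempt), max_backoff)
    return delay + random.uniform(0.0, jitter)
```

With the defaults this yields roughly 1s, 2s, 4s, 8s, ... capped at 30s, each nudged by up to `jitter` seconds.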
| 24 | +### Summary |
| 25 | + |
| 26 | +- Added `src/google/adk/plugins/llm_resilience_plugin.py`. |
| 27 | +- Exported `LlmResiliencePlugin` in `src/google/adk/plugins/__init__.py`. |
| 28 | +- Added unit tests in |
| 29 | + `tests/unittests/plugins/test_llm_resilience_plugin.py`: |
| 30 | + - `test_retry_success_on_same_model` |
| 31 | + - `test_fallback_model_used_after_retries` |
| 32 | + - `test_non_transient_error_bubbles` |
| 33 | +- Added `samples/resilient_agent.py` demo. |
| 34 | + |
| 35 | +### Testing Plan |
| 36 | + |
| 37 | +**Unit Tests:** |
| 38 | + |
| 39 | +- [x] I have added or updated unit tests for my change. |
| 40 | +- [x] All unit tests pass locally. |
| 41 | + |
| 42 | +Command run: |
| 43 | + |
| 44 | +```shell |
| 45 | +.venv/Scripts/python -m pytest tests/unittests/plugins/test_llm_resilience_plugin.py -v |
39 | 46 | ``` |
40 | | -PYTHONPATH=src pytest -q tests/unittests/plugins/test_llm_resilience_plugin.py |
41 | | -# 3 passed |
| 47 | + |
| 48 | +Result summary: |
| 49 | + |
| 50 | +```text |
| 51 | +collected 3 items |
| 52 | +tests/unittests/plugins/test_llm_resilience_plugin.py::TestLlmResiliencePlugin::test_fallback_model_used_after_retries PASSED |
| 53 | +tests/unittests/plugins/test_llm_resilience_plugin.py::TestLlmResiliencePlugin::test_non_transient_error_bubbles PASSED |
| 54 | +tests/unittests/plugins/test_llm_resilience_plugin.py::TestLlmResiliencePlugin::test_retry_success_on_same_model PASSED |
| 55 | +3 passed |
42 | 56 | ``` |
43 | 57 |
|
44 | | -## Sample |
45 | | -- `samples/resilient_agent.py` demonstrates configuring the plugin with an in-memory runner and a demo model that fails once then succeeds. |
| 58 | +**Manual End-to-End (E2E) Tests:** |
46 | 59 |
|
47 | 60 | Run sample: |
48 | 61 |
|
| 62 | +```shell |
| 63 | +.venv/Scripts/python samples/resilient_agent.py |
49 | 64 | ``` |
50 | | -PYTHONPATH=$(pwd)/src python samples/resilient_agent.py |
| 65 | + |
| 66 | +Observed output: |
| 67 | + |
| 68 | +```text |
| 69 | +LLM retry attempt 1 failed: TimeoutError('Simulated transient failure') |
| 70 | +Collected 1 events |
| 71 | +MODEL: Recovered on retry! |
51 | 72 | ``` |
52 | 73 |
|
53 | | -## Backwards Compatibility |
54 | | -- Non-breaking: users opt-in by passing the plugin into `Runner(..., plugins=[LlmResiliencePlugin(...)])` |
55 | | -- No changes to public APIs beyond exporting the plugin in `google.adk.plugins` |
| 74 | +### Checklist |
56 | 75 |
|
57 | | -## Limitations & Future Work |
58 | | -- Focused on LLM failures. Tool-level resilience is addressed by `ReflectAndRetryToolPlugin`. |
59 | | -- Circuit-breaking and per-exception policies could be added in a follow-up (`dev_3` item). |
60 | | -- Live bidi streaming not yet handled by this plugin; future work may extend to `BaseLlmConnection` flows. |
| 76 | +- [x] I have read the [CONTRIBUTING.md](https://github.com/google/adk-python/blob/main/CONTRIBUTING.md) document. |
| 77 | +- [x] I have performed a self-review of my own code. |
| 78 | +- [x] I have commented my code, particularly in hard-to-understand areas. |
| 79 | +- [x] I have added tests that prove my fix is effective or that my feature works. |
| 80 | +- [x] New and existing unit tests pass locally with my changes. |
| 81 | +- [x] I have manually tested my changes end-to-end. |
| 82 | +- [x] Any dependent changes have been merged and published in downstream modules. (N/A; no dependent changes) |
61 | 83 |
|
62 | | -## Docs |
63 | | -- Exported via `google.adk.plugins.__all__` to ease discovery |
64 | | -- Included inline docstrings and sample; can be integrated into the docs site in a separate PR |
| 84 | +### Additional context |
65 | 85 |
|
66 | | -## Checklist |
67 | | -- [x] Unit tests for new behavior |
68 | | -- [x] Sample demonstrating usage |
69 | | -- [x] No changes to core runner/flow logic |
70 | | -- [x] Code formatted and linted per repository standards |
| 86 | +- Non-breaking: users opt in via |
| 87 | + `Runner(..., plugins=[LlmResiliencePlugin(...)])`. |
 | 88 | +- Transient detection currently covers common HTTP statuses (429/5xx) and
 | 89 | +  timeout errors; follow-ups can extend it (per-exception policy, circuit breaking).
| 90 | +- Live bidirectional streaming paths are out of scope for this PR. |
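The retry-then-fallback control flow described in this PR (exhaust retries on the primary model, then move through the fallback sequence) can be sketched as follows. `call_model` and the model names are placeholders, not ADK APIs, and backoff sleeps are omitted for brevity:

```python
def call_with_resilience(call_model, models, max_retries=2):
    """Try each model up to max_retries+1 times; advance to the next
    model only after the current one keeps failing transiently."""
    last_error = None
    for model in models:
        for attempt in range(max_retries + 1):
            try:
                return call_model(model)
            except TimeoutError as err:  # stand-in for "transient" errors
                last_error = err
    raise last_error  # every model exhausted its retry budget
```

In the plugin itself, the fallback models are instantiated via `LLMRegistry.new_llm(model)` as noted in the summary; the sketch above only shows the ordering of attempts.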