-
Notifications
You must be signed in to change notification settings - Fork 3.1k
Description
🔴 Required Information
Is your feature request related to a specific problem?
Yes. Currently, the ADK BasePlugin architecture does not report errors at the agent or invocation (run) level. The framework supports on_model_error_callback and on_tool_error_callback, but if an unhandled exception occurs that crashes the agent process (e.g. fatal tool error, model crashing out of the runner.run_async() loop), the after_agent_callback and after_run_callback bindings are skipped.
Consequently, the AGENT_COMPLETED and INVOCATION_COMPLETED events are never emitted to observability sinks (like the BigQueryAgentAnalyticsPlugin). Because these failed runs disappear from the denominator in standard reports, systems will display artificially perfect durations and artificially high success rates (as the failures are never formally registered).
Describe the Solution You'd Like
The ADK BasePlugin should introduce new dedicated lifecycle error hooks:
on_agent_error_callback(agent: Agent, error: Exception, ...)on_run_error_callback(invocation_id: str, error: Exception, ...)
The BigQueryAgentAnalyticsPlugin (and other telemetry plugins) should implement these to formally emitAGENT_ERRORandINVOCATION_ERRORevent types into the downstream datastore immediately prior to the process dying.
Impact on your work
Without these hooks, teams adopting ADK tooling for reliable enterprise deployment cannot accurately measure their agentic pipeline's systemic failure rate. Observability platforms must rely on complicated post-processing state engines rather than simple streaming event aggregations. This is critical for enterprise observability.
🟡 Recommended Information
Describe Alternatives You've Considered
- Time-Based Heuristic (Current workaround): Implementing custom BigQuery views that enforce a timeout logic. If a
STARTINGevent hasn't received a correspondingCOMPLETEDevent within that window, we manually flag it as anERROR. This risks false positives for exceptionally long-running agents, and forces downstream dashboarding metrics to accommodate missing data rather than proper upstream telemetry.