Gemini: cannot bound or diagnose silent reads — HttpOptions.timeout only sets aiohttp total #5802

@KrishnaVemu

Problem

In production we periodically see LlmAgent → Vertex calls go silent for minutes, then surface as a bare asyncio.TimeoutError with no actionable message. A real incident: one LLM call completed cleanly at 08:16:03, and the next went silent until 08:19:29 (~3.5 minutes) before raising. Nothing was logged during the silent window. From the bare exception alone, we cannot tell which of these happened:

  • Never connected (DNS / TCP / TLS failure)
  • Connected, request accepted, no response body ever returned
  • Mid-stream stall (some bytes arrived, then connection went silent)
  • Vertex honored its server-side deadline and returned 5xx
  • TCP reset / FIN / NAT eviction
  • Our client-side deadline fired against a still-processing server

Without that distinction we cannot decide whether to retry, fail over, escalate, or fix a real bug.
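To make the distinction concrete, here is a minimal sketch of the classification we would like to be able to do. Everything in it is hypothetical: TransportState, FailureMode, and classify are illustrative names, not part of ADK, google-genai, or aiohttp, and the state fields would have to be filled in by transport hooks that the SDK does not expose today. The sketch also merges "no response body ever returned" with "client deadline fired against a still-processing server", on the assumption that transport state alone cannot separate them.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    NEVER_CONNECTED = "never connected (DNS / TCP / TLS)"
    NO_RESPONSE_BEFORE_DEADLINE = "connected, but no response bytes before the deadline"
    MID_STREAM_STALL = "some bytes arrived, then the stream went silent"
    SERVER_ERROR = "server returned 5xx"
    CONNECTION_DROPPED = "connection reset or closed by the peer"
    UNKNOWN = "unclassified"


@dataclass
class TransportState:
    """Hypothetical per-request record, updated by transport hooks."""
    connected: bool = False
    bytes_received: int = 0
    status: int | None = None  # HTTP status, if response headers arrived


def classify(state: TransportState, exc: BaseException | None) -> FailureMode:
    """Map the last known transport state plus the exception to a failure mode."""
    if state.status is not None and state.status >= 500:
        return FailureMode.SERVER_ERROR
    if isinstance(exc, ConnectionError):
        # Covers ConnectionResetError, ConnectionRefusedError, BrokenPipeError.
        if state.connected:
            return FailureMode.CONNECTION_DROPPED
        return FailureMode.NEVER_CONNECTED
    if isinstance(exc, (asyncio.TimeoutError, TimeoutError)):
        if not state.connected:
            return FailureMode.NEVER_CONNECTED
        if state.bytes_received > 0:
            return FailureMode.MID_STREAM_STALL
        return FailureMode.NO_RESPONSE_BEFORE_DEADLINE
    return FailureMode.UNKNOWN
```

With a record like this attached to the exception, a retry policy could treat NEVER_CONNECTED as safe to retry immediately while handling MID_STREAM_STALL differently.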

Why the current surface isn't enough

  1. HttpOptions.timeout is a single integer (ms) that becomes aiohttp.ClientTimeout(total=...) in google/genai/_api_client.py, in the per-request call session.request(..., timeout=aiohttp.ClientTimeout(total=http_request.timeout)). Only total is set; sock_read, sock_connect, and connect are all left unset.

  2. aiohttp.ClientTimeout.total alone does not guarantee the deadline fires on a truly silent read — see aio-libs/aiohttp#11740 (maintainer-confirmed Oct 2025). sock_read is required for a hard ceiling.

  3. There is no TraceConfig seam. The SDK builds its own internal aiohttp.ClientSession lazily and never exposes it. We cannot attach hooks for on_connection_create_start, on_request_end, on_response_chunk_received, on_request_exception. Without those, "silent for 3.5 minutes" gives zero signal about what state the transport was in.

  4. Per-request HttpOptions(aiohttp_client=session) doesn't work under ADK. ADK's tracing serializes GenerateContentConfig via model_dump, and a live aiohttp.ClientSession is not pydantic-serializable. Result: PydanticSerializationError.

  5. HttpOptions(async_client_args={'timeout': ...}) raises TypeError — see googleapis/python-genai#1899.
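For reference, this is roughly the session we would want to build if the seam existed. It is a sketch against plain aiohttp only (ClientTimeout, TraceConfig, ClientSession); the timeout values and the helper names build_timeout, build_trace_config, and make_session are illustrative assumptions, and nothing here goes through the genai SDK.

```python
import time

import aiohttp


def build_timeout() -> aiohttp.ClientTimeout:
    # All values are in seconds and purely illustrative.
    return aiohttp.ClientTimeout(
        total=300,        # overall ceiling for the whole request
        sock_connect=10,  # connecting a new socket to the peer
        sock_read=60,     # maximum wait for the next portion of data
    )


def build_trace_config(events: list) -> aiohttp.TraceConfig:
    """Record (event name, monotonic timestamp) for each transport transition."""
    trace = aiohttp.TraceConfig()

    def recorder(name: str):
        # aiohttp calls trace callbacks as: await cb(session, ctx, params)
        async def on_event(session, ctx, params):
            events.append((name, time.monotonic()))
        return on_event

    trace.on_connection_create_start.append(recorder("connection_create_start"))
    trace.on_request_end.append(recorder("request_end"))
    trace.on_response_chunk_received.append(recorder("response_chunk_received"))
    trace.on_request_exception.append(recorder("request_exception"))
    return trace


async def make_session(events: list) -> aiohttp.ClientSession:
    # Call this from inside the running event loop that will use the session.
    return aiohttp.ClientSession(
        timeout=build_timeout(),
        trace_configs=[build_trace_config(events)],
    )
```

After a failure, the last entry in events shows how far the request got, which is exactly the signal missing from the 3.5-minute silent window.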

Current workaround

Subclass Gemini and override the api_client @cached_property (per the docstring at google_llm.py:95-112), returning a Client(http_options=HttpOptions(aiohttp_client=<custom session with TraceConfig + per-phase timeouts>)).

This works, but it has real sharp edges:

  • We must manually re-derive _tracking_headers(), retry_options, base_url, api_version, and the vertexai=True branch. Any default that the upstream api_client adds later and that we don't mirror is silently lost. (Tracking headers in particular are easy to drop and hard to notice.)
  • Session lifecycle is on the caller. The session is event-loop-affine, so it must be built inside FastAPI lifespan and closed on shutdown. ADK provides no documented helper or lifespan hook for this — every adopter rediscovers it.
  • The injected session is not visible to ADK telemetry, so per-call diagnostic fields land in our own logs rather than ADK spans.
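The lifecycle part of the workaround looks roughly like the sketch below. It assumes a FastAPI-style lifespan function (an async context manager that receives the app) and stores the session on app.state under the hypothetical attribute name llm_session; the timeout values are illustrative. The Gemini subclass itself is omitted here.

```python
from contextlib import asynccontextmanager

import aiohttp


@asynccontextmanager
async def lifespan(app):
    # This body runs inside the server's event loop, so the session is
    # created on the same loop that will later use it.
    session = aiohttp.ClientSession(
        timeout=aiohttp.ClientTimeout(total=300, sock_connect=10, sock_read=60),
    )
    app.state.llm_session = session  # hypothetical attribute name
    try:
        yield
    finally:
        # The caller owns the session, so the caller closes it on shutdown.
        await session.close()
```

In our setup this is passed as FastAPI(lifespan=lifespan), and the api_client override reads the session from app.state rather than creating its own.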

What we'd like

Any of the following, in roughly preferred order:

  1. Land #4345 (already open, adds custom_api_client / custom_live_api_client constructor params). This removes the need to subclass for the "I built my own Client" case. Related: #2560, #5027.

  2. Expose per-phase timeouts on HttpOptions for callers who don't want to own a full session. Even just HttpOptions(sock_read_timeout=..., sock_connect_timeout=...) would fix the silent-read ceiling issue (point 2 above) without anyone needing to inject a session.

  3. Optional in-flight transport hooks (an ADK-level callback similar to before_model_callback, but firing on transport transitions: on_connect_start, on_request_end, on_first_byte, on_chunk_received, on_request_exception). This is the diagnostic surface the OpenTelemetry GenAI spec marks as TODO and nothing in the ecosystem provides yet.

  4. At minimum, document the session-injection contract more loudly — loop-affinity, lifecycle ownership (_api_client.py:2168-2174 already skips closing user sessions, but this isn't called out anywhere user-facing), and which default kwargs an api_client override must preserve.
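To illustrate option 2, here is a sketch of what a per-phase timeout surface could look like. PhaseTimeouts and to_client_timeout_kwargs are hypothetical names, not existing HttpOptions fields; the only assumptions carried over from the current behavior are that HttpOptions.timeout is in milliseconds and that aiohttp.ClientTimeout takes seconds.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseTimeouts:
    """Hypothetical per-phase timeouts, in milliseconds like HttpOptions.timeout."""
    total_ms: int | None = None
    sock_connect_ms: int | None = None
    sock_read_ms: int | None = None

    def to_client_timeout_kwargs(self) -> dict[str, float]:
        """Convert to seconds for aiohttp.ClientTimeout, omitting unset phases."""
        phases = {
            "total": self.total_ms,
            "sock_connect": self.sock_connect_ms,
            "sock_read": self.sock_read_ms,
        }
        return {name: ms / 1000.0 for name, ms in phases.items() if ms is not None}
```

The SDK could then build aiohttp.ClientTimeout(**timeouts.to_client_timeout_kwargs()) per request, so callers get a silent-read ceiling without owning a session.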

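Related upstream issues

  • aio-libs/aiohttp#11740: ClientTimeout.total alone does not guarantee the deadline fires on a silent read
  • googleapis/python-genai#1899: HttpOptions(async_client_args={'timeout': ...}) raises TypeError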

Environment

  • google-adk: 1.32.0
  • google-genai: 1.75.0
  • aiohttp: 3.12.13
  • Python: 3.12.13
  • Runtime: FastAPI on AWS EKS, Vertex AI (Gemini-3 family)

Metadata

Labels

core [Component]: This issue is related to the core interface and implementation