Description
The ai-proxy-multi plugin's health check mechanism has a structural bug: the construct_upstream function is called from both request context and timer context (healthcheck_manager), but only works correctly in request context.
In timer context, construct_upstream always returns nil because the _dns_value runtime field does not exist on instance configs read from etcd. This causes a Lua runtime error on the subsequent line (upstream.resource_key), breaking health check management.
There are two crash points — both lack nil checks on the construct_upstream return value:
- timer_create_checker (line 180): Crashes when creating new checkers if _dns_value is missing. In practice, the first creation usually succeeds because the request thread sets _dns_value on the shared config cache object before the timer fires.
- timer_working_pool_check (line 242 on 3.15.0 / line 238 on 3.16.0): Crashes when validating existing checkers after a config update. When the route config is updated via the Admin API, the etcd watch triggers a config cache refresh with a new in-memory object that does NOT have _dns_value. The timer calls construct_upstream with this fresh config object and crashes. This is the primary reproducible crash point.
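The failure mode can be reproduced outside APISIX with a minimal, self-contained Lua sketch (illustrative names only, not the actual plugin code): a constructor that returns nil plus a caller without a nil check produces exactly the error class seen in the logs, and a pcall wrapper keeps the timer alive to fail again on the next tick.

```lua
-- Minimal illustration of the bug pattern (not APISIX source).
local function construct_upstream(instance)
  local node = instance._dns_value
  if not node then
    -- In timer context _dns_value never exists, so this branch is taken.
    return nil, "failed to resolve endpoint for instance: " .. instance.name
  end
  return { host = node.host, port = node.port }
end

local function timer_tick(instance)
  local upstream = construct_upstream(instance)  -- nil in timer context
  upstream.resource_key = "/routes/1"            -- attempt to index a nil value
end

-- pcall catches the error, so the process survives, but nothing removes the
-- stuck resource -- the next tick re-runs the same failing code: a crash loop.
local ok, err = pcall(timer_tick, { name = "healthy-backend" })
print(ok, err)  -- false, plus an error message mentioning a nil value
```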
Reproduction Steps
Environment: Docker (tested on APISIX 3.15.0 and 3.16.0), etcd 3.5
Step 1: Create a route with ai-proxy-multi and health checks enabled:
curl http://127.0.0.1:9180/apisix/admin/routes/1 \
  -H 'X-API-KEY: edd1c9f034335f136f87ad84b625c8f1' \
  -X PUT -d '
{
  "uri": "/ai/*",
  "plugins": {
    "ai-proxy-multi": {
      "instances": [
        {
          "name": "healthy-backend",
          "provider": "openai",
          "weight": 1,
          "auth": {"header": {"Authorization": "Bearer test-key"}},
          "endpoint": "http://<healthy-host>:80/v1",
          "checks": {
            "active": {
              "type": "http",
              "http_path": "/",
              "healthy": {"interval": 1, "successes": 1},
              "unhealthy": {"interval": 1, "http_failures": 1}
            }
          }
        },
        {
          "name": "unhealthy-backend",
          "provider": "openai",
          "weight": 1,
          "auth": {"header": {"Authorization": "Bearer test-key"}},
          "endpoint": "http://<unhealthy-host>:80/v1",
          "checks": {
            "active": {
              "type": "http",
              "http_path": "/",
              "healthy": {"interval": 1, "successes": 1},
              "unhealthy": {"interval": 1, "http_failures": 1}
            }
          }
        }
      ]
    }
  }
}'
Step 2: Send a request to trigger health checker creation:
curl http://127.0.0.1:9080/ai/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"gpt-4","messages":[{"role":"user","content":"hello"}]}'
Wait ~2 seconds. Health checkers are created successfully (timer reads the same config object that the request thread populated with _dns_value).
Step 3: Update the route config to trigger a config cache refresh:
curl http://127.0.0.1:9180/apisix/admin/routes/1 \
-H 'X-API-KEY: edd1c9f034335f136f87ad84b625c8f1' \
-X PATCH -d '{"desc": "trigger config refresh"}'
Step 4: Observe the error logs — timer_working_pool_check crashes every second:
On APISIX 3.15.0:
[error] healthcheck_manager.lua:292: failed to run timer_working_pool_check:
healthcheck_manager.lua:242: attempt to index local 'upstream' (a nil value), context: ngx.timer
On APISIX 3.16.0:
[error] healthcheck_manager.lua:288: failed to run timer_working_pool_check:
healthcheck_manager.lua:238: attempt to index local 'upstream' (a nil value), context: ngx.timer
The crash repeats every second indefinitely. The existing checkers in working_pool can never be properly version-checked or cleaned up.
Current Behavior
- When a request hits pick_target, resolve_endpoint is called, which sets instance._dns_value (a runtime-only field on the in-memory config object).
- fetch_checker is called, which returns nil (checker not yet created) and adds the resource to waiting_pool.
- The timer_create_checker timer fires, reads config via resource.fetch_latest_conf (returning the same in-memory config object), extracts the instance config via jsonpath, and calls plugin.construct_upstream(instance_config). construct_upstream finds instance._dns_value (set by the request in step 1 on the shared object), so initial creation succeeds.
- The checker is moved to working_pool.
- A config update occurs (route updated via Admin API): the etcd watch triggers a config cache refresh, creating a new in-memory config object without _dns_value.
- timer_working_pool_check runs and calls construct_upstream with the new config object; _dns_value does not exist, so it returns nil.
- The next line, upstream.resource_key = resource_path, crashes with "attempt to index local 'upstream' (a nil value)". The error is caught by pcall.
- On every subsequent timer tick (every second), the same crash repeats: the timer processes the stuck resource, fails at the same line, and the error is logged. This creates an infinite crash loop.
Net effect: After any config update to a route with ai-proxy-multi health checks, the health check management timer permanently breaks. Checkers in working_pool can never be version-validated or properly cleaned up. The error log fills up with repeated failed to run timer_working_pool_check messages every second.
Expected Behavior
construct_upstream should be able to compute the upstream node info from the instance static configuration (endpoint URL or provider defaults) without relying on _dns_value, so that health check management works correctly in timer context even after config updates.
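The expected behavior can be sketched as a pure helper that derives the node from the static endpoint URL alone. This is a hypothetical illustration, not the actual APISIX API: calculate_dns_node and its return shape are assumptions, and a real fix would additionally handle provider default hosts/ports and DNS resolution rather than just URL parsing.

```lua
-- Hypothetical pure helper: derive {host, port, scheme} from the instance's
-- static endpoint without reading _dns_value or mutating the input.
local function calculate_dns_node(instance_conf)
  local endpoint = instance_conf.endpoint
  if not endpoint then
    -- A real implementation would fall back to provider defaults here.
    return nil, "no static endpoint configured"
  end
  local scheme, host, port = endpoint:match("^(https?)://([^:/]+):?(%d*)")
  if not host then
    return nil, "malformed endpoint: " .. endpoint
  end
  port = tonumber(port) or (scheme == "https" and 443 or 80)
  return { host = host, port = port, scheme = scheme }
end

-- e.g. calculate_dns_node({ endpoint = "http://backend:80/v1" })
-- yields host "backend", port 80, scheme "http"
```

Because this is a pure function of the etcd config, it works identically in request context and timer context, which is the property the health check timers need.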
Code References
construct_upstream requiring _dns_value (ai-proxy-multi.lua#L431-L439):
function _M.construct_upstream(instance)
local upstream = {}
local node = instance._dns_value
if not node then
return nil, "failed to resolve endpoint for instance: " .. instance.name
end
_dns_value is only set in request context by resolve_endpoint (ai-proxy-multi.lua#L218):
instance_conf._dns_value = new_node
Crash point 1: timer_create_checker (healthcheck_manager.lua#L179-L180):
upstream = plugin.construct_upstream(upstream_constructor_config) -- returns nil
upstream.resource_key = resource_path -- CRASH: attempt to index a nil value
Crash point 2: timer_working_pool_check (healthcheck_manager.lua#L241-L242):
upstream = plugin.construct_upstream(upstream_constructor_config) -- returns nil after config refresh
upstream.resource_key = resource_path -- CRASH: attempt to index a nil value
Suggested Fix Direction
Two changes are needed:
- ai-proxy-multi.lua: Add a fallback in construct_upstream that computes the node from static config (endpoint URL or provider default host/port) when _dns_value is not available. The existing resolve_endpoint function already contains this logic — it can be extracted into a pure function like calculate_dns_node(instance_conf) that returns {host, port, scheme} without modifying the input. Important: Any fix must preserve the ai_provider.get_node() interface used by providers like vertex-ai that compute host dynamically (e.g., based on region).
- healthcheck_manager.lua: Add a nil check after construct_upstream returns, before accessing upstream.resource_key. This prevents the crash and allows graceful handling when a plugin's construct_upstream fails.
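The healthcheck_manager.lua change could look roughly like the sketch below. The surrounding variables (plugin, upstream_constructor_config, resource_path) are taken from the crash sites quoted above; the exact recovery action (skip, remove from pool, or retry later) is a design choice for the maintainers, and core.log is APISIX's logging module.

```lua
-- Sketch only: nil check at both crash sites before indexing the result.
local upstream, err = plugin.construct_upstream(upstream_constructor_config)
if not upstream then
    core.log.error("failed to construct upstream for ", resource_path,
                   ": ", err or "unknown error")
    -- Skip this resource instead of crashing the whole timer; whether to
    -- also evict it from waiting_pool/working_pool is up to the fix.
    return
end
upstream.resource_key = resource_path
```

With this guard in place, a nil return degrades to an error log for one resource rather than an unhandled exception that aborts the entire timer pass.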
Environment
- Bug confirmed on: APISIX 3.15.0 and 3.16.0 (latest release)
- Affects all deployment modes where ai-proxy-multi is used with health checks enabled
- Reproduction confirmed with Docker (apache/apisix:3.15.0-debian, apache/apisix:3.16.0-debian + bitnamilegacy/etcd:3.5)
Context
This issue was identified during analysis of PR #12968, which attempts to fix this problem but has additional issues (removes get_node support, couples to resty.healthcheck SHM internals, includes unrelated changes). This issue is filed to track the core bug independently.