bug: ai-proxy-multi health checker creation always fails in timer context due to missing _dns_value #13101

@Baoyuantop

Description


The ai-proxy-multi plugin's health check mechanism has a structural bug: the construct_upstream function is called from both request context and timer context (healthcheck_manager), but only works correctly in request context.

In timer context, construct_upstream always returns nil because the _dns_value runtime field does not exist on instance configs read from etcd. This causes a Lua runtime error on the subsequent line (upstream.resource_key), breaking health check management.

There are two crash points — both lack nil checks on the construct_upstream return value:

  1. timer_create_checker (line 180): Crashes when creating new checkers if _dns_value is missing. In practice, the first creation usually succeeds because the request thread sets _dns_value on the shared config cache object before the timer fires.

  2. timer_working_pool_check (line 242 on 3.15.0 / line 238 on 3.16.0): Crashes when validating existing checkers after a config update. When the route config is updated via Admin API, the etcd watch triggers a config cache refresh with a new in-memory object that does NOT have _dns_value. The timer calls construct_upstream with this fresh config object and crashes. This is the primary reproducible crash point.

Reproduction Steps

Environment: Docker (tested on APISIX 3.15.0 and 3.16.0), etcd 3.5

Step 1: Create a route with ai-proxy-multi and health checks enabled:

curl http://127.0.0.1:9180/apisix/admin/routes/1 \
  -H 'X-API-KEY: edd1c9f034335f136f87ad84b625c8f1' \
  -X PUT -d '
{
  "uri": "/ai/*",
  "plugins": {
    "ai-proxy-multi": {
      "instances": [
        {
          "name": "healthy-backend",
          "provider": "openai",
          "weight": 1,
          "auth": {"header": {"Authorization": "Bearer test-key"}},
          "endpoint": "http://<healthy-host>:80/v1",
          "checks": {
            "active": {
              "type": "http",
              "http_path": "/",
              "healthy": {"interval": 1, "successes": 1},
              "unhealthy": {"interval": 1, "http_failures": 1}
            }
          }
        },
        {
          "name": "unhealthy-backend",
          "provider": "openai",
          "weight": 1,
          "auth": {"header": {"Authorization": "Bearer test-key"}},
          "endpoint": "http://<unhealthy-host>:80/v1",
          "checks": {
            "active": {
              "type": "http",
              "http_path": "/",
              "healthy": {"interval": 1, "successes": 1},
              "unhealthy": {"interval": 1, "http_failures": 1}
            }
          }
        }
      ]
    }
  }
}'

Step 2: Send a request to trigger health checker creation:

curl http://127.0.0.1:9080/ai/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"gpt-4","messages":[{"role":"user","content":"hello"}]}'

Wait ~2 seconds. Health checkers are created successfully (timer reads the same config object that the request thread populated with _dns_value).

Step 3: Update the route config to trigger a config cache refresh:

curl http://127.0.0.1:9180/apisix/admin/routes/1 \
  -H 'X-API-KEY: edd1c9f034335f136f87ad84b625c8f1' \
  -X PATCH -d '{"desc": "trigger config refresh"}'

Step 4: Observe the error logs — timer_working_pool_check crashes every second:

On APISIX 3.15.0:

[error] healthcheck_manager.lua:292: failed to run timer_working_pool_check:
  healthcheck_manager.lua:242: attempt to index local 'upstream' (a nil value), context: ngx.timer

On APISIX 3.16.0:

[error] healthcheck_manager.lua:288: failed to run timer_working_pool_check:
  healthcheck_manager.lua:238: attempt to index local 'upstream' (a nil value), context: ngx.timer

The crash repeats every second indefinitely. The existing checkers in working_pool can never be properly version-checked or cleaned up.

Current Behavior

  1. When a request hits pick_target, resolve_endpoint is called which sets instance._dns_value (a runtime-only field on the in-memory config object).
  2. fetch_checker is called, which returns nil (checker not yet created) and adds the resource to waiting_pool.
  3. The timer_create_checker timer fires, reads config via resource.fetch_latest_conf (returns the same in-memory config object), extracts the instance config via jsonpath, and calls plugin.construct_upstream(instance_config).
  4. construct_upstream finds instance._dns_value (set by the request in step 1 on the shared object) — initial creation succeeds.
  5. Checker is moved to working_pool.
  6. Config update occurs (route updated via Admin API): etcd watch triggers config cache refresh, creating a new in-memory config object without _dns_value.
  7. timer_working_pool_check runs and calls construct_upstream with the new config object; because _dns_value does not exist on it, construct_upstream returns nil.
  8. The next line upstream.resource_key = resource_path crashes: attempt to index local 'upstream' (a nil value). The error is caught by pcall.
  9. On every subsequent timer tick (every second), the same crash repeats: the timer processes the stuck resource, crashes at the same line, and the error is logged. This creates an infinite crash loop.
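
The failure mode in steps 7–8 can be reproduced in isolation with a few lines of standalone Lua. This is a simplified stand-in for the plugin code, not the actual APISIX source:

```lua
-- Simplified stand-in for ai-proxy-multi's construct_upstream:
-- it returns nil whenever the runtime-only _dns_value field is absent.
local function construct_upstream(instance)
    local node = instance._dns_value
    if not node then
        return nil, "failed to resolve endpoint for instance: " .. instance.name
    end
    return { nodes = { node } }
end

-- A freshly refreshed config object read back from etcd has no _dns_value.
local refreshed_conf = { name = "healthy-backend" }

local upstream = construct_upstream(refreshed_conf)
-- The next line raises "attempt to index local 'upstream' (a nil value)",
-- which is exactly the error the timer's pcall catches every second.
upstream.resource_key = "/apisix/routes/1"
```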

Net effect: After any config update to a route with ai-proxy-multi health checks, the health check management timer permanently breaks. Checkers in working_pool can never be version-validated or properly cleaned up. The error log fills up with repeated failed to run timer_working_pool_check messages every second.

Expected Behavior

construct_upstream should be able to compute the upstream node info from the instance static configuration (endpoint URL or provider defaults) without relying on _dns_value, so that health check management works correctly in timer context even after config updates.

Code References

construct_upstream requiring _dns_value (ai-proxy-multi.lua#L431-L439):

function _M.construct_upstream(instance)
    local upstream = {}
    local node = instance._dns_value
    if not node then
        return nil, "failed to resolve endpoint for instance: " .. instance.name
    end

_dns_value is only set in request context by resolve_endpoint (ai-proxy-multi.lua#L218):

instance_conf._dns_value = new_node

Crash point 1: timer_create_checker (healthcheck_manager.lua#L179-L180):

upstream = plugin.construct_upstream(upstream_constructor_config)  -- returns nil
upstream.resource_key = resource_path  -- CRASH: attempt to index a nil value

Crash point 2: timer_working_pool_check (healthcheck_manager.lua#L241-L242):

upstream = plugin.construct_upstream(upstream_constructor_config)  -- returns nil after config refresh
upstream.resource_key = resource_path  -- CRASH: attempt to index a nil value
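
Both call sites could tolerate a nil return with a small guard. A sketch (not the exact surrounding healthcheck_manager.lua code; it assumes core and core.log are available as in other APISIX modules):

```lua
local upstream, err = plugin.construct_upstream(upstream_constructor_config)
if not upstream then
    -- Log and skip this resource rather than crashing the whole timer tick.
    core.log.error("failed to construct upstream for ", resource_path, ": ", err)
    return
end
upstream.resource_key = resource_path
```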

Suggested Fix Direction

Two changes are needed:

  1. ai-proxy-multi.lua: Add a fallback in construct_upstream that computes the node from static config (endpoint URL or provider default host/port) when _dns_value is not available. The existing resolve_endpoint function already contains this logic — it can be extracted into a pure function like calculate_dns_node(instance_conf) that returns {host, port, scheme} without modifying the input. Important: Any fix must preserve the ai_provider.get_node() interface used by providers like vertex-ai that compute host dynamically (e.g., based on region).

  2. healthcheck_manager.lua: Add a nil check after construct_upstream returns, before accessing upstream.resource_key. This prevents the crash and allows graceful handling when a plugin construct_upstream fails.
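
As a sketch of change (1), the endpoint-based fallback could be a pure function along these lines. Names, the field layout (instance_conf.endpoint, as in the reproduction config above), and the default ports are illustrative; the real logic lives in resolve_endpoint and any fix must also honor ai_provider.get_node() for providers that compute the host dynamically:

```lua
-- Illustrative sketch (not APISIX source): compute the upstream node from the
-- instance's static endpoint URL instead of the runtime-only _dns_value, and
-- return it without mutating the input config.
local function calculate_dns_node(instance_conf)
    local endpoint = instance_conf.endpoint
    if not endpoint then
        return nil, "no endpoint configured for instance: " .. instance_conf.name
    end
    -- crude URL split: scheme://host[:port][/path]
    local scheme, host, port = endpoint:match("^(https?)://([^:/]+):?(%d*)")
    if not scheme then
        return nil, "invalid endpoint: " .. endpoint
    end
    if port == "" then
        port = (scheme == "https") and 443 or 80
    else
        port = tonumber(port)
    end
    return { host = host, port = port, scheme = scheme }
end
```

construct_upstream could then fall back to this function when instance._dns_value is absent, leaving the request-path behavior (DNS resolution via resolve_endpoint) unchanged.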

Environment

  • Bug confirmed on: APISIX 3.15.0 and 3.16.0 (latest release)
  • Affects all deployment modes where ai-proxy-multi is used with health checks enabled
  • Reproduction confirmed with Docker (apache/apisix:3.15.0-debian, apache/apisix:3.16.0-debian + bitnamilegacy/etcd:3.5)

Context

This issue was identified during analysis of PR #12968, which attempts to fix this problem but has additional issues (removes get_node support, couples to resty.healthcheck SHM internals, includes unrelated changes). This issue is filed to track the core bug independently.

Labels: bug (Something isn't working), plugin