bug: ai-proxy-multi health checker creation always fails in timer context due to missing _dns_value #13101

@Baoyuantop

Description


The ai-proxy-multi plugin's health check mechanism has a structural bug: the construct_upstream function is called from both request context and timer context (healthcheck_manager), but only works correctly in request context.

In timer context, construct_upstream always returns nil because the _dns_value runtime field does not exist on instance configs read from etcd. This causes a Lua runtime error on the subsequent line (upstream.resource_key), breaking health check management.

There are two crash points — both lack nil checks on the construct_upstream return value:

  1. timer_create_checker (line 180): Crashes when creating new checkers if _dns_value is missing. In practice, the first creation usually succeeds because the request thread sets _dns_value on the shared config cache object before the timer fires.

  2. timer_working_pool_check (line 242 on 3.15.0 / line 238 on 3.16.0): Crashes when validating existing checkers after a config update. When the route config is updated via Admin API, the etcd watch triggers a config cache refresh with a new in-memory object that does NOT have _dns_value. The timer calls construct_upstream with this fresh config object and crashes. This is the primary reproducible crash point.

Reproduction Steps

Environment: Docker (tested on APISIX 3.15.0 and 3.16.0), etcd 3.5

Step 1: Create a route with ai-proxy-multi and health checks enabled:

curl http://127.0.0.1:9180/apisix/admin/routes/1 \
  -H 'X-API-KEY: edd1c9f034335f136f87ad84b625c8f1' \
  -X PUT -d '
{
  "uri": "/ai/*",
  "plugins": {
    "ai-proxy-multi": {
      "instances": [
        {
          "name": "healthy-backend",
          "provider": "openai",
          "weight": 1,
          "auth": {"header": {"Authorization": "Bearer test-key"}},
          "endpoint": "http://<healthy-host>:80/v1",
          "checks": {
            "active": {
              "type": "http",
              "http_path": "/",
              "healthy": {"interval": 1, "successes": 1},
              "unhealthy": {"interval": 1, "http_failures": 1}
            }
          }
        },
        {
          "name": "unhealthy-backend",
          "provider": "openai",
          "weight": 1,
          "auth": {"header": {"Authorization": "Bearer test-key"}},
          "endpoint": "http://<unhealthy-host>:80/v1",
          "checks": {
            "active": {
              "type": "http",
              "http_path": "/",
              "healthy": {"interval": 1, "successes": 1},
              "unhealthy": {"interval": 1, "http_failures": 1}
            }
          }
        }
      ]
    }
  }
}'

Step 2: Send a request to trigger health checker creation:

curl http://127.0.0.1:9080/ai/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"gpt-4","messages":[{"role":"user","content":"hello"}]}'

Wait ~2 seconds. Health checkers are created successfully (timer reads the same config object that the request thread populated with _dns_value).

Step 3: Update the route config to trigger a config cache refresh:

curl http://127.0.0.1:9180/apisix/admin/routes/1 \
  -H 'X-API-KEY: edd1c9f034335f136f87ad84b625c8f1' \
  -X PATCH -d '{"desc": "trigger config refresh"}'

Step 4: Observe the error logs — timer_working_pool_check crashes every second:

On APISIX 3.15.0:

[error] healthcheck_manager.lua:292: failed to run timer_working_pool_check:
  healthcheck_manager.lua:242: attempt to index local 'upstream' (a nil value), context: ngx.timer

On APISIX 3.16.0:

[error] healthcheck_manager.lua:288: failed to run timer_working_pool_check:
  healthcheck_manager.lua:238: attempt to index local 'upstream' (a nil value), context: ngx.timer

The crash repeats every second indefinitely. The existing checkers in working_pool can never be properly version-checked or cleaned up.

Current Behavior

  1. When a request hits pick_target, resolve_endpoint is called which sets instance._dns_value (a runtime-only field on the in-memory config object).
  2. fetch_checker is called, which returns nil (checker not yet created) and adds the resource to waiting_pool.
  3. The timer_create_checker timer fires, reads config via resource.fetch_latest_conf (returns the same in-memory config object), extracts the instance config via jsonpath, and calls plugin.construct_upstream(instance_config).
  4. construct_upstream finds instance._dns_value (set by the request in step 1 on the shared object) — initial creation succeeds.
  5. Checker is moved to working_pool.
  6. Config update occurs (route updated via Admin API): etcd watch triggers config cache refresh, creating a new in-memory config object without _dns_value.
  7. timer_working_pool_check runs and calls construct_upstream with the new config object; because _dns_value does not exist on it, construct_upstream returns nil.
  8. The next line upstream.resource_key = resource_path crashes: attempt to index local 'upstream' (a nil value). The error is caught by pcall.
  9. On every subsequent timer tick (every second), the same crash repeats: the timer processes the stuck resource, crashes at the same line, and the error is logged. This creates an infinite crash loop.
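
The failure mode in steps 7–8 can be reproduced in isolation with a few lines of standalone Lua. This is a simplified stand-in for the plugin code, not the actual APISIX source:

```lua
-- Simplified stand-in for ai-proxy-multi's construct_upstream:
-- it returns nil whenever the runtime-only _dns_value field is absent.
local function construct_upstream(instance)
    local node = instance._dns_value
    if not node then
        return nil, "failed to resolve endpoint for instance: " .. instance.name
    end
    return { nodes = { node } }
end

-- A freshly refreshed config object read back from etcd has no _dns_value.
local refreshed_conf = { name = "healthy-backend" }

local upstream = construct_upstream(refreshed_conf)
-- The next line raises "attempt to index local 'upstream' (a nil value)",
-- which is exactly the error the timer's pcall catches every second.
upstream.resource_key = "/apisix/routes/1"
```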

Net effect: After any config update to a route with ai-proxy-multi health checks, the health check management timer permanently breaks. Checkers in working_pool can never be version-validated or properly cleaned up. The error log fills up with repeated failed to run timer_working_pool_check messages every second.

Expected Behavior

construct_upstream should be able to compute the upstream node info from the instance static configuration (endpoint URL or provider defaults) without relying on _dns_value, so that health check management works correctly in timer context even after config updates.

Code References

construct_upstream requiring _dns_value (ai-proxy-multi.lua#L431-L439):

function _M.construct_upstream(instance)
    local upstream = {}
    local node = instance._dns_value
    if not node then
        return nil, "failed to resolve endpoint for instance: " .. instance.name
    end

_dns_value is only set in request context by resolve_endpoint (ai-proxy-multi.lua#L218):

instance_conf._dns_value = new_node

Crash point 1: timer_create_checker (healthcheck_manager.lua#L179-L180):

upstream = plugin.construct_upstream(upstream_constructor_config)  -- returns nil
upstream.resource_key = resource_path  -- CRASH: attempt to index a nil value

Crash point 2: timer_working_pool_check (healthcheck_manager.lua#L241-L242):

upstream = plugin.construct_upstream(upstream_constructor_config)  -- returns nil after config refresh
upstream.resource_key = resource_path  -- CRASH: attempt to index a nil value
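
Both call sites could tolerate a nil return with a small guard. A sketch (not the exact surrounding healthcheck_manager.lua code; it assumes core and core.log are available as in other APISIX modules):

```lua
local upstream, err = plugin.construct_upstream(upstream_constructor_config)
if not upstream then
    -- Log and skip this resource rather than crashing the whole timer tick.
    core.log.error("failed to construct upstream for ", resource_path, ": ", err)
    return
end
upstream.resource_key = resource_path
```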

Suggested Fix Direction

Two changes are needed:

  1. ai-proxy-multi.lua: Add a fallback in construct_upstream that computes the node from static config (endpoint URL or provider default host/port) when _dns_value is not available. The existing resolve_endpoint function already contains this logic — it can be extracted into a pure function like calculate_dns_node(instance_conf) that returns {host, port, scheme} without modifying the input. Important: Any fix must preserve the ai_provider.get_node() interface used by providers like vertex-ai that compute host dynamically (e.g., based on region).

  2. healthcheck_manager.lua: Add a nil check after construct_upstream returns, before accessing upstream.resource_key. This prevents the crash and allows graceful handling when a plugin construct_upstream fails.
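
As a sketch of change (1), the endpoint-based fallback could be a pure function along these lines. Names, the field layout (instance_conf.endpoint, as in the reproduction config above), and the default ports are illustrative; the real logic lives in resolve_endpoint and any fix must also honor ai_provider.get_node() for providers that compute the host dynamically:

```lua
-- Illustrative sketch (not APISIX source): compute the upstream node from the
-- instance's static endpoint URL instead of the runtime-only _dns_value, and
-- return it without mutating the input config.
local function calculate_dns_node(instance_conf)
    local endpoint = instance_conf.endpoint
    if not endpoint then
        return nil, "no endpoint configured for instance: " .. instance_conf.name
    end
    -- crude URL split: scheme://host[:port][/path]
    local scheme, host, port = endpoint:match("^(https?)://([^:/]+):?(%d*)")
    if not scheme then
        return nil, "invalid endpoint: " .. endpoint
    end
    if port == "" then
        port = (scheme == "https") and 443 or 80
    else
        port = tonumber(port)
    end
    return { host = host, port = port, scheme = scheme }
end
```

construct_upstream could then fall back to this function when instance._dns_value is absent, leaving the request-path behavior (DNS resolution via resolve_endpoint) unchanged.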

Environment

  • Bug confirmed on: APISIX 3.15.0 and 3.16.0 (latest release)
  • Affects all deployment modes where ai-proxy-multi is used with health checks enabled
  • Reproduction confirmed with Docker (apache/apisix:3.15.0-debian, apache/apisix:3.16.0-debian + bitnamilegacy/etcd:3.5)

Context

This issue was identified during analysis of PR #12968, which attempts to fix this problem but has additional issues (removes get_node support, couples to resty.healthcheck SHM internals, includes unrelated changes). This issue is filed to track the core bug independently.

Labels: bug (Something isn't working), plugin