[Serve] add debugging metrics to ray serve

## **Autoscaling & Capacity**

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|----------------|---------------------------|-------------|------------------------|
| **Target Replicas** | `ray_serve_deployment_target_replicas` | The target number of replicas the autoscaler wants to reach | Critical for understanding autoscaling lag. "Why aren't we at target?" is unanswerable today. |
| **Autoscaling Decision** | `ray_serve_autoscaling_decision_replicas` | The raw decision from the autoscaling policy before bounds | Debug why autoscaler chose a certain number; identify policy misconfiguration |
| **Total Requests (Autoscaler View)** | `ray_serve_autoscaling_total_requests` | Total requests as seen by the autoscaler | Verify autoscaler's input matches expected load |
| **Replica Autoscaling Metrics Delay** | `ray_serve_autoscaling_replica_metrics_delay_ms` | Time taken for replica metrics to be reported to the controller | Check whether the controller is busy (slow to receive replica metrics) |
| **Handle Autoscaling Metrics Delay** | `ray_serve_autoscaling_handle_metrics_delay_ms` | Time taken for handle metrics to be reported to the controller | Check whether the controller is busy (slow to receive handle metrics) |
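
The difference between the raw autoscaling decision and the target replica count can be sketched as a clamp. This is a minimal illustration, assuming "bounds" refers to a deployment's minimum and maximum replica limits; `clamp_decision` and its parameters are hypothetical names, not the Serve controller's code.

```python
def clamp_decision(raw_decision: int, min_replicas: int, max_replicas: int) -> int:
    """Clamp the policy's raw decision into [min_replicas, max_replicas].

    Hypothetical helper: the raw value corresponds to what
    `ray_serve_autoscaling_decision_replicas` would report, and the
    clamped value to `ray_serve_deployment_target_replicas`.
    """
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    return max(min_replicas, min(max_replicas, raw_decision))


raw = 12  # what the policy asked for
target = clamp_decision(raw, min_replicas=1, max_replicas=8)
print(raw, target)  # 12 8
```

A persistent gap between the two values would suggest that the bounds, rather than the policy, are limiting capacity.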

## **Request Batching**

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|----------------|---------------------------|-------------|------------------------|
| **Batch Wait Time** | `ray_serve_batch_wait_time_ms` | Time requests waited for batch to fill | Debug latency caused by waiting for batches |
| **Batch Queue Length** | `ray_serve_batch_queue_length` | Number of requests waiting in the batch queue | Identify batching bottleneck vs processing bottleneck |
| **Batch Utilization** | `ray_serve_batch_utilization_percent` | `actual_batch_size / max_batch_size * 100` | Tune the `max_batch_size` parameter; low utilization suggests the batch wait timeout is too short |
| **Batches Processed** | `ray_serve_batches_processed_total` | Counter of batches executed | Measure batching throughput separate from request throughput |
| **Batch Execution Time** | `ray_serve_batch_execution_time_ms` | Time taken to execute a batch | Separate batch execution time from batch wait time |
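
The batch utilization formula in the table can be written out as a small function. This is a sketch only; `batch_utilization_percent` is a hypothetical helper name used for illustration.

```python
def batch_utilization_percent(actual_batch_size: int, max_batch_size: int) -> float:
    """Return actual_batch_size / max_batch_size * 100, as in the table."""
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")
    return actual_batch_size / max_batch_size * 100


# A batch of 4 requests executed with max_batch_size=16.
print(batch_utilization_percent(4, 16))  # 25.0
```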


## **Latency Breakdown**

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|----------------|---------------------------|-------------|------------------------|
| **Queue Wait Time** | `ray_serve_queue_wait_time_ms` | Time request spent waiting in queue before assignment | **Critical**: Separate queueing delay from processing delay |
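
To illustrate why separating the two delays matters, the sketch below splits one request's latency into queue wait and processing time. The `RequestTimestamps` type and its fields are assumptions made for this example; they do not describe how Serve records timestamps internally.

```python
from dataclasses import dataclass


@dataclass
class RequestTimestamps:
    """Hypothetical per-request timestamps, in seconds."""
    enqueued_s: float   # request entered the queue
    assigned_s: float   # request was assigned to a replica
    finished_s: float   # replica finished processing


def latency_breakdown_ms(ts: RequestTimestamps) -> tuple[float, float]:
    """Return (queue_wait_ms, processing_ms) for one request."""
    queue_wait_ms = (ts.assigned_s - ts.enqueued_s) * 1000
    processing_ms = (ts.finished_s - ts.assigned_s) * 1000
    return queue_wait_ms, processing_ms


ts = RequestTimestamps(enqueued_s=10.0, assigned_s=10.5, finished_s=10.75)
print(latency_breakdown_ms(ts))  # (500.0, 250.0)
```

In this example two thirds of the end-to-end latency is queueing, which would point at capacity rather than slow request handling.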

## **Replica Health & Lifecycle**

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|----------------|---------------------------|-------------|------------------------|
| **Replica Startup Latency** | `ray_serve_replica_startup_latency_ms` | Time from replica creation to ready state | Debug slow cold starts; model loading time |
| **Replica Initialization Latency** | `ray_serve_replica_initialization_latency_ms` |  |  |
| **Replica Reconfigure Latency** | `ray_serve_replica_reconfigure_latency_ms` | Time for replica to complete reconfigure | Debug slow reconfiguration; model loading time |
| **Health Check Latency** | `ray_serve_health_check_latency_ms` | Duration of health check calls | Identify slow health checks blocking scaling |
| **Health Check Failures** | `ray_serve_health_check_failures_total` | Count of failed health checks | Early warning before replica marked unhealthy |
| **Replica Shutdown Duration** | `ray_serve_replica_shutdown_duration_ms` | Time from shutdown signal to replica fully stopped | Debug slow draining during scale-down or rolling updates |

## **Proxy Health**

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|----------------|---------------------------|-------------|------------------------|
| **Proxy Healthy** | `ray_serve_proxy_healthy` | Total number of healthy proxies in system. Tags: `node_id`, `node_ip_address` | Proxy availability |
| **Proxy Draining State** | `ray_serve_proxy_draining` | Whether proxy is draining (1=draining, 0=not). Tags: `node_id`, `node_ip_address` | Visibility during rolling updates |
| **Routing Stats Delay** | `ray_serve_routing_stats_delay_ms` | Time taken for the routing stats to get from replica to controller | Controller performance |

## **State Timeline**

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|----------------|---------------------------|-------------|------------------------|
| **Deployment Status** | `ray_serve_deployment_status` | Numeric status of deployment (0=DEPLOY_FAILED, 1=UNHEALTHY, 2=UPDATING, 3=UPSCALING, 4=DOWNSCALING, 5=HEALTHY). Tags: `deployment`, `application` | State Timeline visualization; deployment lifecycle debugging |
| **Application Status** | `ray_serve_application_status` | Numeric status of application (0=NOT_STARTED, 1=DEPLOYING, 2=DEPLOY_FAILED, 3=RUNNING, 4=UNHEALTHY, 5=DELETING). Tags: `application` | State Timeline visualization; application lifecycle debugging |
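
The numeric encodings above could be represented as integer enums so that a gauge value maps back to a readable state. This is a sketch using the values listed in the table; the class names `DeploymentStatusValue` and `ApplicationStatusValue` are hypothetical and are not Serve's own types.

```python
from enum import IntEnum


class DeploymentStatusValue(IntEnum):
    """Proposed values for ray_serve_deployment_status (from the table)."""
    DEPLOY_FAILED = 0
    UNHEALTHY = 1
    UPDATING = 2
    UPSCALING = 3
    DOWNSCALING = 4
    HEALTHY = 5


class ApplicationStatusValue(IntEnum):
    """Proposed values for ray_serve_application_status (from the table)."""
    NOT_STARTED = 0
    DEPLOYING = 1
    DEPLOY_FAILED = 2
    RUNNING = 3
    UNHEALTHY = 4
    DELETING = 5


# Encode a status name as a gauge value, and decode a value back to a name.
print(int(DeploymentStatusValue["HEALTHY"]))  # 5
print(ApplicationStatusValue(3).name)         # RUNNING
```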

## **Long Poll**

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|----------------|---------------------------|-------------|------------------------|
| **Long Poll Latency** | `ray_serve_long_poll_latency_ms` | Time for updates to propagate from controller to clients | Debug slow config propagation; impacts autoscaling response time |
| **Long Poll Pending Clients** | `ray_serve_long_poll_pending_clients` | Number of clients waiting for updates per namespace | Identify backpressure in notification system |


[Serve] add debugging metrics to ray serve #59218

Autoscaling & Capacity

Request Batching

Latency Breakdown

Replica Health & Lifecycle

Proxy Health

State Timeline

Long Poll

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Missing Metric	Prometheus Name (Proposed)	Description	Reason/Debugging Value
Target Replicas	`ray_serve_deployment_target_replicas`	The target number of replicas the autoscaler wants to reach	Critical for understanding autoscaling lag. "Why aren't we at target?" is unanswerable today.
Autoscaling Decision	`ray_serve_autoscaling_decision_replicas`	The raw decision from the autoscaling policy before bounds	Debug why autoscaler chose a certain number; identify policy misconfiguration
Total Requests (Autoscaler View)	`ray_serve_autoscaling_total_requests`	Total requests as seen by the autoscaler	Verify autoscaler's input matches expected load
Replica Autoscaling Metrics Delay	`ray_serve_autoscaling_replica_metrics_delay_ms`	Time taken for the replica metrics to be reported to controller	Verify busy controller
Handle Autoscaling Metrics Delay	`ray_serve_autoscaling_handle_metrics_delay_ms`	Time taken for the handle metrics to be reported to controller	Verify busy controller

Missing Metric	Prometheus Name (Proposed)	Description	Reason/Debugging Value
Replica Startup Latency	`ray_serve_replica_startup_latency_ms`	Time from replica creation to ready state	Debug slow cold starts; model loading time
Replica Initialization Latency	`serve_replica_initialization_latency_ms`
Replica Reconfigure Latency	`ray_serve_replica_reconfigure_latency_ms`	Time for replica to complete reconfigure	Debug slow reconfiguration; model loading time
Health Check Latency	`ray_serve_health_check_latency_ms`	Duration of health check calls	Identify slow health checks blocking scaling
Health Check Failures	`ray_serve_health_check_failures_total`	Count of failed health checks	Early warning before replica marked unhealthy
Replica Shutdown Duration	`ray_serve_replica_shutdown_duration_ms`	Time from shutdown signal to replica fully stopped	Debug slow draining during scale-down or rolling updates

Missing Metric	Prometheus Name (Proposed)	Description	Reason/Debugging Value
Proxy Healthy	`ray_serve_proxy_healthy`	Total number of healthy proxies in system. Tags: `node_id`, `node_ip_address`	Proxy availability
Proxy Draining State	`ray_serve_proxy_draining`	Whether proxy is draining (1=draining, 0=not). Tags: `node_id`, `node_ip_address`	Visibility during rolling updates
Routing Stats Delay	`ray_serve_routing_stats_delay_ms`	Time taken for the routing stats to get from replica to controller	Controller performance

Missing Metric	Prometheus Name (Proposed)	Description	Reason/Debugging Value
Deployment Status	`ray_serve_deployment_status`	Numeric status of deployment (0=DEPLOY_FAILED, 1=UNHEALTHY, 2=UPDATING, 3=UPSCALING, 4=DOWNSCALING, 5=HEALTHY). Tags: `deployment`, `application`	State Timeline visualization; deployment lifecycle debugging
Application Status	`ray_serve_application_status`	Numeric status of application (0=NOT_STARTED, 1=DEPLOYING, 2=DEPLOY_FAILED, 3=RUNNING, 4=UNHEALTHY, 5=DELETING). Tags: `application`	State Timeline visualization; application lifecycle debugging

Missing Metric	Prometheus Name (Proposed)	Description	Reason/Debugging Value
Long Poll Latency	`ray_serve_long_poll_latency_ms`	Time for updates to propagate from controller to clients	Debug slow config propagation; impacts autoscaling response time
Long Poll Pending Clients	`ray_serve_long_poll_pending_clients`	Number of clients waiting for updates per namespace	Identify backpressure in notification system

[Serve] add debugging metrics to ray serve #59218

Description

Autoscaling & Capacity

Request Batching

Latency Breakdown

Replica Health & Lifecycle

Proxy Health

State Timeline

Long Poll

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions