[Serve] add debugging metrics to ray serve #59218

@abrarsheikh

Autoscaling & Capacity

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Target Replicas | ray_serve_deployment_target_replicas | The target number of replicas the autoscaler wants to reach | Critical for understanding autoscaling lag. "Why aren't we at target?" is unanswerable today. |
| Autoscaling Decision | ray_serve_autoscaling_decision_replicas | The raw decision from the autoscaling policy, before bounds are applied | Debug why the autoscaler chose a certain number; identify policy misconfiguration |
| Total Requests (Autoscaler View) | ray_serve_autoscaling_total_requests | Total requests as seen by the autoscaler | Verify that the autoscaler's input matches the expected load |
| Replica Autoscaling Metrics Delay | ray_serve_autoscaling_replica_metrics_delay_ms | Time taken for the replica metrics to be reported to the controller | Detect a busy controller |
| Handle Autoscaling Metrics Delay | ray_serve_autoscaling_handle_metrics_delay_ms | Time taken for the handle metrics to be reported to the controller | Detect a busy controller |
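
To illustrate how the proposed decision and target gauges would relate, here is a minimal sketch. It assumes that "bounds" means clamping the policy output to the deployment's `min_replicas` and `max_replicas`; the `AutoscalingSample` type and `apply_bounds` helper are hypothetical names for illustration, not part of Ray Serve.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoscalingSample:
    """Hypothetical pair of values the two proposed gauges would report."""

    decision_replicas: int  # raw policy output, before bounds
    target_replicas: int    # value after clamping (assumed meaning of "bounds")


def apply_bounds(decision: int, min_replicas: int, max_replicas: int) -> AutoscalingSample:
    """Clamp a raw policy decision to [min_replicas, max_replicas].

    Assumption: the bounds are the deployment's min/max replica settings.
    """
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    target = max(min_replicas, min(decision, max_replicas))
    return AutoscalingSample(decision_replicas=decision, target_replicas=target)


if __name__ == "__main__":
    sample = apply_bounds(decision=12, min_replicas=1, max_replicas=8)
    # A gap between the two values means the policy wanted more than the bounds allow.
    print(sample.decision_replicas, sample.target_replicas)
```

Under this assumption, a decision that sits persistently above the target would point at a `max_replicas` cap rather than at autoscaling lag.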

Request Batching

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Batch Wait Time | ray_serve_batch_wait_time_ms | Time requests waited for batch to fill | Debug latency caused by waiting for batches |
| Batch Queue Length | ray_serve_batch_queue_length | Number of requests waiting in the batch queue | Identify batching bottleneck vs processing bottleneck |
| Batch Utilization | ray_serve_batch_utilization_percent | actual_batch_size / max_batch_size * 100 | Tune max_batch_size parameter; low utilization = batch timeout too aggressive |
| Batches Processed | ray_serve_batches_processed_total | Counter of batches executed | Measure batching throughput separate from request throughput |
| Batch Execution Time | ray_serve_batch_execution_time_ms | | |
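
The Batch Utilization formula above can be sketched as a small helper. This is only an illustration of the arithmetic; the function name and the input validation are assumptions, not Ray Serve code.

```python
def batch_utilization_percent(actual_batch_size: int, max_batch_size: int) -> float:
    """Return actual_batch_size / max_batch_size * 100, as in the proposed metric.

    The range checks are an assumption added for illustration.
    """
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")
    if not 0 <= actual_batch_size <= max_batch_size:
        raise ValueError("actual_batch_size must be in [0, max_batch_size]")
    return actual_batch_size / max_batch_size * 100


if __name__ == "__main__":
    # A batch of 4 requests against max_batch_size=16 is 25% utilized.
    print(batch_utilization_percent(4, 16))
```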

Latency Breakdown

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Queue Wait Time | ray_serve_queue_wait_time_ms | Time request spent waiting in queue before assignment | Critical: Separate queueing delay from processing delay |
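
As a rough sketch of the split this metric is meant to expose, the snippet below separates queueing delay from processing delay using three timestamps per request. The `RequestTimestamps` type and its field names are hypothetical, and where Ray Serve would actually record these timestamps is not specified in this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestTimestamps:
    """Hypothetical per-request timestamps, in seconds."""

    enqueued_s: float  # request entered the queue
    assigned_s: float  # request was assigned to a replica
    finished_s: float  # replica finished processing


def latency_breakdown_ms(ts: RequestTimestamps) -> dict:
    """Split end-to-end latency into queue wait and processing time (ms)."""
    queue_wait_ms = (ts.assigned_s - ts.enqueued_s) * 1000
    processing_ms = (ts.finished_s - ts.assigned_s) * 1000
    return {
        "queue_wait_ms": queue_wait_ms,
        "processing_ms": processing_ms,
        "total_ms": queue_wait_ms + processing_ms,
    }


if __name__ == "__main__":
    ts = RequestTimestamps(enqueued_s=10.0, assigned_s=10.25, finished_s=11.0)
    print(latency_breakdown_ms(ts))
```

With only an end-to-end latency metric, the two components in this example would be indistinguishable; a separate queue wait histogram would make the 250 ms of queueing visible on its own.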

Replica Health & Lifecycle

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Replica Startup Latency | ray_serve_replica_startup_latency_ms | Time from replica creation to ready state | Debug slow cold starts; model loading time |
| Replica Initialization Latency | serve_replica_initialization_latency_ms | | |
| Replica Reconfigure Latency | ray_serve_replica_reconfigure_latency_ms | Time for replica to complete reconfigure | Debug slow reconfiguration; model loading time |
| Health Check Latency | ray_serve_health_check_latency_ms | Duration of health check calls | Identify slow health checks blocking scaling |
| Health Check Failures | ray_serve_health_check_failures_total | Count of failed health checks | Early warning before replica marked unhealthy |
| Replica Shutdown Duration | ray_serve_replica_shutdown_duration_ms | Time from shutdown signal to replica fully stopped | Debug slow draining during scale-down or rolling updates |

Proxy Health

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Proxy Healthy | ray_serve_proxy_healthy | Total number of healthy proxies in the system. Tags: node_id, node_ip_address | Proxy availability |
| Proxy Draining State | ray_serve_proxy_draining | Whether the proxy is draining (1=draining, 0=not). Tags: node_id, node_ip_address | Visibility during rolling updates |
| Routing Stats Delay | ray_serve_routing_stats_delay_ms | Time taken for the routing stats to get from the replica to the controller | Controller performance |

State Timeline

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Deployment Status | ray_serve_deployment_status | Numeric status of deployment (0=DEPLOY_FAILED, 1=UNHEALTHY, 2=UPDATING, 3=UPSCALING, 4=DOWNSCALING, 5=HEALTHY). Tags: deployment, application | State Timeline visualization; deployment lifecycle debugging |
| Application Status | ray_serve_application_status | Numeric status of application (0=NOT_STARTED, 1=DEPLOYING, 2=DEPLOY_FAILED, 3=RUNNING, 4=UNHEALTHY, 5=DELETING). Tags: application | State Timeline visualization; application lifecycle debugging |
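
The numeric encodings proposed above could be captured as integer enums so that gauge values and status names stay in sync. The sketch below copies the values from the table; the class names and the `status_to_gauge_value` helper are hypothetical and do not refer to existing Ray Serve types.

```python
from enum import IntEnum


class DeploymentStatusCode(IntEnum):
    """Proposed numeric encoding for ray_serve_deployment_status."""

    DEPLOY_FAILED = 0
    UNHEALTHY = 1
    UPDATING = 2
    UPSCALING = 3
    DOWNSCALING = 4
    HEALTHY = 5


class ApplicationStatusCode(IntEnum):
    """Proposed numeric encoding for ray_serve_application_status."""

    NOT_STARTED = 0
    DEPLOYING = 1
    DEPLOY_FAILED = 2
    RUNNING = 3
    UNHEALTHY = 4
    DELETING = 5


def status_to_gauge_value(enum_cls, status_name: str) -> int:
    """Map a status name to the integer a gauge would report (hypothetical helper)."""
    return int(enum_cls[status_name])


if __name__ == "__main__":
    print(status_to_gauge_value(DeploymentStatusCode, "UPSCALING"))
    print(ApplicationStatusCode(3).name)
```

Note that the same name can map to different numbers in the two encodings (for example, UNHEALTHY is 1 for deployments and 4 for applications), so a dashboard would need a separate value mapping per metric.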

Long Poll

| Missing Metric | Prometheus Name (Proposed) | Description | Reason/Debugging Value |
|---|---|---|---|
| Long Poll Latency | ray_serve_long_poll_latency_ms | Time for updates to propagate from controller to clients | Debug slow config propagation; impacts autoscaling response time |
| Long Poll Pending Clients | ray_serve_long_poll_pending_clients | Number of clients waiting for updates per namespace | Identify backpressure in notification system |

Labels: observability, serve, usability
