Skip to content

Conversation

@machichima
Copy link
Contributor

@machichima machichima commented Dec 7, 2025

Description

When the Ray process unexpected terminated and node id changed, we cannot know the previous and current node id through event as we did not include node_id in the base event.

This PR add the node_id into the base event showing where the event is being emitted.

Main changes

  • src/ray/protobuf/public/events_base_event.proto
    • Add node id to the base event proto (RayEvent)

For GCS:

  • src/ray/gcs/gcs_server_main.cc
    • add --node_id as cli args
  • src/ray/observability/ and src/ray/gcs/ (some files)
    • Add node_id as arguments and pass to RayEvent

For CoreWorker

  • src/ray/core_worker/
    • Passing the node_id to the RayEvent

Python side

  • python/ray/_private/node.py
    • Passing node_id when starting gcs server

Related issues

Closes #58879

Additional information

Testing process

  1. export env var:
    • RAY_enable_core_worker_ray_event_to_aggregator=1
    • RAY_DASHBOARD_AGGREGATOR_AGENT_EVENTS_EXPORT_ADDR="http://localhost:8000"
  2. ray start --head --system-config='{"enable_ray_event":true}'
  3. Submit simple job ray job submit -- python rayjob.py. E.g.
import ray

@ray.remote
def hello_world():
    return "hello world"

ray.init()
print(ray.get(hello_world.remote()))
  1. Run event listener (script below) to start listening the event export host python event_listener.py
import http.server
import socketserver
import json
import logging

PORT = 8000

class EventReceiver(http.server.SimpleHTTPRequestHandler):
    def do_POST(self):
        content_length = int(self.headers['Content-Length'])
        post_data = self.rfile.read(content_length)
        print(json.loads(post_data.decode('utf-8')))
        self.send_response(200)
        self.send_header('Content-type', 'application/json')
        self.end_headers()
        self.wfile.write(json.dumps({"status": "success", "message": "Event received"}).encode('utf-8'))


if __name__ == "__main__":
    with socketserver.TCPServer(("", PORT), EventReceiver) as httpd:
        print(f"Serving event listener on http://localhost:{PORT}")
        print("Set RAY_DASHBOARD_AGGREGATOR_AGENT_EVENTS_EXPORT_ADDR to this address.")
        httpd.serve_forever()

Will get the event as below:

  • GCS event:
[
   {
      "eventId":"A+yrzknbyDjQALBvaimTPcyZWll9Te3Tw+FEnQ==",
      "sourceType":"GCS",
      "eventType":"DRIVER_JOB_LIFECYCLE_EVENT",
      "timestamp":"2025-12-07T10: 54: 12.621560Z",
      "severity":"INFO",
      "sessionName":"session_2025-12-07_17-33-33_853835_27993",
      "driverJobLifecycleEvent":{
         "jobId":"BAAAAA==",
         "stateTransitions":[
            {
               "state":"FINISHED",
               "timestamp":"2025-12-07T10: 54: 12.621562Z"
            }
         ]
      },
      "nodeId":"k4hj3FDLYStB38nSSRZQRwOjEV32EoAjQe3KPw==",   // <- nodeId set
      "message":""
   }
]
  • CoreWorker event:
[
   {
      "eventId":"TIAp8D4NwN/ne3VhPHQ0QnsBCYSkZmOUWoe6zQ==",
      "sourceType":"CORE_WORKER",
      "eventType":"TASK_DEFINITION_EVENT",
      "timestamp":"2025-12-07T10:54:12.025967Z",
      "severity":"INFO",
      "sessionName":"session_2025-12-07_17-33-33_853835_27993",
      "taskDefinitionEvent":{
         "taskId":"yoDzqOi6LlD///////////////8EAAAA",
         "taskFunc":{
            "pythonFunctionDescriptor":{
               "moduleName":"rayjob",
               "functionName":"hello_world",
               "functionHash":"a37aacc3b7884c2da4aec32db6151d65",
               "className":""
            }
         },
         "taskName":"hello_world",
         "requiredResources":{
            "CPU":1.0
         },
         "jobId":"BAAAAA==",
         "parentTaskId":"//////////////////////////8EAAAA",
         "placementGroupId":"////////////////////////",
         "serializedRuntimeEnv":"{}",
         "taskAttempt":0,
         "taskType":"NORMAL_TASK",
         "language":"PYTHON",
         "refIds":{
            
         }
      },
      "nodeId":"k4hj3FDLYStB38nSSRZQRwOjEV32EoAjQe3KPw==",   // <- nodeId set here
      "message":""
   }
]

@machichima machichima requested review from a team, edoakes and jjyao as code owners December 7, 2025 10:13
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds the node_id to the base event, which is a valuable addition for observability, especially in scenarios involving unexpected process termination. The changes are extensive, plumbing the node_id through various layers of the system, from Python service startup scripts to the core C++ components and protobuf definitions. The implementation appears solid and consistent. I've made a few minor suggestions to improve code robustness by using const for member variables that are initialized once and never modified. Overall, this is a well-executed and important change.

ActorLifecycleEvent actor_lifecycle_event = 17;

// Unique identifier of the node
bytes node_id = 18;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Proto file modified - review RPC fault tolerance guide (Bugbot Rules)

⚠️ This PR modifies one or more .proto files.
Please review the RPC fault-tolerance & idempotency standards guide here:
https://github.com/ray-project/ray/tree/master/doc/source/ray-core/internals/rpc-fault-tolerance.rst

Fix in Cursor Fix in Web

@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling community-contribution Contributed by the community labels Dec 7, 2025
@machichima
Copy link
Contributor Author

For most of the test changes, I just pass in the node_id to the Event class constructor. Added the assertion for checking if node_id is set correctly in src/ray/gcs/tests/gcs_node_manager_test.cc

@machichima machichima requested a review from a team as a code owner December 8, 2025 11:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community core Issues that should be addressed in Ray Core observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Core] Adding the node id to the base event

1 participant