-
Notifications
You must be signed in to change notification settings - Fork 7k
[Core] Adding the node id to the base event #59242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[Core] Adding the node id to the base event #59242
Conversation
Signed-off-by: machichima <[email protected]>
Signed-off-by: machichima <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request adds the node_id to the base event, which is a valuable addition for observability, especially in scenarios involving unexpected process termination. The changes are extensive, plumbing the node_id through various layers of the system, from Python service startup scripts to the core C++ components and protobuf definitions. The implementation appears solid and consistent. I've made a few minor suggestions to improve code robustness by using const for member variables that are initialized once and never modified. Overall, this is a well-executed and important change.
| ActorLifecycleEvent actor_lifecycle_event = 17; | ||
|
|
||
| // Unique identifier of the node | ||
| bytes node_id = 18; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Proto file modified - review RPC fault tolerance guide (Bugbot Rules)
.proto files.
Please review the RPC fault-tolerance & idempotency standards guide here:
https://github.com/ray-project/ray/tree/master/doc/source/ray-core/internals/rpc-fault-tolerance.rst
Signed-off-by: machichima <[email protected]>
Signed-off-by: machichima <[email protected]>
Signed-off-by: machichima <[email protected]>
Signed-off-by: machichima <[email protected]>
Signed-off-by: machichima <[email protected]>
Signed-off-by: machichima <[email protected]>
|
For most of the test changes, I just pass in the |
Signed-off-by: machichima <[email protected]>
Signed-off-by: machichima <[email protected]>
Description
When the Ray process unexpected terminated and node id changed, we cannot know the previous and current node id through event as we did not include
node_idin the base event.This PR add the
node_idinto the base event showing where the event is being emitted.Main changes
src/ray/protobuf/public/events_base_event.protoRayEvent)For GCS:
src/ray/gcs/gcs_server_main.cc--node_idas cli argssrc/ray/observability/andsrc/ray/gcs/(some files)node_idas arguments and pass toRayEventFor CoreWorker
src/ray/core_worker/node_idto theRayEventPython side
python/ray/_private/node.pynode_idwhen starting gcs serverRelated issues
Closes #58879
Additional information
Testing process
RAY_enable_core_worker_ray_event_to_aggregator=1RAY_DASHBOARD_AGGREGATOR_AGENT_EVENTS_EXPORT_ADDR="http://localhost:8000"ray start --head --system-config='{"enable_ray_event":true}'ray job submit -- python rayjob.py. E.g.python event_listener.pyWill get the event as below:
[ { "eventId":"A+yrzknbyDjQALBvaimTPcyZWll9Te3Tw+FEnQ==", "sourceType":"GCS", "eventType":"DRIVER_JOB_LIFECYCLE_EVENT", "timestamp":"2025-12-07T10: 54: 12.621560Z", "severity":"INFO", "sessionName":"session_2025-12-07_17-33-33_853835_27993", "driverJobLifecycleEvent":{ "jobId":"BAAAAA==", "stateTransitions":[ { "state":"FINISHED", "timestamp":"2025-12-07T10: 54: 12.621562Z" } ] }, "nodeId":"k4hj3FDLYStB38nSSRZQRwOjEV32EoAjQe3KPw==", // <- nodeId set "message":"" } ][ { "eventId":"TIAp8D4NwN/ne3VhPHQ0QnsBCYSkZmOUWoe6zQ==", "sourceType":"CORE_WORKER", "eventType":"TASK_DEFINITION_EVENT", "timestamp":"2025-12-07T10:54:12.025967Z", "severity":"INFO", "sessionName":"session_2025-12-07_17-33-33_853835_27993", "taskDefinitionEvent":{ "taskId":"yoDzqOi6LlD///////////////8EAAAA", "taskFunc":{ "pythonFunctionDescriptor":{ "moduleName":"rayjob", "functionName":"hello_world", "functionHash":"a37aacc3b7884c2da4aec32db6151d65", "className":"" } }, "taskName":"hello_world", "requiredResources":{ "CPU":1.0 }, "jobId":"BAAAAA==", "parentTaskId":"//////////////////////////8EAAAA", "placementGroupId":"////////////////////////", "serializedRuntimeEnv":"{}", "taskAttempt":0, "taskType":"NORMAL_TASK", "language":"PYTHON", "refIds":{ } }, "nodeId":"k4hj3FDLYStB38nSSRZQRwOjEV32EoAjQe3KPw==", // <- nodeId set here "message":"" } ]