Commit 0617445
[Fix][Core] Close unused pipe file descriptor of child processes of Raylet (#52700)
- We rely on a pipe-based mechanism to monitor the parent process
(raylet)'s health:
https://github.com/ray-project/ray/blob/07a00f20bb830c61ccf8fb2fddcfae8fa4c418dd/python/ray/_private/process_watcher.py#L127-L149
- Ideally, when raylet dies, the OS closes all of its file descriptors.
With no process holding the write end of the pipe, the `readline()`
in
https://github.com/ray-project/ray/blob/07a00f20bb830c61ccf8fb2fddcfae8fa4c418dd/python/ray/_private/process_watcher.py#L138
returns an empty string, indicating that the raylet has exited.
- However, this only works if *no other process* holds the pipe's write
end.
- Raylet launches several child processes via `fork + exec`. For
example, `DashboardAgent` and `RuntimeEnvAgent`:
https://github.com/ray-project/ray/blob/c2a6de384217e5db6f055d0607ca3e531deed56c/src/ray/raylet/node_manager.cc#L392-L393
- These child processes are spawned using `ProcessFD::spawnvpe`:
https://github.com/ray-project/ray/blob/a80f02f2c1c53d79eb94ecafaf59ae51f87d0734/src/ray/util/process.cc#L130
This function sets up a pipe and closes unused ends in both parent and
child:
https://github.com/ray-project/ray/blob/6ca0da17300e6087ae6bc9cb9faaa7e9a071b33e/src/ray/util/process.cc#L178-L226
- Raylet (parent process) creates `DashboardAgent` first, holding the
write end (`parent_lifetime_pipe[1]`, say fd 40), and closes the read
end.
- The dashboard agent (child process 1) holds the read end
(`parent_lifetime_pipe[0]`, say fd 39), and closes the write end.
- When raylet later forks `RuntimeEnvAgent` (child process 2), all open
fds (including fd 40) are inherited by the new child process.
- As a result, even if raylet dies, `RuntimeEnvAgent` still holds fd 40
(write end), preventing `DashboardAgent` from detecting raylet's death.
- Therefore, `readline()` in the dashboard agent hangs forever instead
of returning an empty string.
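
The failure mode above can be reproduced in a single process. This is a minimal sketch, not Ray code: it assumes a POSIX system, uses `os.dup` to stand in for the copy of the write end that `RuntimeEnvAgent` inherits, and the helper names `eof_visible` and `simulate` are hypothetical.

```python
import os
import select


def eof_visible(read_fd, timeout=0.1):
    """Return True if a read on read_fd would report EOF right now."""
    ready, _, _ = select.select([read_fd], [], [], timeout)
    if not ready:
        # Nothing to read and no EOF: some process still holds a write end.
        return False
    return os.read(read_fd, 1) == b""


def simulate():
    read_end, write_end = os.pipe()
    # Stand-in for the write end inherited by a sibling child process.
    leaked_write_end = os.dup(write_end)

    # "Raylet dies": its own copy of the write end is closed.
    os.close(write_end)
    eof_while_leaked = eof_visible(read_end)

    # The sibling finally releases its inherited copy.
    os.close(leaked_write_end)
    eof_after_release = eof_visible(read_end)

    os.close(read_end)
    return eof_while_leaked, eof_after_release


if __name__ == "__main__":
    print(simulate())  # (False, True)
```

The reader only sees EOF once every copy of the write end is closed, which is why a leaked fd in an unrelated child keeps the watcher blocked.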
**Solution:** Ensure unused pipe write ends are closed on `exec` by
setting `FD_CLOEXEC`.
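
For illustration, setting the flag amounts to a read-modify-write of the descriptor flags. The actual change in this PR is in C++ (`SetFdCloseOnExec` in `process.cc`); the Python function below is a hypothetical analogue that assumes a POSIX system and is not the PR's implementation.

```python
import fcntl
import os


def set_fd_close_on_exec(fd):
    """Hypothetical Python analogue of SetFdCloseOnExec: set FD_CLOEXEC on fd."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


def demo():
    read_end, write_end = os.pipe()
    # Python pipes are non-inheritable by default; clear FD_CLOEXEC first
    # to mimic the default behavior of pipe(2) in C/C++.
    os.set_inheritable(write_end, True)
    inheritable_before = os.get_inheritable(write_end)

    set_fd_close_on_exec(write_end)
    inheritable_after = os.get_inheritable(write_end)

    os.close(read_end)
    os.close(write_end)
    return inheritable_before, inheritable_after


if __name__ == "__main__":
    print(demo())  # (True, False)
```

Reading the current flags before writing them back preserves any other descriptor flags already set on the fd.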
### Changes in this PR
- Revert the changes in #52388,
because they were only a hotfix.
- Move `SetFdCloseOnExec` to `process.h` and `process.cc` and make it a
public util function.
- Call `SetFdCloseOnExec` in the parent process to make sure the pipe fd
is closed on any subsequent `exec`.
### Testing
Tested manually. Before this PR, in about 20% of the runs the agent
process did not exit after the raylet died. After this PR, I ran it 20
times consecutively and the agent process successfully exited in all
runs.
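
A scripted check of the underlying behavior might look like the sketch below. It is not the manual test described above and does not start a raylet; it only verifies, on a POSIX system, that a pipe write end survives `exec` until close-on-exec is set. The names `fd_survives_exec` and `run_check` and the fd floor of 100 are assumptions made for the example.

```python
import fcntl
import os
import subprocess
import sys

# Child program: report whether the fd number given in argv[1] is open.
CHILD_CODE = """
import os, sys
try:
    os.fstat(int(sys.argv[1]))
    print("open")
except OSError:
    print("closed")
"""


def fd_survives_exec(fd):
    """Spawn a child via exec and ask it whether fd is still open there."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, str(fd)],
        close_fds=False,  # let inheritable fds pass through to the child
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "open"


def run_check():
    read_end, write_end = os.pipe()
    # Duplicate onto a high fd number so it cannot collide with fds the
    # child opens itself; F_DUPFD leaves close-on-exec cleared on the copy.
    high_fd = fcntl.fcntl(write_end, fcntl.F_DUPFD, 100)
    os.close(write_end)

    leaked_before_fix = fd_survives_exec(high_fd)
    os.set_inheritable(high_fd, False)  # sets FD_CLOEXEC on POSIX
    leaked_after_fix = fd_survives_exec(high_fd)

    os.close(high_fd)
    os.close(read_end)
    return leaked_before_fix, leaked_after_fix


if __name__ == "__main__":
    print(run_check())  # (True, False)
```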
---------
Signed-off-by: Chi-Sheng Liu <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>

1 parent 2bcea58 commit 0617445
File tree
4 files changed (+36, -39 lines)
- python/ray/dashboard/modules/job/tests
- src/ray
  - common
  - util