Skip to content

Overhaul spawnveg to use clone(2) and fix many related spawning bugs#888

Open
JohnoKing wants to merge 11 commits intoksh93:devfrom
JohnoKing:clone-overhaul
Open

Overhaul spawnveg to use clone(2) and fix many related spawning bugs#888
JohnoKing wants to merge 11 commits intoksh93:devfrom
JohnoKing:clone-overhaul

Conversation

@JohnoKing
Copy link

This commit is a rewrite of #881. The previous pull request didn't address lingering bugs with E2BIG handling and file descriptor leaks, proving unsatisfactory. After support for direct usage of clone was added to spawnveg, the previous fixes for the posix_spawn implementation wound up unused and were scrapped (although those changes do work just fine and can be readded if that's desirable).

Alterations:

  • Add a version of spawnveg that creates a new process with clone(2) outright. This method is generally faster than using posix_spawn(3), and is also more portable (because Linux with musl libc and NetBSD both provide clone, but not posix_spawn_file_actions_addtcsetpgrp_np). This code was initially based on the _real_vfork code previously removed in ef9a5c7 (unlike vfork(2), there is no problem with doing stuff prior to execve for clone with CLONE_VM | CLONE_VFORK).
    • Remove the now unused code implementing tcsetpgrp via posix_spawn_file_actions_addtcsetpgrp_np(). This code was only ever used with glibc 2.35+. (Not only that, but the only other platform that ever considered implementing a similar function was NetBSD1, where it appears to have been abandoned. This is ironic considering ksh no longer uses posix_spawn on NetBSD as of this commit.) This code may be worth revisiting if io_uring_spawn()2 is ever actualized (that project appears to be rather inactive though).
      Demonstration of the (small) performance improvement garnered from using clone directly rather than via posix_spawn:
time ksh-posix-spawn -c 'for i in {1..15000}; do /bin/true; done'

real	0m05.461s
user	0m03.646s
sys	0m01.869s

time ksh-clone -c 'for i in {1..15000}; do /bin/true; done'

real	0m05.373s
user	0m03.629s
sys	0m01.738s
  • Bugfix: Don't set the terminal signals to their defaults in the parent prior to calling spawnveg. This is the primary cause of a lockup in the pty tests which can be reproduced as follows:
$ exec ksh
$ bin/shtests pty 2>/dev/null
^C ^C ^C (SIGINT doesn't work anymore and may segfault)

The correct way to go about dealing with SIGT* is to set those to SIG_DFL in the child process3.

  • Added support for tcsetpgrp to the fork fallback in spawnveg. Some form of this appears to have already been attempted in AT&T olden times, but that old version was broken and needed bugfixes desperately.
    • If the child needs tcsetpgrp, block the terminal signals in the parent process via sigcritical().
  • Now that the fork fallback for spawnveg works correctly in interactive terminals, prefer that to the sh_fork() codepath on operating systems without clone support. Even though the underlying system call is still ultimately fork, the sh_ntfork codepath is faster than the traditional sh_fork codepath. Benchmark:
# Note: This benchmark was done on FreeBSD, though the results can be replicated elsewhere
$ time /tmp/ksh-devbranch -ic "for i in {1..10000}; do $(whence -p true); done"
     real       0m03.302s
     user       0m00.988s
     sys        0m02.320s
$ time /tmp/ksh-newspawnveg -ic "for i in {1..10000}; do $(whence -p true); done"
     real       0m03.160s
     user       0m01.187s
     sys        0m01.977s
  • To that end, split up the spawnveg implementations into spawnveg_fast and spawnveg_slow. Choose the appropriate one when spawnveg is called; this removes the need for the xec.c ifdef hackery.
    • Removed the ntfork_tcpgrp ifdefs from xec.c; spawnveg can handle it by itself now.
  • Bugfix: With the spawnveg_fast and spawnveg_slow innovation, spawnveg now always has support for setsid. It'll fallback to fork() if POSIX_SPAWN_SETSID and clone() aren't available.
  • Bugfix: For the posix_spawn version of spawnveg, the flags should be of the short type pursuant to the POSIX specification.
  • Optimization: Use pipe2 in the fork fallback for spawnveg when it's available to avoid two fcntl syscalls.
  • Updated the spawnveg documentation to reflect the new changes.
  • _spawnveg(): Restore the terminal process group immediately after any relevant failed spawn attempt, rather than only in sh_ntfork().
    • Added a regression test to pty.sh for a crash that occurred because of the erroneous tcsetpgrp placement.
      Reproducer that could cause ksh to prematurely exit (for Linux systems with glibc 2.35+):
$ $(type -p true) /ultralongpathname/{1..1000000}
/usr/bin/ksh: /usr/bin/true: /usr/bin/true: cannot execute [Argument list too long]
$ # Type any character
  • path_spawn(): Do not print an error message and longjmp away upon encountering E2BIG or some other error; let the calling functions take care of that.
  • path_exec(): Added handling for E2BIG.
  • Fixed a bug that caused sh_ntfork to leak file descriptors upon encountering an error (re: 8f848bc). (This bug/regression was encountered after fixing the bad tcsetpgrp usage.) Reproducer:
#!/bin/ksh
ls /proc/$$/fd
redirect 2>/dev/null
touch /tmp/file
for i in {1..20}; do
    command /tmp/file <(echo)
done
ls /proc/$$/fd
  • Added regression tests for the leak that use command and command -x.
    • Fixed an old regression test that was giving false negatives.
  • Clarified the intent of the sigcritical() nesting matches. The ksh2020 devs appeared to have been confused by this line of code, so some additional clarification explaining what it does should be helpful for posterity.
  • path.sh: Added tests for the exit status of commands run when forked with the & operator.

Footnotes

  1. https://blog.netbsd.org/tnf/entry/gsoc_reports_make_system_31

  2. https://lwn.net/Articles/908268/

  3. cf. _sh_fork()

… bugs

Alterations:
- Add a version of spawnveg that clones processes with clone(2)
  outright. This method is generally faster than using posix_spawn,
  and is also more portable (since Linux with musl libc *and* NetBSD
  both provide this function). The code was initially based on the
  _real_vfork spawnveg code previously removed in ef9a5c7 (unlike
  vfork(2), there is no problem with running code prior to execve
  for clone with CLONE_VM|CLONE_VFORK).
  - Remove the now dead code implementing tcsetpgrp via
    posix_spawn_file_actions_addtcsetpgrp_np(). This code was only
    ever used with glibc 2.35+. (Not only that, but the only other
    platform that ever considered implementing such a call was
    NetBSD[^1], where it appears to have been abandoned. This is
    ironic considering ksh no longer uses posix_spawn on NetBSD
    as of this commit.)
    This code may be worth visiting if io_uring_spawn()[^2] is ever
    actualized (that project appears to be quite inactive though).
- Bugfix: Don't set the terminal signals to their defaults
  in the parent prior to calling spawnveg. This is the primary
  cause of a lockup in the pty tests which can be reproduced
  as follows:
    $ exec ksh
    $ bin/shtests pty 2>/dev/null
    ^C ^C ^C (SIGINT doesn't work anymore and may segfault)
  The correct way to go about dealing with SIGT* is to set
  those to SIG_DFL in the child process (cf. _sh_fork()).
- Added support for tcsetpgrp to the fork fallback in spawnveg.
  Some form of this appears to have already been attempted in
  AT&T olden times, but that old version was broken and needed
  bugfixes desperately.
  - If the child needs tcsetpgrp, block the terminal signals
    in the parent process via sigcritical(). The posix_spawn
    version doesn't need this because posix_spawn will block
    signals automatically and therefore doesn't need sigcritical.
- Now that the fork fallback for spawnveg works correctly in
  interactive terminals, prefer that to the sh_fork() codepath
  on operating systems without posix_spawn tcsetpgrp support.
  Even though the underlying system call is still ultimately fork,
  the sh_ntfork() codepath is faster than the traditional sh_fork
  codepath. Benchmark:
     $ time /tmp/ksh-devbranch -ic "for i in {1..10000}; do $(whence -p true); done"
     real       0m03.302s
     user       0m00.988s
     sys        0m02.320s
     $ time /tmp/ksh-newspawnveg -ic "for i in {1..10000}; do $(whence -p true); done"
     real       0m03.160s
     user       0m01.187s
     sys        0m01.977s
- To that end, split up the spawnveg versions into spawnveg_fast
  and spawnveg_slow. Choose the appropriate one when spawnveg is
  called; this removes the need for the xec.c ifdef hackery.
  - Removed the ntfork_tcpgrp ifdefs from xec.c; spawnveg can
    handle it by itself now.
- Bugfix: With the spawnveg_fast and spawnveg_slow innovation,
  spawnveg now always has support for setsid. It'll fallback to
  fork if POSIX_SPAWN_SETSID and clone(2) aren't available.
- Bugfix: For the posix_spawn version of spawnveg, the flags should
  be of the short type pursuant to the POSIX specification.
- Optimization: Use pipe2 in the fork fallback for spawnveg when
  it's available to avoid two fcntl syscalls.
- Updated the spawnveg documentation to reflect the new changes.
- _spawnveg(): Restore the terminal process group immediately
  after any failed spawn attempt, rather than only in sh_ntfork().
  - Added a regression test for a crash that occurred because of
    the erroneous tcsetpgrp placement in sh_ntfork().
- path_spawn(): Do not print an error message and longjmp away
  upon encountering E2BIG or some other error; let the calling functions
  take care of that.
- path_exec(): Added handling for E2BIG.
- Fixed a bug that caused sh_ntfork to leak file descriptors
  upon encountering an error (re: 8f848bc). Reproducer:
     #!/bin/ksh
     ls /proc/$$/fd
     redirect 2>/dev/null
     touch /tmp/file
     for i in {1..20}; do
         command /tmp/file <(echo)
     done
     ls /proc/$$/fd
  - Added regression tests for the leak that use 'command' and
    'command -x'.
  - Fixed an old regression test that was giving false negatives.
- Clarified the intent of the sigcrit nesting matches. The ksh2020
  devs appeared to have been confused by this line, so some
  additional clarification explaining the intent should be helpful
  for posterity.
- path.sh: Added tests for the exit status of commands run when forked
  with the & operator.

[^1]: https://blog.netbsd.org/tnf/entry/gsoc_reports_make_system_31
[^2]: https://lwn.net/Articles/908268/
@McDutchie
Copy link

I've been testing this PR, and AFAICT it works well. I'm a bit out of my depth here, and I still don't have time in my life to study the intricacies of clone(2), etc., in great detail.

One thing that caught my eye is that, in the clone(2) version of spawnveg_fast():

 *   - The child stack is allocated via a function local 'char stack[]'
 *     like in musl, which is faster than using mmap ala glibc.

So, AFAICT, that means the child's C stack is part of the parent's C stack, a grandchild's C stack will be part of the grandparent's C stack, etc. Does that not mean that the child stack will necessarily be a lot smaller, and that we're going to run out of parent stack space pretty quickly if children keep spawning children? Also, would this not fail if a shell function tries to run an external command from a sufficiently high level of recursive shell function calls? Or am I missing something here?

@McDutchie
Copy link

Or am I missing something here?

I must indeed be missing something, because launching 8000 recursive ksh processes via the clone(2) version of spawnveg_fast() is working just fine on my Linux arm64 VM. I don't get it; please explain if you can :)

Hmmm... yet, the clone(2) man page says:

       The stack argument specifies the location of the stack used by the child
       process.  Since the child and calling process may share  memory,  it  is
       not  possible  for the child process to execute in the same stack as the
       calling process.  The calling process must therefore set up memory space
       for the child stack and pass a pointer to this space to clone(). 

Does this not prohibit allocating the child stack within the parent's stack?

It continues:

                                                                         Stacks
       grow downward on all processors that run Linux (except the HP PA proces‐
       sors), so stack usually points to the  topmost  address  of  the  memory
       space  set up for the child stack.

The current PR code (spawnveg.c:162) passes a pointer stack+STACK_SIZE that is one past the topmost address of the stack[] array:

        pid = clone(exec_process, stack+STACK_SIZE, CLONE_VM|CLONE_VFORK|SIGCHLD, &args);

Is that not out of bounds by one byte? Also, does the above mean that this will fail on Linux on HP PA processors?

This is not because it's impossible to use clone on those processors,
but because I don't have one to test with, so better safe than sorry.
@JohnoKing
Copy link
Author

JohnoKing commented Feb 15, 2026

Regarding stack allocation, I believe execve has to make a new stack when called inside of a vfork, so the parent's local stack shouldn't matter past that point; the local stack is used while running exec_process() and setup_child().

Is that not out of bounds by one byte?

I believe the C standard explicitly allows one byte past for stack+sizeof(stack) (of course, I could also be missing something here).

In any case I've pushed a new commit that avoids clone(2) on systems with a stack that grows upwards (not because it's impossible to use clone on such architectures, but because I don't have a system with that behavior).

@JohnoKing
Copy link
Author

JohnoKing commented Feb 15, 2026

The code below might work on HP PA, but I can't test it myself since I don't have an HP PA CPU.

diff --git a/src/lib/libast/comp/spawnveg.c b/src/lib/libast/comp/spawnveg.c
index 080762e7a..c26f5fc56 100644
--- a/src/lib/libast/comp/spawnveg.c
+++ b/src/lib/libast/comp/spawnveg.c
@@ -34,10 +34,6 @@
 #include <ast_tty.h>
 #include <ast_fcntl.h>
 
-#if _machine_stack_grows_up
-#undef _lib_clone  /* Don't use clone on systems where the stack grows upwards (lacks testing) */
-#endif
-
 /*
  * Set the SID, PGID and TCPGRP in the child process
  * after forking.
@@ -155,6 +151,11 @@ spawnveg_fast(const char* path, char* const argv[], char* const envv[], pid_t pg
 	pid_t		pid;
 	char		stack[STACK_SIZE];
 	struct cargs	args;
+#if _machine_stack_grows_up
+	void		*stack_top = stack;
+#else
+	void		*stack_top = stack+STACK_SIZE;
+#endif
 
 	args.path = path;
 	args.argv = (char**)argv;
@@ -163,7 +164,7 @@ spawnveg_fast(const char* path, char* const argv[], char* const envv[], pid_t pg
 	args.pgid = pgid;
 	args.tcfd = tcfd;
 	sigcritical(SIG_REG_EXEC|SIG_REG_PROC|(tcfd>=0?SIG_REG_TERM:0));
-	pid = clone(exec_process, stack+STACK_SIZE, CLONE_VM|CLONE_VFORK|SIGCHLD, &args);
+	pid = clone(exec_process, stack_top, CLONE_VM|CLONE_VFORK|SIGCHLD, &args);
 	if (pid == -1)
 		args.err = errno;
 	else if (args.err)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants