
Place inner container cgroups under the pod's cgroup tree#169

Open
jamestoyer wants to merge 2 commits into coder:main from jamestoyer:feat/dind-cgroup-attribution

Conversation

@jamestoyer

Closes #168

Summary

Wrap the dockerd invocation with unshare --cgroup, a fresh cgroup2 remount, and the cgroup-delegation logic from moby's hack/dind script. After the wrapper runs, dockerd's view of /sys/fs/cgroup is rooted at the envbox container's own cgroup, so all inner container cgroups are created as descendants of that scope on the host. Host-level cgroup-aware observability tools (Tetragon, Falco, custom eBPF agents) can then attribute processes inside inner containers to the parent pod.

This is the approach Isovalent (Cilium/Tetragon vendor) recommended after testing it themselves — see moby/moby#45378 (comment).

Changes

cli/docker.go

  • Adds wrapDockerdCmd(dargs) helper that returns the unshare-prefixed command + args. The shell snippet does:
    • umount /sys/fs/cgroup + mount -t cgroup2 cgroup /sys/fs/cgroup — fresh cgroup2 mount rooted at the new cgroup namespace's root (the envbox container's cgroup on the host)
    • mkdir /sys/fs/cgroup/init + retry loop that moves all processes via xargs -rn1 < cgroup.procs > init/cgroup.procs and enables every available controller in cgroup.subtree_control — needed because cgroupv2 forbids a cgroup from having both processes and subtree_control set
    • exec into dockerd with the original args
  • Updates the three sites that launch/restart dockerd (initial start and the two IsNoSpaceErr recovery paths) to go through the wrapper.

xunix/sys.go

  • Adds a fallback in readCPUQuotaCGroupV2: if /sys/fs/cgroup/<self>/cpu.max doesn't exist (because the cgroup remount above has reparented the mount root), read /sys/fs/cgroup/cpu.max directly. The mount root IS the current cgroup in that case so the values are equivalent.

Why no --mount on unshare?

unshare --cgroup alone leaves the mount namespace shared with the parent envbox process, so the umount + mount -t cgroup2 operations leak back to envbox. We tried isolating with --mount:

  • --propagation private → breaks sysbox-fs: its per-container mounts under /var/lib/sysboxfs/<id>/ stop being visible to sysbox-runc in the dockerd namespace and inner-container creation fails.
  • --propagation slave → same failure: the parent mount points are private by default, so nothing propagates to the slave child.

Making /var/lib/sysboxfs/ (or /) rshared in envbox's mount namespace before the unshare would work but is more invasive. The pragmatic compromise: accept the mount-namespace leak and handle the one observable side effect (the CPU quota read) via the xunix/sys.go fallback. CPU enforcement is unaffected — the inner workspace container's cgroup is now a descendant of the pod's cgroup tree, so the pod's resource limits apply transitively.

Verification

Tested against:

  • containerd 2.2.1
  • Linux 6.8 / Ubuntu 22.04 with cgroup v2
  • Tetragon v1.18.1 with --enable-cgtrackerid=true

End-to-end check: after a workspace start, running whoami from a Coder terminal (inside the inner workspace container) now produces a Tetragon process_exec event with a full pod field — namespace, name, UID, container ID, image, security context, and pod labels. Before this change the same event had no pod field at all.

Workspace startup is otherwise unchanged. Sysbox-fs continues to provide its /sys and /proc virtualization correctly. The "no internal processes" cgroupv2 rule that previously blocked nested container creation under a privileged pod's .scope cgroup is now satisfied because all envbox/sysbox processes are first migrated into the /init sibling cgroup.

Backwards compatibility

The wrapper is a no-op for cgroup configurations the delegation block doesn't apply to:

  • cgroup v1 hosts: the if [ -f /sys/fs/cgroup/cgroup.controllers ] guard skips the delegation block; only the umount/mount happens.
  • Kernels without unshare --cgroup (< 4.6, very rare today): would surface as a startup error.

For cgroup v2 hosts the only behavioural change is the cgroup placement of inner containers — which is what we want.


@f0ssel f0ssel requested a review from johnstcn May 4, 2026 17:17
@f0ssel
Member

f0ssel commented May 4, 2026

Hey there, I've tagged another Coder engineer to do the approval, but I did read through and had an agent help review.

Overview

This PR wraps dockerd invocation with unshare --cgroup + cgroup2 remount + cgroupv2 delegation so inner container cgroups become descendants of the pod's cgroup tree. This enables host-level observability tools (Tetragon, Falco, eBPF agents) to attribute inner-container processes to the parent pod.

The approach is sound — it mirrors moby's canonical hack/dind pattern and was recommended by Isovalent.

CI Status

| Check | Result | Related to PR? |
| --- | --- | --- |
| build | FAIL | No — aquasecurity/trivy-action@0.34.2 version resolution failure |
| unit-tests | FAIL | Yes — TestDocker/DockerdConfigured expects dockerd but now gets unshare |
| integration-tests | FAIL | Yes — same DockerdConfigured test failure |
| lint | PASS | |
| fmt | PASS | |
| codeql | PASS | |

Issues Found

1. CRITICAL: background.Process.kill() race after exec chain

This is the most significant correctness concern. The background.Process tracks processes by binName, set from the cmd argument to New(). After wrapDockerdCmd, binName = "unshare". However, due to the exec chain (unshare → sh → dockerd), the actual process's /proc/<pid>/cmdline becomes "dockerd".

In background/process.go, the kill() function sends SIGTERM then polls isProcExited():

func isProcExited(fs afero.Fs, pid int, cmd string) (bool, error) {
    cmdline, err := afero.ReadFile(fs, fmt.Sprintf("/proc/%d/cmdline", pid))
    // ...
    args := bytes.Split(cmdline, []byte{0}) // cmdline fields are NUL-separated
    return cmd != string(args[0]), nil      // "unshare" != "dockerd" → true (thinks it exited!)
}

Since "unshare" != "dockerd", isProcExited returns true immediately after SIGTERM is sent — before dockerd has actually exited. Restart() then starts a new instance while the old one may still be binding ports/sockets. This could cause startup failures or data corruption on the two no-space recovery restart paths.

Before this PR: binName was "dockerd", and the process cmdline was dockerd, so the PID exit check worked correctly.

Fix: Either:

  • Pass "dockerd" as binName separately (requires refactoring Process)
  • Have wrapDockerdCmd return a binName hint and use it
  • Or simplest: after the kill() call in Restart, also wait for the process channel to drain before starting

2. Must-fix: Existing test DockerdConfigured is broken

The test at docker_test.go:182-191 registers a fake command matching dockerd --debug --log-level=debug .... After this PR the binary is unshare, not dockerd, so the fake is never matched. The test needs to expect unshare --cgroup /bin/sh -c <script> dockerd <args> instead.

3. Missing test coverage for new code

  • wrapDockerdCmd: No unit test. Should verify the returned command/args structure, the exec "$0" "$@" pattern, and argument passthrough.
  • readCPUQuotaCGroupV2 fallback: The existing sys_test.go has no test case where the self-relative path fails but /sys/fs/cgroup/cpu.max succeeds. Should add a CGroupV2_RemountedRoot test case.

4. Minor: Error wrapping in fallback path (xunix/sys.go:78)

if rootErr != nil {
    return CPUQuota{}, xerrors.Errorf("read cpu.max outside container: %w", err) // wraps `err`, not `rootErr`
}

When both reads fail, the error message says "read cpu.max outside container" but wraps err (the self-relative path error) rather than rootErr (the fallback error). A developer debugging this might be confused since the error path and the wrapped error don't match. Consider wrapping both:

return CPUQuota{}, xerrors.Errorf("read cpu.max: self-path: %v, root-path: %w", err, rootErr)

5. Shell script: infinite retry loop with no bound

while ! {
    xargs -rn1 < /sys/fs/cgroup/cgroup.procs > /sys/fs/cgroup/init/cgroup.procs || :
    sed -e 's/ / +/g' -e 's/^/+/' < /sys/fs/cgroup/cgroup.controllers > /sys/fs/cgroup/cgroup.subtree_control
}; do :; done

This loop has no timeout or retry limit. If the subtree_control write persistently fails (e.g., a controller that can't be enabled), this blocks dockerd startup forever with no logs. Moby's hack/dind has the same pattern, but moby runs as a developer tool, not a long-lived production component.

Consider adding a max-retry counter with a warning message, or at minimum a sleep to avoid CPU-spinning on persistent failures.

6. umount /sys/fs/cgroup without guard

The script runs umount /sys/fs/cgroup unconditionally before the cgroupv2 guard. If cgroup is not mounted at that path (unusual but possible), set -e will abort the entire script before reaching exec dockerd. This would prevent workspace startup entirely with a cryptic error. Consider umount /sys/fs/cgroup 2>/dev/null || true or a mountpoint -q check.

Things Done Well

  • Excellent PR description: The tradeoffs (no --mount, mount namespace leak) are clearly documented with reasoning.
  • Minimal footprint: Only 2 files changed, well-scoped.
  • cgroup v1 guard: The if [ -f /sys/fs/cgroup/cgroup.controllers ] correctly skips delegation on v1.
  • exec "$0" "$@" pattern: Properly replaces the shell with dockerd, preserving the PID for signal handling.
  • CPU quota fallback: The fallback to reading /sys/fs/cgroup/cpu.max at mount root is the right approach for the mount-namespace leak side effect.
  • All three dockerd launch sites updated consistently.

Recommendation

Request changes. The isProcExited / binName mismatch (issue #1) is a real correctness regression for the restart paths. The broken test (issue #2) must also be fixed. Issues #3-6 are worth discussing but less critical.

@f0ssel
Copy link
Copy Markdown
Member

f0ssel commented May 4, 2026

I believe the issue with aquasecurity/trivy-action@0.34.2 is fixed if you rebase fresh from main; the build error should be resolved there. Thanks

Comment thread cli/docker.go Outdated
Comment on lines +887 to +911
// wrapDockerdCmd wraps the dockerd invocation with `unshare --cgroup` +
// cgroup2 remount + cgroupv2 delegation setup. This creates a new cgroup
// namespace for dockerd and moves existing processes into a sibling `/init`
// cgroup so the root cgroup can enable subtree_control. Without this, cgroupv2
// "no internal processes" rule prevents Docker from creating child cgroups
// with processes under a cgroup that already has processes.
//
// Note: the `umount /sys/fs/cgroup && mount -t cgroup2 ...` leaks back into
// envbox's mount namespace because we don't use `--mount`. Using `--mount`
// would break sysbox-fs propagation (its per-container mounts under
// /var/lib/sysboxfs/ stop being visible to sysbox-runc in the dockerd NS).
// The side effect of the leak is a non-fatal "Unable to read CPU quota"
// warning when envbox later tries to read /sys/fs/cgroup/<host-path>/cpu.max
// — that path no longer exists under the remounted view. The inner
// container's CPU limits still flow through the parent pod's cgroup hierarchy,
// but cgroup-aware tools inside the workspace may see the host CPU count.
//
// The cgroup delegation logic mirrors moby's `hack/dind` wrapper:
// https://github.com/moby/moby/blob/master/hack/dind
//
// After this setup, inner container cgroups are placed under the envbox
// container's cgroup on the host. Tetragon's cgtracker can then associate
// these child cgroups with the parent pod for attribution.
//
// See also: https://github.com/moby/moby/issues/45378#issuecomment-2886261231
Member


  1. Can we permalink directly to the relevant portion https://github.com/moby/moby/blob/ee3e21b70457f90d537640613d59340db1f0178c/hack/dind#L61-L79

  2. I'd prefer if we kept the verbatim comments from upstream, if possible

Comment thread cli/docker.go Outdated
Comment on lines +914 to +915
umount /sys/fs/cgroup
mount -t cgroup2 cgroup /sys/fs/cgroup
Member


I'd prefer if we didn't do this on a cgroupv1 system. Would it make more sense to do it after the check on L916?

@jamestoyer
Author

Ok I think I've addressed the PR feedback (well I say me, I mean Claude). I have tested this out on our version of Coder and the changes are still producing the expected outcomes, which is good.

@f0ssel It doesn't look like main has updated since I started this PR so rebasing won't do anything.

@f0ssel
Member

f0ssel commented May 6, 2026

Ok I think I've addressed the PR feedback (well I say me, I mean Claude). I have tested this out on our version of Coder and the changes are still producing the expected outcomes, which is good.

@f0ssel It doesn't look like main has updated since I started this PR so rebasing won't do anything.

You are right, I was thinking this was coder/coder, I think we need to patch main here like we did there to fix this issue. CC @jdomeracki-coder

@johnstcn
Member

johnstcn commented May 7, 2026

Ok I think I've addressed the PR feedback (well I say me, I mean Claude). I have tested this out on our version of Coder and the changes are still producing the expected outcomes, which is good.
@f0ssel It doesn't look like main has updated since I started this PR so rebasing won't do anything.

You are right, I was thinking this was coder/coder, I think we need to patch main here like we did there to fix this issue. CC @jdomeracki-coder

#166

Member

@johnstcn johnstcn left a comment


Nice refactor! Some review comments below.

Also noting that the code comments are overly verbose and could be trimmed down a good bit. Claude is generally pretty good at this when you tell it to.

Comment thread background/process.go
Comment on lines +70 to +75
func (d *Process) SetBinName(name string) {
d.mu.Lock()
defer d.mu.Unlock()

d.binName = name
}
Member


After every restart, callers need to remember to set this otherwise it gets lost.
Would it make sense to instead make this a required parameter?

Comment thread cli/docker_test.go
Comment on lines +189 to +191
// dockerd is launched via an unshare wrapper that exec's into
// dockerd with these args; assert against the full wrapped argv.
wrapCmd, wrapArgs := cli.WrapDockerdCmd(dockerdArgs)
Member


This makes the test tautological. You should instead explicitly specify the expected argv.

Comment thread cli/docker.go
Comment on lines +938 to +971
shellCmd := fmt.Sprintf(`set -e
# cgroup v2: enable nesting
if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
# Remount /sys/fs/cgroup so the new cgroup namespace's view becomes the
# fs root. Inner container cgroups will then be created under the envbox
# container's cgroup on the host.
umount /sys/fs/cgroup
mount -t cgroup2 cgroup /sys/fs/cgroup

# move the processes from the root group to the /init group,
# otherwise writing subtree_control fails with EBUSY.
# An error during moving non-existent process (i.e., "cat") is ignored.
mkdir -p /sys/fs/cgroup/init
# this happens in a loop because things like "docker exec" on our dind
# container will create new processes, which creates a race between our
# moving everything to "init" and enabling subtree_control
envbox_attempts=0
while ! {
# move the processes from the root group to the /init group,
# otherwise writing subtree_control fails with EBUSY.
# An error during moving non-existent process (i.e., "cat") is ignored.
xargs -rn1 < /sys/fs/cgroup/cgroup.procs > /sys/fs/cgroup/init/cgroup.procs || :
# enable controllers
sed -e 's/ / +/g' -e 's/^/+/' < /sys/fs/cgroup/cgroup.controllers \
> /sys/fs/cgroup/cgroup.subtree_control
}; do
envbox_attempts=$((envbox_attempts + 1))
if [ "$envbox_attempts" -ge %d ]; then
echo "envbox: failed to enable cgroup.subtree_control after $envbox_attempts attempts" >&2
exit 1
fi
done
fi
exec "$0" "$@"
Member


suggestion: use //go:embed to extract this to a standalone file. You then get syntax highlighting and the ability to test it independently.

Comment thread cli/docker.go
Comment on lines +904 to +906
// dockerdSubtreeControlMaxAttempts caps the cgroup-subtree-control retry loop
// so a stuck race can't block envbox startup indefinitely.
const dockerdSubtreeControlMaxAttempts = 100
Member

@johnstcn johnstcn May 7, 2026


Worth mentioning this is a divergence from the upstream implementation. (I think we should keep it though)

Comment thread cli/docker.go
Comment on lines +944 to +945
umount /sys/fs/cgroup
mount -t cgroup2 cgroup /sys/fs/cgroup
Member


suggestion: set -e will cause this to abort with no context-specific error message, so consider something like

umount /sys/fs/cgroup || { echo "envbox: failed to umount cgroup" >&2; exit 1; }

@johnstcn
Member

johnstcn commented May 7, 2026

You are right, I was thinking this was coder/coder, I think we need to patch main here like we did there to fix this issue. CC @jdomeracki-coder

#166

This has been merged! You should be good to rebase or merge main now.



Development

Successfully merging this pull request may close these issues.

Processes inside the inner workspace container cannot be attributed to the pod by host-level observability tools

3 participants