
Place inner container cgroups under the pod's cgroup tree#169

Open
jamestoyer wants to merge 2 commits into coder:main from jamestoyer:feat/dind-cgroup-attribution

Conversation

@jamestoyer

Closes #168

Summary

Wrap the dockerd invocation with unshare --cgroup, a fresh cgroup2 remount, and the cgroup-delegation logic from moby's hack/dind script. After the wrapper runs, dockerd's view of /sys/fs/cgroup is rooted at the envbox container's own cgroup, so all inner container cgroups are created as descendants of that scope on the host. Host-level cgroup-aware observability tools (Tetragon, Falco, custom eBPF agents) can then attribute processes inside inner containers to the parent pod.

This is the approach Isovalent (Cilium/Tetragon vendor) recommended after testing it themselves — see moby/moby#45378 (comment).

Changes

cli/docker.go

  • Adds wrapDockerdCmd(dargs) helper that returns the unshare-prefixed command + args. The shell snippet does:
    • umount /sys/fs/cgroup + mount -t cgroup2 cgroup /sys/fs/cgroup — fresh cgroup2 mount rooted at the new cgroup namespace's root (the envbox container's cgroup on the host)
    • mkdir /sys/fs/cgroup/init + retry loop that moves all processes via xargs -rn1 < cgroup.procs > init/cgroup.procs and enables every available controller in cgroup.subtree_control — needed because cgroupv2 forbids a cgroup from having both processes and subtree_control set
    • exec into dockerd with the original args
  • Updates the three sites that launch/restart dockerd (initial start and the two IsNoSpaceErr recovery paths) to go through the wrapper.

xunix/sys.go

  • Adds a fallback in readCPUQuotaCGroupV2: if /sys/fs/cgroup/<self>/cpu.max doesn't exist (because the cgroup remount above has reparented the mount root), read /sys/fs/cgroup/cpu.max directly. The mount root IS the current cgroup in that case so the values are equivalent.

Why no --mount on unshare?

unshare --cgroup alone leaves the mount namespace shared with the parent envbox process, so the umount + mount -t cgroup2 operations leak back to envbox. We tried isolating with --mount:

  • --propagation private → breaks sysbox-fs: its per-container mounts under /var/lib/sysboxfs/<id>/ stop being visible to sysbox-runc in the dockerd namespace and inner-container creation fails.
  • --propagation slave → same failure: the parent mount points are private by default, so nothing propagates to the slave child.

Making /var/lib/sysboxfs/ (or /) rshared in envbox's mount namespace before the unshare would work but is more invasive. The pragmatic compromise: accept the mount-namespace leak and handle the one observable side effect (the CPU quota read) via the xunix/sys.go fallback. CPU enforcement is unaffected — the inner workspace container's cgroup is now a descendant of the pod's cgroup tree, so the pod's resource limits apply transitively.

Verification

Tested against:

  • containerd 2.2.1
  • Linux 6.8 / Ubuntu 22.04 with cgroup v2
  • Tetragon v1.18.1 with --enable-cgtrackerid=true

End-to-end check: after a workspace start, running whoami from a Coder terminal (inside the inner workspace container) now produces a Tetragon process_exec event with a full pod field — namespace, name, UID, container ID, image, security context, and pod labels. Before this change the same event had no pod field at all.

Workspace startup is otherwise unchanged. Sysbox-fs continues to provide its /sys and /proc virtualization correctly. The "no internal processes" cgroupv2 rule that previously blocked nested container creation under a privileged pod's .scope cgroup is now satisfied because all envbox/sysbox processes are first migrated into the /init sibling cgroup.

Backwards compatibility

The wrapper is a no-op for cgroup configurations the delegation block doesn't apply to:

  • cgroup v1 hosts: the if [ -f /sys/fs/cgroup/cgroup.controllers ] guard skips the delegation block; only the umount/mount happens.
  • Kernels without unshare --cgroup (< 4.6, very rare today): would surface as a startup error.

For cgroup v2 hosts the only behavioural change is the cgroup placement of inner containers — which is what we want.


@f0ssel f0ssel requested a review from johnstcn May 4, 2026 17:17
@f0ssel
Member

f0ssel commented May 4, 2026

Hey there, I've tagged another Coder engineer to do the approval, but I did read through and had an agent help review.

Overview

This PR wraps dockerd invocation with unshare --cgroup + cgroup2 remount + cgroupv2 delegation so inner container cgroups become descendants of the pod's cgroup tree. This enables host-level observability tools (Tetragon, Falco, eBPF agents) to attribute inner-container processes to the parent pod.

The approach is sound — it mirrors moby's canonical hack/dind pattern and was recommended by Isovalent.

CI Status

| Check | Result | Related to PR? |
| --- | --- | --- |
| build | FAIL | No — aquasecurity/trivy-action@0.34.2 version resolution failure |
| unit-tests | FAIL | Yes — TestDocker/DockerdConfigured expects dockerd but now gets unshare |
| integration-tests | FAIL | Yes — same DockerdConfigured test failure |
| lint | PASS | |
| fmt | PASS | |
| codeql | PASS | |

Issues Found

1. CRITICAL: background.Process.kill() race after exec chain

This is the most significant correctness concern. The background.Process tracks processes by binName, set from the cmd argument to New(). After wrapDockerdCmd, binName = "unshare". However, due to the exec chain (unshare → sh → dockerd), the actual process's /proc/<pid>/cmdline becomes "dockerd".

In background/process.go, the kill() function sends SIGTERM then polls isProcExited():

func isProcExited(fs afero.Fs, pid int, cmd string) (bool, error) {
    cmdline, err := afero.ReadFile(fs, fmt.Sprintf("/proc/%d/cmdline", pid))
    // ...
    args := bytes.Split(cmdline, []byte{0}) // cmdline fields are NUL-separated
    return cmd != string(args[0]), nil      // "unshare" != "dockerd" → true (thinks it exited!)
}

Since "unshare" != "dockerd", isProcExited returns true immediately after SIGTERM is sent — before dockerd has actually exited. Restart() then starts a new instance while the old one may still be binding ports/sockets. This could cause startup failures or data corruption on the two no-space recovery restart paths.

Before this PR: binName was "dockerd", and the process cmdline was dockerd, so the PID exit check worked correctly.

Fix: Either:

  • Pass "dockerd" as binName separately (requires refactoring Process)
  • Have wrapDockerdCmd return a binName hint and use it
  • Or simplest: after the kill() call in Restart, also wait for the process channel to drain before starting

2. Must-fix: Existing test DockerdConfigured is broken

The test at docker_test.go:182-191 registers a fake command matching dockerd --debug --log-level=debug .... After this PR the binary is unshare, not dockerd, so the fake is never matched. The test needs to expect unshare --cgroup /bin/sh -c <script> dockerd <args> instead.

3. Missing test coverage for new code

  • wrapDockerdCmd: No unit test. Should verify the returned command/args structure, the exec "$0" "$@" pattern, and argument passthrough.
  • readCPUQuotaCGroupV2 fallback: The existing sys_test.go has no test case where the self-relative path fails but /sys/fs/cgroup/cpu.max succeeds. Should add a CGroupV2_RemountedRoot test case.

4. Minor: Error wrapping in fallback path (xunix/sys.go:78)

if rootErr != nil {
    return CPUQuota{}, xerrors.Errorf("read cpu.max outside container: %w", err) // wraps `err`, not `rootErr`
}

When both reads fail, the error message says "read cpu.max outside container" but wraps err (the self-relative path error) rather than rootErr (the fallback error). A developer debugging this might be confused since the error path and the wrapped error don't match. Consider wrapping both:

return CPUQuota{}, xerrors.Errorf("read cpu.max: self-path: %v, root-path: %w", err, rootErr)

5. Shell script: infinite retry loop with no bound

while ! {
    xargs -rn1 < /sys/fs/cgroup/cgroup.procs > /sys/fs/cgroup/init/cgroup.procs || :
    sed -e 's/ / +/g' -e 's/^/+/' < /sys/fs/cgroup/cgroup.controllers > /sys/fs/cgroup/cgroup.subtree_control
}; do :; done

This loop has no timeout or retry limit. If the subtree_control write persistently fails (e.g., a controller that can't be enabled), this blocks dockerd startup forever with no logs. Moby's hack/dind has the same pattern, but moby runs as a developer tool, not a long-lived production component.

Consider adding a max-retry counter with a warning message, or at minimum a sleep to avoid CPU-spinning on persistent failures.

6. umount /sys/fs/cgroup without guard

The script runs umount /sys/fs/cgroup unconditionally before the cgroupv2 guard. If cgroup is not mounted at that path (unusual but possible), set -e will abort the entire script before reaching exec dockerd. This would prevent workspace startup entirely with a cryptic error. Consider umount /sys/fs/cgroup 2>/dev/null || true or a mountpoint -q check.

Things Done Well

  • Excellent PR description: The tradeoffs (no --mount, mount namespace leak) are clearly documented with reasoning.
  • Minimal footprint: Only 2 files changed, well-scoped.
  • cgroup v1 guard: The if [ -f /sys/fs/cgroup/cgroup.controllers ] correctly skips delegation on v1.
  • exec "$0" "$@" pattern: Properly replaces the shell with dockerd, preserving the PID for signal handling.
  • CPU quota fallback: The fallback to reading /sys/fs/cgroup/cpu.max at mount root is the right approach for the mount-namespace leak side effect.
  • All three dockerd launch sites updated consistently.

Recommendation

Request changes. The isProcExited / binName mismatch (issue #1) is a real correctness regression for the restart paths. The broken test (issue #2) must also be fixed. Issues #3-6 are worth discussing but less critical.

@f0ssel
Copy link
Copy Markdown
Member

f0ssel commented May 4, 2026

I believe the issue with aquasecurity/trivy-action@0.34.2 is fixed if you rebase fresh from main; the build error should be resolved there. Thanks

Comment thread cli/docker.go Outdated
Comment on lines +887 to +911
// wrapDockerdCmd wraps the dockerd invocation with `unshare --cgroup` +
// cgroup2 remount + cgroupv2 delegation setup. This creates a new cgroup
// namespace for dockerd and moves existing processes into a sibling `/init`
// cgroup so the root cgroup can enable subtree_control. Without this, cgroupv2
// "no internal processes" rule prevents Docker from creating child cgroups
// with processes under a cgroup that already has processes.
//
// Note: the `umount /sys/fs/cgroup && mount -t cgroup2 ...` leaks back into
// envbox's mount namespace because we don't use `--mount`. Using `--mount`
// would break sysbox-fs propagation (its per-container mounts under
// /var/lib/sysboxfs/ stop being visible to sysbox-runc in the dockerd NS).
// The side effect of the leak is a non-fatal "Unable to read CPU quota"
// warning when envbox later tries to read /sys/fs/cgroup/<host-path>/cpu.max
// — that path no longer exists under the remounted view. The inner
// container's CPU limits still flow through the parent pod's cgroup hierarchy,
// but cgroup-aware tools inside the workspace may see the host CPU count.
//
// The cgroup delegation logic mirrors moby's `hack/dind` wrapper:
// https://github.com/moby/moby/blob/master/hack/dind
//
// After this setup, inner container cgroups are placed under the envbox
// container's cgroup on the host. Tetragon's cgtracker can then associate
// these child cgroups with the parent pod for attribution.
//
// See also: https://github.com/moby/moby/issues/45378#issuecomment-2886261231
Member


  1. Can we permalink directly to the relevant portion https://github.com/moby/moby/blob/ee3e21b70457f90d537640613d59340db1f0178c/hack/dind#L61-L79

  2. I'd prefer if we kept the verbatim comments from upstream, if possible

Comment thread cli/docker.go Outdated
Comment on lines +914 to +915
umount /sys/fs/cgroup
mount -t cgroup2 cgroup /sys/fs/cgroup
Member


I'd prefer if we didn't do this on a cgroupv1 system. Would it make more sense to do it after the check on L916?

@jamestoyer
Author

Ok I think I've addressed the PR feedback (well I say me, I mean Claude). I have tested this out on our version of Coder and the changes are still producing the expected outcomes, which is good.

@f0ssel It doesn't look like main has updated since I started this PR so rebasing won't do anything.

@f0ssel
Member

f0ssel commented May 6, 2026

Ok I think I've addressed the PR feedback (well I say me, I mean Claude). I have tested this out on our version of Coder and the changes are still producing the expected outcomes, which is good.

@f0ssel It doesn't look like main has updated since I started this PR so rebasing won't do anything.

You are right, I was thinking this was coder/coder, I think we need to patch main here like we did there to fix this issue. CC @jdomeracki-coder

@johnstcn
Member

johnstcn commented May 7, 2026

Ok I think I've addressed the PR feedback (well I say me, I mean Claude). I have tested this out on our version of Coder and the changes are still producing the expected outcomes, which is good.
@f0ssel It doesn't look like main has updated since I started this PR so rebasing won't do anything.

You are right, I was thinking this was coder/coder, I think we need to patch main here like we did there to fix this issue. CC @jdomeracki-coder

#166

Member

@johnstcn johnstcn left a comment


Nice refactor! Some review comments below.

Also noting that the code comments are overly verbose and could be trimmed down a good bit. Claude is generally pretty good at this when you tell it to.

Comment thread background/process.go
Comment on lines +70 to +75
func (d *Process) SetBinName(name string) {
d.mu.Lock()
defer d.mu.Unlock()

d.binName = name
}
Member


After every restart, callers need to remember to set this otherwise it gets lost.
Would it make sense to instead make this a required parameter?

Comment thread cli/docker_test.go
Comment on lines +189 to +191
// dockerd is launched via an unshare wrapper that exec's into
// dockerd with these args; assert against the full wrapped argv.
wrapCmd, wrapArgs := cli.WrapDockerdCmd(dockerdArgs)
Member


This makes the test tautological. You should instead explicitly specify the expected argv.

Comment thread cli/docker.go
Comment on lines +938 to +971
shellCmd := fmt.Sprintf(`set -e
# cgroup v2: enable nesting
if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
# Remount /sys/fs/cgroup so the new cgroup namespace's view becomes the
# fs root. Inner container cgroups will then be created under the envbox
# container's cgroup on the host.
umount /sys/fs/cgroup
mount -t cgroup2 cgroup /sys/fs/cgroup

# move the processes from the root group to the /init group,
# otherwise writing subtree_control fails with EBUSY.
# An error during moving non-existent process (i.e., "cat") is ignored.
mkdir -p /sys/fs/cgroup/init
# this happens in a loop because things like "docker exec" on our dind
# container will create new processes, which creates a race between our
# moving everything to "init" and enabling subtree_control
envbox_attempts=0
while ! {
# move the processes from the root group to the /init group,
# otherwise writing subtree_control fails with EBUSY.
# An error during moving non-existent process (i.e., "cat") is ignored.
xargs -rn1 < /sys/fs/cgroup/cgroup.procs > /sys/fs/cgroup/init/cgroup.procs || :
# enable controllers
sed -e 's/ / +/g' -e 's/^/+/' < /sys/fs/cgroup/cgroup.controllers \
> /sys/fs/cgroup/cgroup.subtree_control
}; do
envbox_attempts=$((envbox_attempts + 1))
if [ "$envbox_attempts" -ge %d ]; then
echo "envbox: failed to enable cgroup.subtree_control after $envbox_attempts attempts" >&2
exit 1
fi
done
fi
exec "$0" "$@"
Member


suggestion: use //go:embed to extract this to a standalone file. You then get syntax highlighting and the ability to test it independently.

Comment thread cli/docker.go
Comment on lines +904 to +906
// dockerdSubtreeControlMaxAttempts caps the cgroup-subtree-control retry loop
// so a stuck race can't block envbox startup indefinitely.
const dockerdSubtreeControlMaxAttempts = 100
Member

@johnstcn johnstcn May 7, 2026


Worth mentioning this is a divergence from the upstream implementation. (I think we should keep it though)

Comment thread cli/docker.go
Comment on lines +944 to +945
umount /sys/fs/cgroup
mount -t cgroup2 cgroup /sys/fs/cgroup
Member


suggestion: set -e will cause this to abort with no context-specific error message, so consider something like

umount /sys/fs/cgroup || { echo "envbox: failed to umount cgroup" >&2; exit 1; }

@johnstcn
Member

johnstcn commented May 7, 2026

You are right, I was thinking this was coder/coder, I think we need to patch main here like we did there to fix this issue. CC @jdomeracki-coder

#166

This has been merged! You should be good to rebase or merge main now.



Development

Successfully merging this pull request may close these issues.

Processes inside the inner workspace container cannot be attributed to the pod by host-level observability tools

3 participants