Skip to content

feat(serve): add live pprof HTTP server to serve api command#3201

Merged
dgageot merged 1 commit into
docker:mainfrom
zampani-docker:zampani/pprof-server
Jun 22, 2026
Merged

feat(serve): add live pprof HTTP server to serve api command#3201
dgageot merged 1 commit into
docker:mainfrom
zampani-docker:zampani/pprof-server

Conversation

@zampani-docker

Copy link
Copy Markdown
Contributor

Summary

  • Adds a hidden --pprof-addr flag (and CAGENT_PPROF_ADDR env var) to serve api that starts a Go pprof HTTP server at /debug/pprof/
  • Off by default; operator must explicitly set a TCP host:port to activate — consistent with existing hidden debug flags (--cpuprofile, --memprofile)
  • Binds synchronously — a failed bind is fatal to startup rather than silently swallowed
  • Non-loopback binding emits a runtime Warn log matching the existing warnIfNotLoopback pattern for the API listener

Design details

  • Dedicated http.ServeMux (not DefaultServeMux) — avoids polluting any other listener with pprof routes
  • ReadHeaderTimeout: 10s guards slow-loris; WriteTimeout intentionally omitted — profile/trace captures legitimately run tens of seconds
  • Shutdown via context.WithoutCancel(ctx) + 5s timeout, so the grace window isn't pre-expired when the parent context is already cancelled
  • TCP-only (unix://, npipe://, fd:// not supported; documented in flag help and doc comment)
  • Lint-clean: uses (&net.ListenConfig{}).Listen(ctx, "tcp", addr) (noctx), context.WithoutCancel (gosec G118)

Motivation

Harbor-watch (serve api long-lived deployment) has suspected memory leaks (per-LLM-call HTTP transport churn, session goroutine lifecycle gaps). This is the prerequisite for empirical investigation — capturing heap/goroutine profiles from a running instance.

Test plan

  • go test ./cmd/root/... ./pkg/profiling/... passes
  • golangci-lint run ./cmd/root/... ./pkg/profiling/... — 0 issues
  • Start serve api --pprof-addr 127.0.0.1:6060 ./agents and confirm curl http://localhost:6060/debug/pprof/ returns the pprof index
  • Confirm --pprof-addr does not appear in docker-agent serve api --help
  • In harbor-watch: make run-harbor-watch-api-debug, then go tool pprof http://localhost:6060/debug/pprof/heap

🤖 Generated with Claude Code

Adds a hidden --pprof-addr flag (and CAGENT_PPROF_ADDR env var) to
`serve api` that starts a dedicated pprof HTTP server at /debug/pprof/.
Activated via env var in harbor-watch compose for memory leak profiling.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

@docker-agent docker-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assessment: 🟡 NEEDS ATTENTION

Two LIKELY medium-severity concerns in the new pprof server implementation. One drafter hypothesis (custom mux missing heap/goroutine routes) was dismissed by verification — pprof.Index correctly serves named runtime profiles via URL path parsing, so the five explicit registrations are sufficient and correct.

Comment thread cmd/root/api.go
}

if pprofAddr := cmp.Or(f.pprofAddr, os.Getenv("CAGENT_PPROF_ADDR")); pprofAddr != "" {
if err := profiling.StartPprofServer(ctx, pprofAddr); err != nil {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] pprof bind failure aborts the entire serve api command

If the address set via --pprof-addr / CAGENT_PPROF_ADDR is unavailable (e.g. port already in use, address already bound by another process, or a typo), StartPprofServer returns an error that propagates here — preventing the main API listener from ever starting.

This is a disproportionate failure mode for a debug-only facility: a misconfigured CAGENT_PPROF_ADDR environment variable in a production deployment would silently take the entire agent down. Consider logging the error and continuing rather than returning it fatally, or at minimum documenting that the env-var can cause a hard service outage:

if err := profiling.StartPprofServer(ctx, pprofAddr); err != nil {
    // OPTION A – degrade gracefully
    slog.WarnContext(ctx, "pprof server failed to start; continuing without profiling", "error", err)
    // OPTION B – keep fatal, but make the impact explicit in the flag help text
}

srv := &http.Server{Handler: mux, ReadHeaderTimeout: 10 * time.Second}

slog.InfoContext(ctx, "pprof server listening", "addr", ln.Addr().String())
if tcpAddr, ok := ln.Addr().(*net.TCPAddr); ok && !tcpAddr.IP.IsLoopback() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] Non-loopback pprof binding warns but does not refuse — unauthenticated memory access over the network

When a non-loopback address (e.g. 0.0.0.0:6060) is used, the warning is logged and the server starts anyway. Unlike the main API listener — which requires bearer-token authentication — pprof requires no authentication and exposes:

  • /debug/pprof/heap — full heap dump (may contain in-memory secrets, tokens, TLS keys)
  • /debug/pprof/cmdline — process command-line arguments (may contain passwords)
  • /debug/pprof/goroutine — goroutine stacks (code paths, internal state)

The existing warnIfNotLoopback for the main API listener is a reasonable pattern because that listener is authenticated. For an unauthenticated endpoint, a warn-and-proceed policy means any actor who can set CAGENT_PPROF_ADDR=0.0.0.0:6060 in the environment (e.g. via a Docker Compose override or a K8s config map) can read process memory from the network with no further credentials.

Consider either rejecting non-loopback addresses outright, or making the security implication explicit in user-facing documentation beyond the hidden flag help text.

@zampani-docker zampani-docker marked this pull request as ready for review June 22, 2026 16:52
@zampani-docker zampani-docker requested a review from a team as a code owner June 22, 2026 16:52
@dgageot dgageot merged commit 372c1f8 into docker:main Jun 22, 2026
11 checks passed

@docker-agent docker-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assessment: 🟢 APPROVE

The pprof server implementation is well-structured: synchronous bind (fatal on failure), dedicated http.ServeMux (no DefaultServeMux pollution in practice), loopback warning matching existing project patterns, ReadHeaderTimeout for slow-loris protection, and correct context.WithoutCancel usage for shutdown. One minor logging inconsistency found (see inline comment).

Reviewed: cmd/root/api.go, pkg/profiling/pprof_server.go


go func() {
if err := srv.Serve(ln); err != nil && !errors.Is(err, http.ErrServerClosed) {
slog.Warn("pprof server error", "error", err)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[LOW] slog.Warn used without context — loses trace IDs and structured context values

The serve goroutine calls slog.Warn(...) (context-free variant) while every other log statement in this file uses a context-aware variant (slog.InfoContext, slog.WarnContext). The goroutine closes over ctx, so the context is available and should be passed.

// Current (loses ctx values):
slog.Warn("pprof server error", "error", err)

// Should be:
slog.WarnContext(ctx, "pprof server error", "error", err)

This loses any trace IDs or structured context fields attached to ctx at the time of the error, making it harder to correlate this log entry with the request/session that triggered the failure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants