
[Bug]: Logging guide: wrong namespace, missing container guidance, outdated shipper examples #625

@danbarr

Description

Summary

The Kubernetes logging guide has several inaccuracies and gaps that make it difficult to set up log collection in practice. This was discovered while setting up log collection for ToolHive audit logs.

The vMCP audit logging guide is more accurate (it uses the correct toolhive-system namespace and has a proper configuration reference table), but shares issues 2, 3, and 5 below. Issue 4 does not apply to vMCP as those containers emit pure JSON with no plain-text startup line.

Issues found

1. Wrong default namespace in examples (operator docs only)

The guide's filtering examples (Fluentd, Filebeat, Splunk) all appear to assume the namespace is toolhive, but the default namespace used by the ToolHive operator is toolhive-system. Any log pipeline that filters by namespace path (e.g. /var/log/pods/toolhive_* in a filelog glob) will silently collect nothing.

Suggested fix: Update all namespace references in examples to toolhive-system, or note that the namespace may vary and show how to verify it (kubectl get pods -n toolhive-system).

2. No guidance on which container to target (both docs)

ToolHive pods have different container names depending on the component:

  • MCP proxy Deployment pods: container named toolhive — emits structured JSON audit logs
  • vMCP pods: container named vmcp — emits structured JSON audit logs
  • StatefulSet pods (-0 suffix, older generation): container named mcp — raw MCP server, emits plain text

The guides don't mention container names at all. Without targeting the right container (e.g. /var/log/containers/*_toolhive-system_toolhive-*.log), collectors pick up MCP server stdout (plain text, not JSON) and JSON parsing fails silently.
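As an illustration, a tail input scoped to just the proxy container could look like this. This is a sketch assuming the standard kubelet symlink layout under /var/log/containers (`<pod>_<namespace>_<container>-<id>.log`) and Fluent Bit's built-in `cri` parser; tag and refresh values are illustrative:

```ini
# Sketch: scrape only the toolhive proxy container in toolhive-system.
[INPUT]
    Name              tail
    Tag               kube.toolhive.*
    Path              /var/log/containers/*_toolhive-system_toolhive-*.log
    Parser            cri
    Refresh_Interval  10
```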

3. Outdated log shipper examples (both docs)

The guides cover Fluentd, Filebeat, and Splunk. These are not the tools most Kubernetes users reach for today. More relevant examples would be:

  • Grafana Alloy — Grafana's current recommended log collector (Promtail is deprecated in favour of Alloy), and the natural companion to Loki, which is the default dashboard backend for ToolHive's own observability stack.
  • FluentBit — the most widely deployed lightweight log scraper in Kubernetes, with first-class JSON and CRI format support.

Working FluentBit config requires more pipeline stages than the existing examples suggest: a CRI parser on the tail input, a Kubernetes filter for metadata enrichment, a second JSON parser filter to promote the log body fields (including level) to the top level of the record, and a service label for Grafana Explore Logs grouping. The docs should show a complete, working example rather than a partial snippet.
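To make the shape of that pipeline concrete, here is a rough sketch of the stages described above. It has not been verified against a live cluster; the Loki host, the `service_name` label value, and the tag routing are illustrative assumptions:

```ini
[SERVICE]
    Parsers_File parsers.conf

# Stage 1: tail with the built-in CRI parser.
[INPUT]
    Name    tail
    Tag     kube.*
    Path    /var/log/containers/*_toolhive-system_toolhive-*.log
    Parser  cri

# Stage 2: Kubernetes metadata enrichment.
[FILTER]
    Name       kubernetes
    Match      kube.*
    Merge_Log  Off

# Stage 3: second JSON parse to promote body fields (including level)
# to the top level of the record.
[FILTER]
    Name          parser
    Match         kube.*
    Key_Name      log
    Parser        json
    Reserve_Data  On

# Stage 4: service label for Grafana Explore Logs grouping
# (label name/value are assumptions for this sketch).
[FILTER]
    Name   modify
    Match  kube.*
    Set    service_name toolhive-audit

[OUTPUT]
    Name    loki
    Match   kube.*
    Host    loki.monitoring.svc
    Port    3100
    Labels  service_name=toolhive-audit
```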

The OTel Collector (filelog receiver) is an option when teams are already running it as an observability hub, but its file scraping support has rough edges and it is not commonly used as a primary log scraper. If included, it should be framed as a secondary "if you're already running OTel" option.
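For teams already running the Collector, the filelog receiver equivalent might look roughly like this (a sketch, not a definitive configuration; the `on_error: send` setting is one way to let non-JSON lines pass through unparsed rather than fail the pipeline):

```yaml
receivers:
  filelog:
    include:
      - /var/log/containers/*_toolhive-system_toolhive-*.log
    operators:
      # Parses CRI/containerd and docker runtime log formats.
      - type: container
      # Promote the JSON body fields; pass non-JSON lines through unparsed.
      - type: json_parser
        parse_from: body
        on_error: send
```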

4. Undocumented plain-text startup line breaks JSON parsers (operator docs only)

Every toolhive proxy container emits one plain-text (non-JSON) log line on startup:

Workload started successfully. Press Ctrl+C to stop.

This line is not JSON, so any log pipeline configured to parse JSON will encounter a parse error on it. The guide doesn't mention this, so users setting up log collection will hit an unexpected error on every pod start/restart with no clear explanation.

Suggested fix: Add a note such as: "Note: each proxy pod emits a single plain-text startup line (Workload started successfully. Press Ctrl+C to stop.) before structured JSON logging begins. Log pipelines should be configured to tolerate or drop non-JSON lines."
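For pipelines that prefer to drop the line rather than tolerate it, one option is a grep filter ahead of JSON parsing. A sketch, assuming Fluent Bit and that the raw message lands in the `log` key after CRI parsing:

```ini
# Drop the single plain-text startup line before JSON parsing.
[FILTER]
    Name     grep
    Match    kube.*
    Exclude  log ^Workload started successfully
```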

The root fix for this is tracked in stacklok/toolhive#4295 — once that lands, the startup line will be structured JSON and this caveat can be removed from the docs.

5. vMCP audit events use a non-standard log level (both docs)

vMCP audit events are emitted at "level":"INFO+2" — a Go log/slog artifact rather than a standard named level. This means audit events show as unknown level in Loki's detected_level field and in Grafana's Explore Logs view, and cannot be filtered by level in any standard log pipeline.

The docs do not mention this. Users who set up level-based filtering or alerting on audit events will find they are silently excluded from level-filtered views.

Suggested fix: Add a note that audit events currently appear at level INFO+2 and that level-based filters should account for this. The root fix is tracked in stacklok/toolhive#4296, which proposes logging audit events at a standard level.
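Until the root fix lands, one workaround is to remap the level in the collector. A sketch using Fluent Bit's modify filter, assuming the JSON parse stage has already promoted `level` to the top level of the record:

```ini
# Remap the non-standard slog level so level-based filters still match.
[FILTER]
    Name       modify
    Match      kube.*
    Condition  Key_value_equals level INFO+2
    Set        level info
```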

Impact

These gaps mean a user following the guide is likely to end up with a log pipeline that either silently collects nothing (wrong namespace/container), errors on startup (plain-text line), or loses audit events from level-filtered views (INFO+2 level). The structured logging feature is not usable from the docs alone without significant trial and error.

Metadata


    Labels

    bug: Something isn't working
    documentation: Improvements or additions to documentation
