Skip to content

fix(charts): cap group-by time-series to top-N series to prevent OOM#2429

Open
wrn14897 wants to merge 10 commits into
mainfrom
sao-paulo
Open

fix(charts): cap group-by time-series to top-N series to prevent OOM#2429
wrn14897 wants to merge 10 commits into
mainfrom
sao-paulo

Conversation

@wrn14897

@wrn14897 wrn14897 commented Jun 9, 2026

Copy link
Copy Markdown
Member

Summary

High-cardinality group-by time charts could pull hundreds of thousands of series into a single tile — only ~60 were ever drawn (HARD_LINES_LIMIT), but every series was still fetched, JSON-parsed, and zero-filled into a dense buckets×series matrix in memory, OOMing the browser tab. This adds an opt-in query-level seriesLimit: when set on a group-by + granularity chart, renderChartConfig emits a CTE (__hdx_series_limit) that keeps only the top-N series ranked by max value in any bucket and restricts the outer query to those groups (mirroring the existing aggFn=increase TopGroups path). The time-chart display path (convertToTimeChartConfig) sets this cap, while alerts and other renderChartConfig consumers leave it unset so their series evaluation is unchanged.

The cap is also configurable per team: a new seriesLimit team setting (alongside the existing ClickHouse client settings) lets teams tune it, defaulting to 100 when unset and floored at 1 (no upper bound, so teams can intentionally fetch more series). The default and the rendered-line cap (HARD_LINES_LIMIT) move together; a team value only governs how many series are fetched, with any surplus available in the series selector. Verified end-to-end against real ClickHouse (50 series → top 5) plus unit tests; all existing SQL snapshots are unchanged.

How to test on Vercel preview

Preview routes: /team

Steps:

  1. Open /team.
  2. Scroll to the "ClickHouse Client Settings" section.
  3. Verify a "Time Chart Series Limit" setting is listed with a default of 100.

References

  • Linear Issue: HDX-4499
  • Related PRs:

High-cardinality group-by time charts could pull hundreds of thousands of
series into a single tile (only ~60 were ever drawn), zero-filling a dense
buckets×series matrix in memory and OOMing the browser tab.

Add an opt-in query-level `seriesLimit`: when set on a group-by + granularity
chart, renderChartConfig emits a TopGroups CTE that keeps only the top-N series
by max value in any bucket and restricts the outer query to those groups. The
time-chart display path sets seriesLimit=60 (matching HARD_LINES_LIMIT); alerts
and other renderChartConfig consumers leave it unset, so their series evaluation
is unchanged.
@changeset-bot

changeset-bot Bot commented Jun 9, 2026

Copy link
Copy Markdown

🦋 Changeset detected

Latest commit: 072d3ad

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 4 packages
Name Type
@hyperdx/common-utils Patch
@hyperdx/app Patch
@hyperdx/api Patch
@hyperdx/otel-collector Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@vercel

vercel Bot commented Jun 9, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
hyperdx-oss Ready Ready Preview, Comment Jun 9, 2026 7:09pm
hyperdx-storybook Ready Ready Preview, Comment Jun 9, 2026 7:09pm

Request Review

@github-actions github-actions Bot added the review/tier-3 Standard — full human review required label Jun 9, 2026
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

🔴 Tier 4 — Critical

Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.

Why this tier:

  • Critical-path files (1):
    • packages/api/src/models/team.ts
  • Cross-layer change: touches frontend (packages/app) + backend (packages/api) + shared utils (packages/common-utils)

Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
SLA: Schedule synchronous review within 2 business days.

Stats
  • Production files changed: 9
  • Production lines changed: 152 (+ 478 in test files, excluded from tier calculation)
  • Branch: sao-paulo
  • Author: wrn14897

To override this classification, remove the review/tier-4 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

@greptile-apps

greptile-apps Bot commented Jun 9, 2026

Copy link
Copy Markdown

Greptile Summary

This PR introduces an opt-in query-level seriesLimit that caps high-cardinality group-by time charts to a top-N series via a ClickHouse CTE (__hdx_series_limit), preventing browser OOM from dense data matrices. The cap defaults to 100 and is configurable per team via a new setting in the Team ClickHouse settings UI.

  • Core SQL change (renderSeriesLimitCte): emits a two-level aggregation CTE that ranks groups by their max per-bucket value, restricting the outer query's WHERE to the top-N groups via a tuple(...) IN (...) predicate. Guards gate it off for metric sources, missing group-by/granularity, and string selects.
  • HARD_LINES_LIMIT raised from 60 → 100 (HDXMultiSeriesTimeChart): now tied to DEFAULT_SERIES_LIMIT so fetch cap and render cap move together; teams with many series will now render up to 100 SVG elements per tile instead of 60.
  • Team settings UI and Mongoose model wired to expose and persist seriesLimit per team, with comprehensive unit and integration tests covering multi-column group-bys, NULL exclusion, alias stripping, and string group-bys.

Confidence Score: 5/5

Safe to merge — the CTE logic is well-guarded and the end-to-end test suite covers the tricky edge cases.

The series-cap CTE is carefully guarded against every edge case (metric sources, missing group-by/granularity, string selects, selectGroupBy === false), aliases are stripped before use in tuple/IS NOT NULL contexts, and the WHERE ordering ensures the CTE sees the original pre-predicate conditions. Integration tests run against real ClickHouse confirm top-N ranking, NULL exclusion, Map-key group-bys, and alias preservation in output. No correctness issues were found.

No files require special attention.

Important Files Changed

Filename Overview
packages/common-utils/src/core/renderChartConfig.ts New renderSeriesLimitCte function correctly guards against metric sources, string selects, missing group-by/granularity, and selectGroupBy === false; aliases are stripped before tuple/IS NOT NULL; WHERE in CTE inner query correctly uses the original (pre-predicate) where clause.
packages/app/src/HDXMultiSeriesTimeChart.tsx HARD_LINES_LIMIT raised from 60 to 100 (tied to DEFAULT_SERIES_LIMIT); intentional per PR description so fetch cap and render cap move together.
packages/app/src/ChartUtils.tsx convertToTimeChartConfig correctly floors seriesLimit at 1 and only injects it into builder configs; non-builder SQL configs are unaffected.
packages/app/src/components/DBTimeChart.tsx api.useMe() call moved before useMemo; query correctly gated behind !isLoadingMe so seriesLimit is stable before first fetch.
packages/app/src/components/SearchTotalCountChart.tsx api.useMe() moved before useMemo for consistency; enabled: !isLoadingMe guard preserved, maintaining query de-duplication with DBTimeChart.
packages/common-utils/src/tests/queryChartConfig.int.test.ts Four new integration tests cover: basic top-N ranking, multi-column string group-by with Map access, NULL exclusion, and alias stripping inside CTEs — all running against real ClickHouse.
packages/common-utils/src/tests/renderChartConfig.test.ts Seven new unit tests verify CTE emission, tuple packing, null filtering, alias stripping, string group-by splitting, and metric-source gating; all existing SQL snapshots unchanged.
packages/common-utils/src/types.ts seriesLimit added to SelectSQLStatementSchema and TeamClickHouseSettingsSchema/Update with z.number().int().positive() validation; consistent with existing optional field patterns.
packages/app/src/components/TeamSettings/TeamQueryConfigSection.tsx New ClickhouseSettingForm for seriesLimit follows the same pattern as existing settings with correct type, min, and default value.
packages/api/src/models/team.ts seriesLimit: Number added to the team Mongoose schema alongside the other ClickHouse client settings; no migration needed for an optional field.

Reviews (11): Last reviewed commit: "chore(charts): remove non-essential comm..." | Re-trigger Greptile

Comment thread packages/common-utils/src/core/renderChartConfig.ts Outdated
Comment thread packages/common-utils/src/core/renderChartConfig.ts Outdated
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

E2E Test Results

All tests passed • 200 passed • 3 skipped • 1262s

Status Count
✅ Passed 200
❌ Failed 0
⚠️ Flaky 2
⏭️ Skipped 3

Tests ran across 4 shards in parallel.

View full report →

@github-actions github-actions Bot added review/tier-4 Critical — deep review + domain expert sign-off and removed review/tier-3 Standard — full human review required labels Jun 9, 2026
Add a `seriesLimit` team setting alongside the existing ClickHouse client
settings so teams can tune the top-N series cap that bounds time-chart memory.
`convertToTimeChartConfig` now reads `me.team.seriesLimit`, falling back to the
default (60) when unset and floored at 1. No upper bound — teams may
intentionally fetch more series than the 60 rendered at once (the surplus is
available in the series selector).

The setting flows through the existing team-settings pipeline (Zod schema,
Mongoose model, /me, PATCH /team/clickhouse-settings, ClickhouseSettingForm); no
controller or endpoint changes were needed.
karl-power
karl-power previously approved these changes Jun 9, 2026

@karl-power karl-power left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just missing a changeset

…Config

Replace the `baseWithClauses` rename and two nested ternaries with a plain
`let withClauses`/`let where` plus a single `if (seriesCap)` fold. Behaviour is
unchanged — the ranking CTE is still appended to the already-rendered WITH
clause (rather than chartConfig.with, which would disable the main SELECT's
materialized-column optimization). Byte-identical SQL; all snapshots unchanged.
A multi-column string group-by (e.g. "LogAttributes['cap'],ServiceName") made
the series-cap ranking CTE emit an invalid two-argument toString() and a
malformed NULL check, because renderSelectList returns a comma-joined string as
a single expression. Split it into per-column expressions with
splitAndTrimWithBracket (which respects []/()/quotes) so the NULL/empty filter
applies per column. Array and single-column-string group-bys are unchanged
(snapshots unchanged); covered by a renderChartConfig unit test and a real-CH
integration test mirroring the failing Map-access case.
…es cap

The series-cap ranking CTE previously dropped both NULL and empty-string group
values. Empty-string groups (e.g. a missing Map key like
LogAttributes['x'] -> '') are real data, so silently hiding them was surprising.
Now only NULL components are excluded — which is the genuine technical need,
since the outer `tuple(...) IN (...)` is NULL-unsafe (transform_null_in=0) and
would otherwise waste a top-N slot on a group it can't match. Empty-string
groups now compete for a slot like any other value.
If a group-by item carried an alias, renderSelectList appended ` AS "alias"`,
which then leaked into tuple(...) and `(... IS NOT NULL)` in the series-cap CTE —
both invalid SQL there (the outer GROUP BY tolerates aliases, these positions do
not). Strip the alias before building the tuple and null filter, matching how the
rank value is already rendered. No alias is set on group-bys today, so this is a
defensive fix; covered by a new unit test.
…rray, metric gating)

Add integration tests against real ClickHouse for the corner cases the recent
series-cap fixes target but unit tests can't fully prove:
- NULL group component excluded (NULL-unsafe `tuple() IN (...)` would otherwise
  waste the only top-N slot and yield an empty chart);
- multi-column *array* group-by with an alias (proves the 2-col tuple()/IN
  executes and the alias is stripped inside the CTE yet preserved in output).

Add a unit test that a metric source does not emit the series-cap CTE (gating).
Comment thread packages/app/src/components/TeamSettings/TeamQueryConfigSection.tsx Outdated
Comment thread packages/app/src/ChartUtils.tsx Outdated
Comment thread packages/common-utils/src/types.ts Outdated
Comment thread packages/common-utils/src/types.ts Outdated
Comment thread packages/common-utils/src/core/renderChartConfig.ts Outdated
Bumps DEFAULT_SERIES_LIMIT 60 -> 100. Since MAX_TIME_CHART_SERIES and
HARD_LINES_LIMIT derive from it, this raises both the default query-time cap and
the rendered-line cap (and ServicesDashboard's MAX_NUM_SERIES) to 100. Teams can
still override the query cap via the seriesLimit setting.
… tooltip

Drop the explanatory comments added across the series-cap changes (keeping only
the ones inside renderSeriesLimitCte), and simplify the "Time Chart Series
Limit" tooltip to "Maximum number of series fetched per time chart."
@greptile-apps

greptile-apps Bot commented Jun 9, 2026

Copy link
Copy Markdown

Want your agent to iterate on Greptile's feedback? Try greploops.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

automerge review/tier-4 Critical — deep review + domain expert sign-off

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants