
feat(gooddata-sdk): deliver HLL / aggregate-aware LDM surfaces #1589

Merged
jaceksan merged 9 commits into master from jacek/meta
May 7, 2026


@jaceksan
Contributor

@jaceksan jaceksan commented May 7, 2026

Summary

Brings the Python SDK in line with the gdc-nas + gdc-ui aggregate-aware LDM
work so HyperLogLog (HLL) workloads round-trip end-to-end.

Highlights:

  • api-client: regenerated against the AAC-enabled OpenAPI spec, picking up
    SourceReferenceIdentifier (polymorphic fact|attribute reference) and the
    new AUXILIARY / pre-aggregation / AI-Lake surfaces. Makefile hardened with
    NUL-byte stripping, jq cycle-break for the openapi-generator allOf cycles,
    and a Python post-processor under scripts/postprocess_api_client.py so
    future regenerations "just work".
  • catalog/identifier: CatalogFactIdentifier now writes
    SourceReferenceIdentifier, accepting FACT or ATTRIBUTE targets — the
    AFM-side counterpart of HLL APPROXIMATE_COUNT aggregated_facts.
  • declarative LDM: added a typed Literal["NORMAL", "AUXILIARY"] | None
    type field on datasets and relaxed source_column optionality on
    Attribute/Fact/Label so AUXILIARY datasets (which carry no physical mapping)
    round-trip cleanly. Documented Option 3 (factory constructors) as a future
    refactoring TODO.
  • AI Lake service: new catalog_ai_lake.analyze_statistics /
    get_operation / wait_for_operation with a typed
    CatalogAILakeOperation handle (OperationStatus = Literal["pending", "succeeded", "failed"]).
  • Org-level HLL_TYPE setting: set_hll_type / get_hll_type typed
    helpers (HLLType = Literal["Native", "Presto"]) over the generic
    organization-setting machinery, with a create-or-update fallback.
  • WASM: pinned gooddata-code-convertors>=11.35.0a2 — that prerelease
    carries the fix for APPROXIMATE_COUNT-of-attribute targets emitting
    reference.type="attribute" (was "fact" in 11.33.x/11.34.x).
  • Tests: 14 new unit tests across declarative-LDM, AAC round-trip,
    put-declarative-LDM, AI-Lake service, HLL_TYPE setting, and AFM
    APPROXIMATE_COUNT smoke. Suite goes from baseline → 443 passed, 2
    skipped, 0 xfailed.
  • Docs: Hugo pages for set_hll_type / get_hll_type under
    docs/content/en/latest/administration/organization/.

Test plan

  • cd packages/gooddata-sdk && uv run --no-sync pytest tests/ → 443 passed,
    2 skipped (1m suite).
  • uv run --no-sync pytest tests/catalog/unit_tests/test_aac_agg_aware.py -v
    → 4 passed, including the previously xfailed APPROXIMATE_COUNT-of-attribute
    round-trip, now passing naturally on 11.35.0a2.
  • API-client regeneration verified end-to-end against staging OpenAPI
    endpoint; make _api-client-generate green on a clean tree.
  • Pre-commit hooks (ruff, ruff-format, copyright) green on every commit in
    the stack.

JIRA: CQ-2320
risk: low

@jaceksan jaceksan requested review from hkad98, lupko and pcerny as code owners May 7, 2026 11:22
@codecov

codecov Bot commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.95%. Comparing base (efc6b2a) to head (27dca51).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1589      +/-   ##
==========================================
+ Coverage   78.83%   78.95%   +0.12%     
==========================================
  Files         230      231       +1     
  Lines       15486    15573      +87     
==========================================
+ Hits        12208    12296      +88     
+ Misses       3278     3277       -1     


@hkad98
Contributor

hkad98 commented May 7, 2026

Why is there -378k of lines? It does not look right.

@jaceksan
Contributor Author

jaceksan commented May 7, 2026

> Why is there -378k of lines? It does not look right.

It is right. We removed AAC APIs from the backend; they are mastered in gdc-ui and derived WASM, which we use here.

jaceksan added 9 commits May 7, 2026 14:11
Regenerated `gooddata-api-client` from
`https://staging.dev-latest.stg11.panther.intgdc.com` to pick up the new
HLL-related backend surfaces:

- `DeclarativeSetting.type = HLL_TYPE` (org setting `hyperLogLogType`,
  values `Native` / `Presto`)
- `DeclarativeDataset.type = AUXILIARY` with associated constraint docs
- `CreatePipeTableRequest.column_expressions` (enables native HLL via
  `hll_hash()` at PIPE INSERT)
- `GET /api/v1/ailake/object-storages` →
  `AILakeDatabasesApi.list_ai_lake_object_storages`

`Makefile` and a new `scripts/postprocess_api_client.py` harden the
`_api-client-generate` target against two reproducible bugs in
`openapi-generator-cli:v6.6.0` that surfaced on this regen:

1. `recursiveGetDiscriminator` infinite recursion (StackOverflowError) on
   the new `DashboardCompoundConditionItem` ↔ children `oneOf`/`allOf`
   cycle. Worked around with an inline `jq` step that drops the redundant
   `allOf: [{$ref: parent}]` from each child (parent has no own
   properties, so this is semantically a no-op).
2. Generator mangling regex patterns of the form `^[^\x00]*$` — sometimes
   dropping the NUL (leaving the invalid `^[^]*$`), sometimes embedding a
   literal NUL byte that makes the Python source un-importable. The new
   helper script handles both shapes.
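The cycle-break from item 1 can be pictured in Python instead of jq. This is only a minimal sketch under stated assumptions: the `drop_parent_allof` helper, the `Parent` / `ChildA` schema names, and the toy spec are hypothetical, and the actual Makefile step is an inline `jq` filter whose exact expression is not reproduced here.

```python
import copy


def drop_parent_allof(spec: dict, parent: str) -> dict:
    """Return a copy of `spec` in which no schema `allOf`-references `parent`.

    Hypothetical helper: it mirrors the idea of the jq step, not its code.
    """
    parent_ref = f"#/components/schemas/{parent}"
    out = copy.deepcopy(spec)
    for schema in out["components"]["schemas"].values():
        all_of = schema.get("allOf")
        if not all_of:
            continue
        # Keep every allOf entry except the back-reference to the parent.
        kept = [item for item in all_of if item.get("$ref") != parent_ref]
        if kept:
            schema["allOf"] = kept
        else:
            del schema["allOf"]
    return out


# Toy spec: the parent is a pure oneOf union with no properties of its own,
# and the child points back at it through allOf, which forms the cycle.
spec = {
    "components": {
        "schemas": {
            "Parent": {"oneOf": [{"$ref": "#/components/schemas/ChildA"}]},
            "ChildA": {
                "allOf": [{"$ref": "#/components/schemas/Parent"}],
                "type": "object",
                "properties": {"x": {"type": "string"}},
            },
        }
    }
}

fixed = drop_parent_allof(spec, "Parent")
```

Because the parent contributes no properties, removing the back-reference leaves the child's own `type` and `properties` untouched, which is why the commit message calls the change a semantic no-op.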

Also replaced the previous `sed '/\u0000/d'` step (which was a no-op: sed
BRE/ERE doesn't interpret `\uNNNN`) with `tr -d '\000'` so any literal NUL
bytes that jq decoded from `\u0000` escapes in the source spec are
actually stripped.
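The two broken pattern shapes and the NUL stripping could be handled along these lines. This is a sketch, not the contents of `scripts/postprocess_api_client.py`: the function name, the constants, and the choice to rewrite the pattern with an escaped `\x00` are all assumptions made for illustration.

```python
import re

NUL = "\x00"
BROKEN_EMPTY_CLASS = "^[^]*$"      # shape A: the generator dropped the NUL
ESCAPED_PATTERN = r"^[^\x00]*$"    # intended regex, NUL spelled as an escape


def repair_generated_source(text: str) -> str:
    """Hypothetical post-processing pass over one generated source file."""
    # Shape B: a literal NUL byte embedded inside the character class.
    text = text.replace(f"^[^{NUL}]*$", ESCAPED_PATTERN)
    # Shape A: the NUL vanished, leaving an invalid empty character class.
    text = text.replace(BROKEN_EMPTY_CLASS, ESCAPED_PATTERN)
    # Remove any remaining stray NUL bytes, the same effect as `tr -d '\000'`.
    return text.replace(NUL, "")


# The escaped form is a valid regex that rejects NUL and accepts other text.
checker = re.compile(ESCAPED_PATTERN)
assert checker.fullmatch("plain text") is not None
assert checker.fullmatch("bad" + NUL + "text") is None
```

Writing the pattern with an escape keeps the generated file free of raw NUL bytes while preserving the constraint the spec intended.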

Generation override:
`make api-client BASE_URL=https://staging.dev-latest.stg11.panther.intgdc.com`

JIRA: CQ-2320
risk: low
11.35.0a2 carries the WASM round-trip support for aggregate-aware LDMs:
AUXILIARY datasets, pre-aggregation `aggregated_facts`, synthesized
dim datasets backed by `sql:`, and `APPROXIMATE_COUNT` aggregations
whose `assigned_to` resolves to an attribute target (HLL synopses).

risk: low
Adopt the polymorphic source-reference type from gdc-nas CQ-2147
("enable agg fact based on attribute", commit `c5601cce`), which
replaced `FactIdentifier` with `SourceReferenceIdentifier` so that
`aggregatedFacts[].sourceFactReference.reference` can target either a
fact or an attribute. The attribute path is required for HLL
APPROXIMATE_COUNT, whose count target is an attribute on an AUX dataset.

Changes:
- `catalog/identifier.py`: `CatalogFactIdentifier.client_class` now
  returns `SourceReferenceIdentifier`. SDK class name kept for
  back-compat.
- `catalog/workspace/.../dataset.py`:
  `CatalogDeclarativeSourceFactReference.client_class` now returns
  `DeclarativeSourceReference`. SDK class name and the wrapping
  field name (`source_fact_reference`) kept for back-compat — the
  backend itself kept the JSON key `sourceFactReference`.

Knock-on visible to consumers: the new identifier's `type` enum uses
lowercase string values (`"fact"`, `"attribute"`) instead of the
previous uppercase. Not surfaced in release notes — agg-aware datasets
were not exercised by Python-SDK consumers prior to this delivery.
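For consumers who did hardcode the old uppercase values, the casing change could be absorbed as in the sketch below. `SourceRef`, `normalize_ref_type`, and the `{"id", "type"}` payload shape are illustrative stand-ins and assumptions; they are not the SDK or api-client classes.

```python
from dataclasses import dataclass
from typing import Literal

RefType = Literal["fact", "attribute"]


@dataclass(frozen=True)
class SourceRef:
    """Illustrative polymorphic reference: one class, two target kinds."""

    id: str
    type: RefType

    def to_api(self) -> dict:
        # Assumed payload shape, used here for illustration only.
        return {"id": self.id, "type": self.type}


def normalize_ref_type(value: str) -> RefType:
    """Map legacy uppercase values ("FACT") onto the lowercase enum."""
    lowered = value.lower()
    if lowered == "fact":
        return "fact"
    if lowered == "attribute":
        return "attribute"
    raise ValueError(f"unsupported reference type: {value!r}")


# An HLL APPROXIMATE_COUNT target is an attribute, so it takes this path.
hll_target = SourceRef(id="customer_id", type=normalize_ref_type("ATTRIBUTE"))
```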

Verified: `pytest packages/gooddata-sdk/tests/` (excluding cassette-
backed `tests/catalog/store/`) → 413 passed, 2 skipped.

JIRA: CQ-2320
risk: low
Adds first-class support for AUXILIARY datasets — synthetic datasets
that pre-aggregation tables target as HLL/aggregate reference attributes
(the keystone of the aggregate-aware design). The api-client schema
distinguishes NORMAL vs AUXILIARY only by the `type` discriminator, so
the SDK does the same: a single `CatalogDeclarativeDataset` carries an
optional `type: Literal["NORMAL", "AUXILIARY"]` field, and AUX-only
omissions (no `dataSourceTableId`, no `sql`, no `aggregatedFacts`,
no `precedence`, no physical `source_column` on attributes/facts/labels)
are encoded by relaxing those fields to optional. The platform
validator enforces the type-specific exclusions; we don't duplicate
that here.

Knock-on changes folded in:

- `client_class()` uses `builtins.type[DeclarativeDataset]`, mirroring
  `setting.py`/`identifier.py`. The new `type` field on the dataclass
  shadows the builtin `type` inside the class body, which trips ty.
- `gooddata_dbt/dbt/metrics.py` guards `source_column.lower()` calls so
  AUX entries (no physical column) are skipped instead of crashing.
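Both knock-on changes can be shown in miniature. The class and function names below are hypothetical; the sketch only demonstrates the pattern of reaching for `builtins.type` once a field named `type` exists in the class body, and of skipping entries that carry no physical column.

```python
import builtins
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShadowDemo:
    id: str
    # From here on, the bare name `type` inside this class body refers to
    # the field default, not to the builtin.
    type: Optional[str] = None

    @staticmethod
    def client_class() -> builtins.type[dict]:
        # Spelling out `builtins.type` keeps the annotation unambiguous.
        return dict


def lowered_columns(columns: list[Optional[str]]) -> list[str]:
    """Lowercase physical column names, skipping entries that have none."""
    return [column.lower() for column in columns if column is not None]


columns = lowered_columns(["Region", None, "AMOUNT"])
```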

A future TODO documents Option 3 — typed factory constructors
(`CatalogDeclarativeDataset.normal(...)` / `.auxiliary(...)`) for safer
construction without splitting into two classes.
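One way the Option 3 factories might look is sketched below. This is an assumption about a design that the commit only records as a TODO: the `DatasetSketch` class and its simplified string-typed fields are hypothetical and much smaller than `CatalogDeclarativeDataset`.

```python
from dataclasses import dataclass
from typing import Literal, Optional

DatasetType = Literal["NORMAL", "AUXILIARY"]


@dataclass
class DatasetSketch:
    id: str
    type: Optional[DatasetType] = None
    # Simplified to plain values; the real fields are richer objects.
    data_source_table_id: Optional[str] = None
    sql: Optional[str] = None
    precedence: Optional[int] = None

    @classmethod
    def normal(
        cls,
        id: str,
        *,
        data_source_table_id: Optional[str] = None,
        sql: Optional[str] = None,
        precedence: Optional[int] = None,
    ) -> "DatasetSketch":
        return cls(
            id=id,
            type="NORMAL",
            data_source_table_id=data_source_table_id,
            sql=sql,
            precedence=precedence,
        )

    @classmethod
    def auxiliary(cls, id: str) -> "DatasetSketch":
        # The signature offers no physical-mapping arguments, so an
        # AUXILIARY dataset cannot be built with them by accident.
        return cls(id=id, type="AUXILIARY")


aux = DatasetSketch.auxiliary("aux_customer")
orders = DatasetSketch.normal("orders_agg", data_source_table_id="orders_agg_tbl", precedence=1)
```

The factories narrow what callers can pass per dataset kind while the single dataclass still matches the api-client schema, which distinguishes the kinds only by `type`.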

risk: low
Wrap the AI Lake long-running-operation surface with a typed
`CatalogAILakeService` exposing the three methods consumers need to
trigger and wait on `ANALYZE TABLE` after registering aggregate-aware
LDM shapes:

- `analyze_statistics(instance_id, table_names=None, operation_id=None)`
  → `str`: posts the request and returns the operation ID. The caller
  can pre-supply a UUID; otherwise the service generates one and seeds
  it as the request `operation-id` header so the polling handle is
  known up front (the endpoint returns `Unit` body + the id in the
  response header).
- `get_operation(operation_id)` → `CatalogAILakeOperation`: typed
  handle with `id`, `kind`, `status` (Literal "pending"/"succeeded"/
  "failed" — these are the discriminator values of the OpenAPI
  `Operation` oneOf), and optional `result`/`error` payloads. Convenience
  predicates `is_terminal`, `is_succeeded`, `is_failed`.
- `wait_for_operation(operation_id, timeout_s=300, poll_s=2)`: blocks
  until terminal, raises `CatalogAILakeOperationError` on `failed` and
  `TimeoutError` when the deadline elapses.
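The polling contract described for `wait_for_operation` can be sketched without a live stack. Everything below is a stand-in: `OperationSketch`, `OperationFailed`, `wait_for`, and the fake getter are hypothetical names, and the loop is one plausible reading of the described behavior rather than the SDK's implementation.

```python
import time
from dataclasses import dataclass
from typing import Callable, Literal, Optional

OperationStatus = Literal["pending", "succeeded", "failed"]


@dataclass
class OperationSketch:
    id: str
    status: OperationStatus
    error: Optional[str] = None

    @property
    def is_terminal(self) -> bool:
        return self.status in ("succeeded", "failed")


class OperationFailed(RuntimeError):
    """Stand-in for the error raised when an operation ends as failed."""


def wait_for(
    get_operation: Callable[[str], OperationSketch],
    operation_id: str,
    timeout_s: float = 300,
    poll_s: float = 2,
) -> OperationSketch:
    """Poll until the operation is terminal, failed, or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        op = get_operation(operation_id)
        if op.status == "failed":
            raise OperationFailed(op.error or "operation failed")
        if op.is_terminal:
            return op
        if time.monotonic() >= deadline:
            raise TimeoutError(f"operation {operation_id} not finished after {timeout_s}s")
        time.sleep(poll_s)


# Fake getter that reports two pending polls before success.
_states = iter(["pending", "pending", "succeeded"])


def fake_get(operation_id: str) -> OperationSketch:
    return OperationSketch(id=operation_id, status=next(_states))


result = wait_for(fake_get, "op-1", timeout_s=5, poll_s=0)
```

Using `time.monotonic()` for the deadline keeps the timeout immune to wall-clock adjustments during a long wait.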

Wiring:
- `GoodDataApiClient.ai_lake_api` property now exposes `apis.AILakeApi`
  for callers who need raw access to surfaces this service does not
  yet wrap (database provisioning, pipe-table CRUD, service commands —
  follow-up tickets).
- `GoodDataSdk.catalog_ai_lake` property surfaces the new service.
- Public re-exports: `CatalogAILakeOperation`, `CatalogAILakeOperationError`,
  `CatalogAILakeService`.

Tests (8 tests, all unit-mocked — no live stack needed):
- analyze_statistics seeds caller-supplied UUID, generates one when
  omitted, normalizes empty `table_names`
- get_operation handles success and failure shapes
- wait_for_operation polls until succeeded, raises on failed,
  raises TimeoutError when never terminal

Verified: full SDK unit suite — 425 passed, 2 skipped.

JIRA: CQ-2320
risk: low
Add a regression-guard for `CatalogWorkspaceContentService.put_declarative_ldm`
covering aggregate-aware LDM shapes — extends the round-trip
verification to the next link in the chain (the call to
`_layout_api.set_logical_model`).

The historical risk is silent field loss between the SDK class and the
api-client model: an `attrs` default collapsing an explicit `[]` into
"missing", or a missing field in `attribute_map` after a regen.
The three tests pin down what must reach the wire:

- AUXILIARY dataset round-trips with its synthetic identity attribute
  and references to dim datasets, never gaining `precedence`/`sql`/
  `dataSourceTableId` on the way out.
- Pre-aggregation dataset keeps `precedence`, `dataSourceTableId`, and
  both flavors of `aggregatedFacts.sourceFactReference` — SUM-of-fact
  AND APPROXIMATE_COUNT-of-attribute (the HLL load-bearing shape).
- Synthesized dim keeps its `sql.statement` and the attribute's
  `sourceColumn` mapping.
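The "silent field loss" these tests guard against is easy to reproduce in a simplified form. The sketch below uses plain dicts and two hypothetical serializers; it does not involve `attrs` or the api-client models, and only illustrates how an explicit `[]` can be mistaken for a missing value.

```python
from typing import Any


def to_wire_lossy(fields: dict[str, Any]) -> dict[str, Any]:
    # Bug pattern: a truthiness check also drops explicit empty lists.
    return {key: value for key, value in fields.items() if value}


def to_wire_safe(fields: dict[str, Any]) -> dict[str, Any]:
    # Only None means "not set"; an empty list is a real value.
    return {key: value for key, value in fields.items() if value is not None}


# Hypothetical AUXILIARY dataset: empty references, no precedence.
dataset = {
    "id": "aux_customer",
    "type": "AUXILIARY",
    "references": [],
    "precedence": None,
}

lossy = to_wire_lossy(dataset)
safe = to_wire_safe(dataset)
```

A regression test in this spirit asserts on the payload that reaches the wire, so that a change in defaults or a dropped mapping entry after a regen fails loudly.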

The fixture is shared with the prior round-trip test, so the two stay
aligned on what an agg-aware shape looks like.

No SDK code change needed — `put_declarative_ldm` already handles these
shapes correctly thanks to the recent dataset-class additions; this
test ensures it stays that way.

JIRA: CQ-2320
risk: low
YAML ↔ declarative round-trip tests for the three new aggregate-aware
shapes the WASM convertor handles end-to-end on 11.35.0a2:

- AUXILIARY datasets (no physical mapping; synthetic identity attrs).
- NORMAL pre-aggregation datasets with `aggregated_facts` — both the
  vanilla SUM-of-fact path and APPROXIMATE_COUNT-of-attribute (HLL
  synopses targeting the AUX identity attribute, requires
  `reference.type == "attribute"` per gdc-nas CQ-2147).
- NORMAL synthesized dim datasets backed by a `sql:` block.

These guard the SDK side of the AAC convertor pipeline; the heavy
lifting lives in `gooddata-code-convertors`.

risk: low
Add typed `set_hll_type` / `get_hll_type` methods on
`CatalogOrganizationService` for the new `hyperLogLogType` org setting
introduced by gdc-nas (`HLL_TYPE("hyperLogLogType",
SettingConfiguration.HyperLogLogType)`).

The setting controls which HLL function family calcique uses when
generating SQL over HLL synopses:

- `"Native"` (default) emits StarRocks-native `HLL_*` functions —
  works when the platform (or PIPE pipeline) builds synopses
  StarRocks-side.
- `"Presto"` emits Presto-compatible HLL functions, required when
  synopses arrive from an upstream Presto pipeline (the binary layout
  and hash family differ between the two). Requires the StarRocks
  deployment to carry the Presto HLL UDFs.

Surface:
- `HLLType` Literal alias (`"Native" | "Presto"`) re-exported from
  `gooddata_sdk` for consumer annotations.
- `HLL_TYPE_SETTING_ID` (`"hyperLogLogType"`) and
  `HLL_TYPE_SETTING_TYPE` (`"HLL_TYPE"`) constants for callers that
  prefer to drive the generic `CatalogOrganizationSetting.init(...)`
  path directly.
- `service.set_hll_type(value)` is idempotent — tries update first,
  falls back to create when the setting doesn't exist yet.
- `service.get_hll_type()` returns `HLLType | None`; defensively
  returns `None` for unrecognized stored values.
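The update-then-create fallback and the defensive read can be sketched against an in-memory store. `InMemorySettings`, the `{"value": ...}` content shape, and the use of `KeyError` as the "not found" signal are assumptions for illustration; the functions here are free functions, not the `CatalogOrganizationService` methods.

```python
from typing import Literal, Optional, get_args

HLLType = Literal["Native", "Presto"]
HLL_TYPE_SETTING_ID = "hyperLogLogType"


class InMemorySettings:
    """Hypothetical stand-in for the organization-setting API."""

    def __init__(self) -> None:
        self._store: dict[str, dict] = {}

    def update(self, setting_id: str, content: dict) -> None:
        if setting_id not in self._store:
            raise KeyError(setting_id)  # stands in for a "not found" error
        self._store[setting_id] = content

    def create(self, setting_id: str, content: dict) -> None:
        self._store[setting_id] = content

    def get(self, setting_id: str) -> Optional[dict]:
        return self._store.get(setting_id)


def set_hll_type(api: InMemorySettings, value: HLLType) -> None:
    """Idempotent write: try update first, create when the setting is absent."""
    content = {"value": value}
    try:
        api.update(HLL_TYPE_SETTING_ID, content)
    except KeyError:
        api.create(HLL_TYPE_SETTING_ID, content)


def get_hll_type(api: InMemorySettings) -> Optional[HLLType]:
    """Return the stored value, or None when absent or unrecognized."""
    stored = api.get(HLL_TYPE_SETTING_ID)
    value = stored.get("value") if stored else None
    return value if value in get_args(HLLType) else None


api = InMemorySettings()
set_hll_type(api, "Presto")   # setting is absent, so this creates it
set_hll_type(api, "Native")   # setting exists, so this updates it
current = get_hll_type(api)
```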

Tests: 7 unit tests with mocked `entities_api` — no live stack needed.
Cover create-on-missing, update-on-existing, both `"Native"` and
`"Presto"` reads, absent-setting case, and the public `HLLType` alias.

Verified: full SDK unit suite — 438 passed, 2 skipped, 1 xfailed.

JIRA: CQ-2320
risk: low
Public docs for the new surfaces this branch introduces:

- `data/ai-lake/`: new section with `_index.md` and method pages for
  `analyze_statistics`, `get_operation`, and `wait_for_operation`.
  Documents the long-running-operation contract callers see when they
  drive AI Lake actions from the SDK (today: ANALYZE TABLE for
  pre-aggregation tables; the surface will grow as more actions are
  wrapped in typed helpers).
- `administration/organization/set_hll_type.md` and
  `get_hll_type.md`, with the organization `_index.md` sidebar updated
  to link them.
- AFM compute-model smoke tests for `APPROXIMATE_COUNT` aggregation in
  `SimpleMetric`.

risk: low
@hkad98
Contributor

hkad98 commented May 7, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@jaceksan jaceksan merged commit cd91b31 into master May 7, 2026
13 checks passed
@jaceksan jaceksan deleted the jacek/meta branch May 7, 2026 13:07

2 participants