feat(gooddata-sdk): deliver HLL / aggregate-aware LDM surfaces by jaceksan · Pull Request #1589 · gooddata/gooddata-python-sdk

jaceksan · 2026-05-07T11:22:39Z

Summary

Brings the Python SDK in line with the gdc-nas + gdc-ui aggregate-aware LDM
work so HyperLogLog (HLL) workloads round-trip end-to-end.

Highlights:

- api-client: regenerated against the AAC-enabled OpenAPI spec, picking up `SourceReferenceIdentifier` (polymorphic fact|attribute reference) and the new AUXILIARY / pre-aggregation / AI-Lake surfaces. Makefile hardened with NUL-byte stripping, jq cycle-break for the openapi-generator allOf cycles, and a Python post-processor under `scripts/postprocess_api_client.py` so future regenerations "just work".
- catalog/identifier: `CatalogFactIdentifier` now writes `SourceReferenceIdentifier`, accepting FACT or ATTRIBUTE targets; this is the AFM-side counterpart of HLL APPROXIMATE_COUNT aggregated_facts.
- declarative LDM: added a typed `Literal["NORMAL", "AUXILIARY"] | None` `type` field on datasets and relaxed `source_column` optionality on Attribute/Fact/Label so AUXILIARY datasets (which carry no physical mapping) round-trip cleanly. Documented Option 3 (factory constructors) as a future refactoring TODO.
- AI Lake service: new `catalog_ai_lake.analyze_statistics` / `get_operation` / `wait_for_operation` with a typed `CatalogAILakeOperation` handle (`OperationStatus = Literal["pending", "succeeded", "failed"]`).
- Org-level HLL_TYPE setting: `set_hll_type` / `get_hll_type` typed helpers (`HLLType = Literal["Native", "Presto"]`) over the generic organization-setting machinery, with a create-or-update fallback.
- WASM: pinned `gooddata-code-convertors>=11.35.0a2`; that prerelease carries the fix for APPROXIMATE_COUNT-of-attribute targets emitting `reference.type="attribute"` (was `"fact"` in 11.33.x/11.34.x).
- Tests: 14 new unit tests across declarative-LDM, AAC round-trip, put-declarative-LDM, AI-Lake service, HLL_TYPE setting, and AFM APPROXIMATE_COUNT smoke. Suite goes from baseline → 443 passed, 2 skipped, 0 xfailed.
- Docs: Hugo pages for `set_hll_type` / `get_hll_type` under `docs/content/en/latest/administration/organization/`.

Test plan

- `cd packages/gooddata-sdk && uv run --no-sync pytest tests/`: 443 passed, 2 skipped (1m suite).
- `uv run --no-sync pytest tests/catalog/unit_tests/test_aac_agg_aware.py -v`: 4 passed, including the previously-xfailed APPROXIMATE_COUNT-of-attribute round-trip, now passing naturally on 11.35.0a2.
- API-client regeneration verified end-to-end against staging OpenAPI endpoint; `make _api-client-generate` green on a clean tree.
- Pre-commit hooks (ruff, ruff-format, copyright) green on every commit in the stack.

JIRA: CQ-2320
risk: low

codecov · 2026-05-07T11:26:12Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.95%. Comparing base (efc6b2a) to head (27dca51).

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1589      +/-   ##
==========================================
+ Coverage   78.83%   78.95%   +0.12%     
==========================================
  Files         230      231       +1     
  Lines       15486    15573      +87     
==========================================
+ Hits        12208    12296      +88     
+ Misses       3278     3277       -1


hkad98 · 2026-05-07T11:51:00Z

Why is there -378k of lines? It does not look right.

jaceksan · 2026-05-07T12:03:37Z

> Why is there -378k of lines? It does not look right.

It is right. We removed AAC APIs from the backend; they are mastered in gdc-ui and derived WASM, which we use here.

Regenerated `gooddata-api-client` from `https://staging.dev-latest.stg11.panther.intgdc.com` to pick up the new HLL-related backend surfaces:

- `DeclarativeSetting.type = HLL_TYPE` (org setting `hyperLogLogType`, values `Native` / `Presto`)
- `DeclarativeDataset.type = AUXILIARY` with associated constraint docs
- `CreatePipeTableRequest.column_expressions` (enables native HLL via `hll_hash()` at PIPE INSERT)
- `GET /api/v1/ailake/object-storages` → `AILakeDatabasesApi.list_ai_lake_object_storages`

`Makefile` and a new `scripts/postprocess_api_client.py` harden the `_api-client-generate` target against two reproducible bugs in `openapi-generator-cli:v6.6.0` that surfaced on this regen:

1. `recursiveGetDiscriminator` infinite recursion (StackOverflowError) on the new `DashboardCompoundConditionItem` ↔ children `oneOf`/`allOf` cycle. Worked around with an inline `jq` step that drops the redundant `allOf: [{$ref: parent}]` from each child (parent has no own properties, so this is semantically a no-op).
2. Generator mangling regex patterns of the form `^[^\x00]*$`: sometimes dropping the NUL (leaving the invalid `^[^]*$`), sometimes embedding a literal NUL byte that makes the Python source un-importable. The new helper script handles both shapes.

Also replaced the previous `sed '/\u0000/d'` step (which was a no-op, because sed BRE/ERE doesn't interpret `\uNNNN`) with `tr -d '\000'` so any literal NUL bytes that jq decoded from `\u0000` escapes in the source spec are actually stripped.

Generation override: `make api-client BASE_URL=https://staging.dev-latest.stg11.panther.intgdc.com`

JIRA: CQ-2320
risk: low
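The regex repair described in item 2 can be illustrated with a minimal sketch. This is not the contents of `scripts/postprocess_api_client.py`; the function name `repair_generated_source` and the exact replacement strategy are assumptions made for illustration only.

```python
NUL = "\x00"

# Escaped form of the intended pattern: backslash, x, 0, 0 as four plain
# characters, so it is safe to store inside a generated source file.
ESCAPED_PATTERN = r"^[^\x00]*$"


def repair_generated_source(text: str) -> str:
    """Hypothetical helper: restore the NUL-excluding pattern in generated code."""
    # Shape A: a literal NUL byte was embedded inside the character class.
    text = text.replace("^[^" + NUL + "]*$", ESCAPED_PATTERN)
    # Shape B: the NUL was dropped, leaving an empty (invalid) character class.
    text = text.replace("^[^]*$", ESCAPED_PATTERN)
    # Any remaining stray NUL bytes would still break the import, so drop them.
    return text.replace(NUL, "")


if __name__ == "__main__":
    damaged = "'pattern': r'^[^]*$'"
    print(repair_generated_source(damaged))
```

Handling the embedded-NUL shape first matters in this sketch: stripping NUL bytes up front would turn shape A into shape B, which still works here but hides which defect was present.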

11.35.0a2 carries the WASM round-trip support for aggregate-aware LDMs: AUXILIARY datasets, pre-aggregation `aggregated_facts`, synthesized dim datasets backed by `sql:`, and `APPROXIMATE_COUNT` aggregations whose `assigned_to` resolves to an attribute target (HLL synopses).

risk: low

Adopt the polymorphic source-reference type from gdc-nas CQ-2147 ("enable agg fact based on attribute", commit `c5601cce`), which replaced `FactIdentifier` with `SourceReferenceIdentifier` so that `aggregatedFacts[].sourceFactReference.reference` can target either a fact or an attribute. The attribute path is required for HLL APPROXIMATE_COUNT, whose count target is an attribute on an AUX dataset.

Changes:

- `catalog/identifier.py`: `CatalogFactIdentifier.client_class` now returns `SourceReferenceIdentifier`. SDK class name kept for back-compat.
- `catalog/workspace/.../dataset.py`: `CatalogDeclarativeSourceFactReference.client_class` now returns `DeclarativeSourceReference`. SDK class name and the wrapping field name (`source_fact_reference`) kept for back-compat; the backend itself kept the JSON key `sourceFactReference`.

Knock-on visible to consumers: the new identifier's `type` enum uses lowercase string values (`"fact"`, `"attribute"`) instead of the previous uppercase. Not surfaced in release notes, because agg-aware datasets were not exercised by Python-SDK consumers prior to this delivery.

Verified: `pytest packages/gooddata-sdk/tests/` (excluding cassette-backed `tests/catalog/store/`) → 413 passed, 2 skipped.

JIRA: CQ-2320
risk: low
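A rough sketch of the polymorphic reference shape follows. The class `SourceReferenceSketch` and the helper `aggregated_fact` are hypothetical stand-ins, not SDK classes, and the `operation` key is an assumption about the payload; only the `sourceFactReference.reference` path and the lowercase `type` values come from the commit message above.

```python
from dataclasses import dataclass
from typing import Any, Dict, Literal

# Lowercase values, as described in the commit message.
ReferenceType = Literal["fact", "attribute"]


@dataclass(frozen=True)
class SourceReferenceSketch:
    """Hypothetical stand-in for a fact-or-attribute reference."""

    id: str
    type: ReferenceType

    def to_api(self) -> Dict[str, str]:
        return {"id": self.id, "type": self.type}


def aggregated_fact(
    fact_id: str, operation: str, target: SourceReferenceSketch
) -> Dict[str, Any]:
    """Build one aggregatedFacts entry; the `operation` key is an assumption."""
    return {
        "id": fact_id,
        "sourceFactReference": {
            "operation": operation,
            "reference": target.to_api(),
        },
    }


# SUM targets a fact; APPROXIMATE_COUNT (HLL) targets an attribute.
sum_entry = aggregated_fact(
    "revenue_sum", "SUM", SourceReferenceSketch("revenue", "fact")
)
hll_entry = aggregated_fact(
    "customers_hll",
    "APPROXIMATE_COUNT",
    SourceReferenceSketch("customer_id", "attribute"),
)
```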

Adds first-class support for AUXILIARY datasets: synthetic datasets that pre-aggregation tables target as HLL/aggregate reference attributes (the keystone of the aggregate-aware design).

The api-client schema distinguishes NORMAL vs AUXILIARY only by the `type` discriminator, so the SDK does the same: a single `CatalogDeclarativeDataset` carries an optional `type: Literal["NORMAL", "AUXILIARY"]` field, and AUX-only omissions (no `dataSourceTableId`, no `sql`, no `aggregatedFacts`, no `precedence`, no physical `source_column` on attributes/facts/labels) are encoded by relaxing those fields to optional. The platform validator enforces the type-specific exclusions; we don't duplicate that here.

Knock-on changes folded in:

- `client_class()` uses `builtins.type[DeclarativeDataset]`, mirroring `setting.py`/`identifier.py`. The new `type` field on the dataclass shadows the builtin `type` inside the class body, which trips ty.
- `gooddata_dbt/dbt/metrics.py` guards `source_column.lower()` calls so AUX entries (no physical column) are skipped instead of crashing.

A future TODO documents Option 3: typed factory constructors (`CatalogDeclarativeDataset.normal(...)` / `.auxiliary(...)`) for safer construction without splitting into two classes.

risk: low
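The `type` shadowing issue and the relaxed optionality can be shown with a small self-contained sketch. `DatasetSketch` and `DeclarativeDatasetStandIn` are hypothetical names, and the real SDK classes are attrs-based with many more fields; this only illustrates the pattern under those assumptions.

```python
import builtins
from dataclasses import dataclass
from typing import Any, Dict, Literal, Optional


class DeclarativeDatasetStandIn:
    """Hypothetical stand-in for the generated api-client model."""


@dataclass
class DatasetSketch:
    id: str
    # This field name shadows the builtin `type` for the rest of the class body.
    type: Optional[Literal["NORMAL", "AUXILIARY"]] = None
    # Relaxed to optional so AUXILIARY datasets can omit the physical mapping.
    data_source_table_id: Optional[str] = None

    @staticmethod
    def client_class() -> builtins.type[DeclarativeDatasetStandIn]:
        # A bare `type[...]` annotation here would resolve to the field above
        # (whose class-level value is None) rather than to the builtin.
        return DeclarativeDatasetStandIn

    def to_api(self) -> Dict[str, Any]:
        payload = {
            "id": self.id,
            "type": self.type,
            "dataSourceTableId": self.data_source_table_id,
        }
        # Omit unset fields so AUX datasets never gain keys on the way out.
        return {key: value for key, value in payload.items() if value is not None}


aux = DatasetSketch(id="customer_aux", type="AUXILIARY")
normal = DatasetSketch(id="orders", type="NORMAL", data_source_table_id="orders_tbl")
```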

Wrap the AI Lake long-running-operation surface with a typed `CatalogAILakeService` exposing the three methods consumers need to trigger and wait on `ANALYZE TABLE` after registering aggregate-aware LDM shapes:

- `analyze_statistics(instance_id, table_names=None, operation_id=None)` → `str`: posts the request and returns the operation ID. The caller can pre-supply a UUID; otherwise the service generates one and seeds it as the request `operation-id` header so the polling handle is known up front (the endpoint returns `Unit` body + the id in the response header).
- `get_operation(operation_id)` → `CatalogAILakeOperation`: typed handle with `id`, `kind`, `status` (Literal "pending"/"succeeded"/"failed"; these are the discriminator values of the OpenAPI `Operation` oneOf), and optional `result`/`error` payloads. Convenience predicates `is_terminal`, `is_succeeded`, `is_failed`.
- `wait_for_operation(operation_id, timeout_s=300, poll_s=2)`: blocks until terminal, raises `CatalogAILakeOperationError` on `failed` and `TimeoutError` when the deadline elapses.

Wiring:

- `GoodDataApiClient.ai_lake_api` property now exposes `apis.AILakeApi` for callers who need raw access to surfaces this service does not yet wrap (database provisioning, pipe-table CRUD, service commands; follow-up tickets).
- `GoodDataSdk.catalog_ai_lake` property surfaces the new service.
- Public re-exports: `CatalogAILakeOperation`, `CatalogAILakeOperationError`, `CatalogAILakeService`.

Tests (8 tests, all unit-mocked, no live stack needed):

- analyze_statistics seeds caller-supplied UUID, generates one when omitted, normalizes empty `table_names`
- get_operation handles success and failure shapes
- wait_for_operation polls until succeeded, raises on failed, raises TimeoutError when never terminal

Verified: full SDK unit suite, 425 passed, 2 skipped.

JIRA: CQ-2320
risk: low
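The polling contract described for `wait_for_operation` can be sketched as a plain loop. This is an illustration under stated assumptions, not the SDK implementation: `OperationSketch` and `OperationFailed` are hypothetical names, and the getter is injected as a callable so the example runs without a live stack.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Literal, Optional

OperationStatus = Literal["pending", "succeeded", "failed"]


@dataclass
class OperationSketch:
    """Hypothetical stand-in for the typed operation handle."""

    id: str
    status: OperationStatus
    error: Optional[Any] = None

    @property
    def is_terminal(self) -> bool:
        return self.status in ("succeeded", "failed")


class OperationFailed(RuntimeError):
    """Hypothetical error raised when the operation ends in `failed`."""


def wait_for_operation(
    get_operation: Callable[[str], OperationSketch],
    operation_id: str,
    timeout_s: float = 300,
    poll_s: float = 2,
) -> OperationSketch:
    """Poll until the operation is terminal or the deadline elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        operation = get_operation(operation_id)
        if operation.status == "failed":
            raise OperationFailed(f"operation {operation_id} failed: {operation.error}")
        if operation.status == "succeeded":
            return operation
        if time.monotonic() >= deadline:
            raise TimeoutError(f"operation {operation_id} still pending after {timeout_s}s")
        time.sleep(poll_s)
```

Using `time.monotonic()` for the deadline keeps the timeout immune to wall-clock adjustments, and checking the deadline only after a poll guarantees at least one status read even with a zero timeout.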

Add a regression-guard for `CatalogWorkspaceContentService.put_declarative_ldm` covering aggregate-aware LDM shapes; it extends the round-trip verification to the next link in the chain (the call to `_layout_api.set_logical_model`).

The historical risk is silent field loss between the SDK class and the api-client model: an `attrs` default collapsing an explicit `[]` into "missing", or a missing field in `attribute_map` after a regen. The three tests pin down what must reach the wire:

- AUXILIARY dataset round-trips with its synthetic identity attribute and references to dim datasets, never gaining `precedence`/`sql`/`dataSourceTableId` on the way out.
- Pre-aggregation dataset keeps `precedence`, `dataSourceTableId`, and both flavors of `aggregatedFacts.sourceFactReference`: SUM-of-fact AND APPROXIMATE_COUNT-of-attribute (the HLL load-bearing shape).
- Synthesized dim keeps its `sql.statement` and the attribute's `sourceColumn` mapping.

The fixture is shared with the prior round-trip test, so the two stay aligned on what an agg-aware shape looks like.

No SDK code change needed: `put_declarative_ldm` already handles these shapes correctly thanks to the recent dataset-class additions; this test ensures it stays that way.

JIRA: CQ-2320
risk: low

YAML ↔ declarative round-trip tests for the three new aggregate-aware shapes the WASM convertor handles end-to-end on 11.35.0a2:

- AUXILIARY datasets (no physical mapping; synthetic identity attrs).
- NORMAL pre-aggregation datasets with `aggregated_facts`: both the vanilla SUM-of-fact path and APPROXIMATE_COUNT-of-attribute (HLL synopses targeting the AUX identity attribute, requires `reference.type == "attribute"` per gdc-nas CQ-2147).
- NORMAL synthesized dim datasets backed by a `sql:` block.

These guard the SDK side of the AAC convertor pipeline; the heavy lifting lives in `gooddata-code-convertors`.

risk: low

Add typed `set_hll_type` / `get_hll_type` methods on `CatalogOrganizationService` for the new `hyperLogLogType` org setting introduced by gdc-nas (`HLL_TYPE("hyperLogLogType", SettingConfiguration.HyperLogLogType)`).

The setting controls which HLL function family calcique uses when generating SQL over HLL synopses:

- `"Native"` (default) emits StarRocks-native `HLL_*` functions; works when the platform (or PIPE pipeline) builds synopses StarRocks-side.
- `"Presto"` emits Presto-compatible HLL functions, required when synopses arrive from an upstream Presto pipeline (the binary layout and hash family differ between the two). Requires the StarRocks deployment to carry the Presto HLL UDFs.

Surface:

- `HLLType` Literal alias (`"Native" | "Presto"`) re-exported from `gooddata_sdk` for consumer annotations.
- `HLL_TYPE_SETTING_ID` (`"hyperLogLogType"`) and `HLL_TYPE_SETTING_TYPE` (`"HLL_TYPE"`) constants for callers that prefer to drive the generic `CatalogOrganizationSetting.init(...)` path directly.
- `service.set_hll_type(value)` is idempotent: it tries update first and falls back to create when the setting doesn't exist yet.
- `service.get_hll_type()` returns `HLLType | None`; defensively returns `None` for unrecognized stored values.

Tests: 7 unit tests with mocked `entities_api`, no live stack needed. Cover create-on-missing, update-on-existing, both `"Native"` and `"Presto"` reads, absent-setting case, and the public `HLLType` alias.

Verified: full SDK unit suite, 438 passed, 2 skipped, 1 xfailed.

JIRA: CQ-2320
risk: low
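The update-then-create fallback and the defensive read can be sketched against an in-memory store. The store class, the `SettingNotFound` exception, and the `{"value": ...}` content shape are assumptions for illustration; the real helpers go through the generic organization-setting machinery.

```python
from typing import Any, Dict, Literal, Optional, get_args

HLLType = Literal["Native", "Presto"]
HLL_TYPE_SETTING_ID = "hyperLogLogType"


class SettingNotFound(KeyError):
    """Hypothetical error raised when updating a setting that does not exist."""


class InMemorySettings:
    """Hypothetical stand-in for the organization-setting backend."""

    def __init__(self) -> None:
        self._items: Dict[str, Dict[str, Any]] = {}

    def update(self, setting_id: str, content: Dict[str, Any]) -> None:
        if setting_id not in self._items:
            raise SettingNotFound(setting_id)
        self._items[setting_id] = content

    def create(self, setting_id: str, content: Dict[str, Any]) -> None:
        self._items[setting_id] = content

    def get(self, setting_id: str) -> Optional[Dict[str, Any]]:
        return self._items.get(setting_id)


def set_hll_type(store: InMemorySettings, value: HLLType) -> None:
    """Idempotent write: try update first, fall back to create when missing."""
    content = {"value": value}  # content shape is an assumption
    try:
        store.update(HLL_TYPE_SETTING_ID, content)
    except SettingNotFound:
        store.create(HLL_TYPE_SETTING_ID, content)


def get_hll_type(store: InMemorySettings) -> Optional[HLLType]:
    """Return the stored value, or None when absent or unrecognized."""
    content = store.get(HLL_TYPE_SETTING_ID)
    value = content.get("value") if content else None
    return value if value in get_args(HLLType) else None
```

Checking membership against `get_args(HLLType)` keeps the accepted values in one place, so the read path stays in sync with the Literal alias if a new HLL family is added later.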

Public docs for the new surfaces this branch introduces:

- `data/ai-lake/`: new section with `_index.md` and method pages for `analyze_statistics`, `get_operation`, and `wait_for_operation`. Documents the long-running-operation contract callers see when they drive AI Lake actions from the SDK (today: ANALYZE TABLE for pre-aggregation tables; the surface will grow as more actions are wrapped in typed helpers).
- `administration/organization/set_hll_type.md` and `get_hll_type.md`, with the organization `_index.md` sidebar updated to link them.
- AFM compute-model smoke tests for `APPROXIMATE_COUNT` aggregation in `SimpleMetric`.

risk: low

hkad98 · 2026-05-07T13:05:11Z

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code


jaceksan requested review from hkad98, lupko and pcerny as code owners May 7, 2026 11:22

jaceksan added 9 commits May 7, 2026 14:11

jaceksan force-pushed the jacek/meta branch from 47825a5 to 27dca51 Compare May 7, 2026 12:13

jaceksan enabled auto-merge May 7, 2026 12:25

hkad98 approved these changes May 7, 2026

View reviewed changes

jaceksan merged commit cd91b31 into master May 7, 2026
13 checks passed

jaceksan deleted the jacek/meta branch May 7, 2026 13:07
