feat(gooddata-sdk): deliver HLL / aggregate-aware LDM surfaces#1589
Merged
Codecov Report: ✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #1589 +/- ##
==========================================
+ Coverage 78.83% 78.95% +0.12%
==========================================
Files 230 231 +1
Lines 15486 15573 +87
==========================================
+ Hits 12208 12296 +88
+ Misses 3278 3277 -1
Contributor:
> Why are there -378k lines? It does not look right.

Contributor (Author):
> It is right. We removed the AAC APIs from the backend; they are mastered in gdc-ui and the derived WASM, which we use here.
Regenerated `gooddata-api-client` from `https://staging.dev-latest.stg11.panther.intgdc.com` to pick up the new HLL-related backend surfaces:
- `DeclarativeSetting.type = HLL_TYPE` (org setting `hyperLogLogType`, values `Native` / `Presto`)
- `DeclarativeDataset.type = AUXILIARY` with associated constraint docs
- `CreatePipeTableRequest.column_expressions` (enables native HLL via `hll_hash()` at PIPE INSERT)
- `GET /api/v1/ailake/object-storages` → `AILakeDatabasesApi.list_ai_lake_object_storages`

`Makefile` and a new `scripts/postprocess_api_client.py` harden the `_api-client-generate` target against two reproducible bugs in `openapi-generator-cli:v6.6.0` that surfaced on this regen:
1. `recursiveGetDiscriminator` infinite recursion (StackOverflowError) on the new `DashboardCompoundConditionItem` ↔ children `oneOf`/`allOf` cycle. Worked around with an inline `jq` step that drops the redundant `allOf: [{$ref: parent}]` from each child (the parent has no properties of its own, so this is semantically a no-op).
2. The generator mangling regex patterns of the form `^[^\x00]*$` — sometimes dropping the NUL (leaving the invalid `^[^]*$`), sometimes embedding a literal NUL byte that makes the Python source un-importable. The new helper script handles both shapes.

Also replaced the previous `sed '/\u0000/d'` step (which was a no-op — sed BRE/ERE doesn't interpret `\uNNNN`) with `tr -d '\000'`, so any literal NUL bytes that jq decoded from `\u0000` escapes in the source spec are actually stripped.

Generation override: `make api-client BASE_URL=https://staging.dev-latest.stg11.panther.intgdc.com`
JIRA: CQ-2320
risk: low
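The regex repair the helper script performs can be sketched in plain Python. This is an illustrative stand-in, not the actual `scripts/postprocess_api_client.py` — the function name and exact replacement strategy are assumptions; only the two broken shapes come from the commit message above:

```python
def repair_nul_patterns(source: str) -> str:
    """Repair ^[^\\x00]*$-style patterns mangled by the generator.

    Two broken shapes appear in the generated Python sources:
    1. the NUL escape dropped entirely, leaving an invalid empty
       character class: ^[^]*$
    2. a literal NUL byte embedded where the \\x00 escape should be,
       which makes the module un-importable.
    """
    # Shape 1: put the \x00 escape back into the emptied character class.
    source = source.replace(r"^[^]*$", r"^[^\x00]*$")
    # Shape 2: re-escape any literal NUL bytes left in the text.
    source = source.replace("\x00", r"\x00")
    return source
```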
11.35.0a2 carries the WASM round-trip support for aggregate-aware LDMs: AUXILIARY datasets, pre-aggregation `aggregated_facts`, synthesized dim datasets backed by `sql:`, and `APPROXIMATE_COUNT` aggregations whose `assigned_to` resolves to an attribute target (HLL synopses). risk: low
Adopt the polymorphic source-reference type from gdc-nas CQ-2147
("enable agg fact based on attribute", commit `c5601cce`), which
replaced `FactIdentifier` with `SourceReferenceIdentifier` so that
`aggregatedFacts[].sourceFactReference.reference` can target either a
fact or an attribute. The attribute path is required for HLL
APPROXIMATE_COUNT, whose count target is an attribute on an AUX dataset.
Changes:
- `catalog/identifier.py`: `CatalogFactIdentifier.client_class` now
returns `SourceReferenceIdentifier`. SDK class name kept for
back-compat.
- `catalog/workspace/.../dataset.py`:
`CatalogDeclarativeSourceFactReference.client_class` now returns
`DeclarativeSourceReference`. SDK class name and the wrapping
field name (`source_fact_reference`) kept for back-compat — the
backend itself kept the JSON key `sourceFactReference`.
Knock-on visible to consumers: the new identifier's `type` enum uses
lowercase string values (`"fact"`, `"attribute"`) instead of the
previous uppercase. Not surfaced in release notes — agg-aware datasets
were not exercised by Python-SDK consumers prior to this delivery.
Verified: `pytest packages/gooddata-sdk/tests/` (excluding cassette-
backed `tests/catalog/store/`) → 413 passed, 2 skipped.
JIRA: CQ-2320
risk: low
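For illustration, the two reference shapes that can now appear under `aggregatedFacts` look roughly like this. The dataset and field ids are invented; only the `sourceFactReference.reference` nesting and the lowercase `type` values come from the commit above:

```python
# Hypothetical aggregated-fact entries; ids are made up for illustration.
sum_of_fact = {
    "id": "sum_amount",
    "aggregation": "SUM",
    "sourceFactReference": {
        # Classic path: the source reference targets a fact.
        "reference": {"id": "order_amount", "type": "fact"},
    },
}

approx_count_of_attribute = {
    "id": "hll_customers",
    "aggregation": "APPROXIMATE_COUNT",
    "sourceFactReference": {
        # HLL path: the count target is an attribute on an AUX dataset.
        "reference": {"id": "customer_id", "type": "attribute"},
    },
}
```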
Adds first-class support for AUXILIARY datasets — synthetic datasets that pre-aggregation tables target as HLL/aggregate reference attributes (the keystone of the aggregate-aware design).

The api-client schema distinguishes NORMAL vs AUXILIARY only by the `type` discriminator, so the SDK does the same: a single `CatalogDeclarativeDataset` carries an optional `type: Literal["NORMAL", "AUXILIARY"]` field, and AUX-only omissions (no `dataSourceTableId`, no `sql`, no `aggregatedFacts`, no `precedence`, no physical `source_column` on attributes/facts/labels) are encoded by relaxing those fields to optional. The platform validator enforces the type-specific exclusions; we don't duplicate that here.

Knock-on changes folded in:
- `client_class()` uses `builtins.type[DeclarativeDataset]`, mirroring `setting.py`/`identifier.py`. The new `type` field on the dataclass shadows the builtin `type` inside the class body, which trips ty.
- `gooddata_dbt/dbt/metrics.py` guards `source_column.lower()` calls so AUX entries (no physical column) are skipped instead of crashing.

A future TODO documents Option 3 — typed factory constructors (`CatalogDeclarativeDataset.normal(...)` / `.auxiliary(...)`) for safer construction without splitting into two classes.
risk: low
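The "Option 3" factory-constructor idea flagged in the TODO could look something like this simplified sketch — a stand-in dataclass, not the real `CatalogDeclarativeDataset`, with only a few fields kept for illustration:

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class Dataset:
    """Simplified stand-in for CatalogDeclarativeDataset."""

    id: str
    # The `type` field shadows the builtin `type` inside the class body,
    # which is why client_class() must spell builtins.type explicitly.
    type: Optional[Literal["NORMAL", "AUXILIARY"]] = None
    data_source_table_id: Optional[str] = None
    precedence: Optional[int] = None

    @classmethod
    def normal(cls, id: str, data_source_table_id: str) -> "Dataset":
        # NORMAL datasets carry a physical mapping.
        return cls(id=id, type="NORMAL", data_source_table_id=data_source_table_id)

    @classmethod
    def auxiliary(cls, id: str) -> "Dataset":
        # AUXILIARY datasets omit all physical-mapping fields.
        return cls(id=id, type="AUXILIARY")
```

The factories make the invariants impossible to violate at construction time while keeping a single class on the wire-format side.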
Wrap the AI Lake long-running-operation surface with a typed `CatalogAILakeService` exposing the three methods consumers need to trigger and wait on `ANALYZE TABLE` after registering aggregate-aware LDM shapes:
- `analyze_statistics(instance_id, table_names=None, operation_id=None)` → `str`: posts the request and returns the operation ID. The caller can pre-supply a UUID; otherwise the service generates one and seeds it as the request `operation-id` header so the polling handle is known up front (the endpoint returns a `Unit` body + the id in the response header).
- `get_operation(operation_id)` → `CatalogAILakeOperation`: typed handle with `id`, `kind`, `status` (Literal `"pending"`/`"succeeded"`/`"failed"` — the discriminator values of the OpenAPI `Operation` oneOf), and optional `result`/`error` payloads. Convenience predicates `is_terminal`, `is_succeeded`, `is_failed`.
- `wait_for_operation(operation_id, timeout_s=300, poll_s=2)`: blocks until terminal, raises `CatalogAILakeOperationError` on `failed` and `TimeoutError` when the deadline elapses.

Wiring:
- `GoodDataApiClient.ai_lake_api` property now exposes `apis.AILakeApi` for callers who need raw access to surfaces this service does not yet wrap (database provisioning, pipe-table CRUD, service commands — follow-up tickets).
- `GoodDataSdk.catalog_ai_lake` property surfaces the new service.
- Public re-exports: `CatalogAILakeOperation`, `CatalogAILakeOperationError`, `CatalogAILakeService`.

Tests (8 tests, all unit-mocked — no live stack needed):
- `analyze_statistics` seeds a caller-supplied UUID, generates one when omitted, normalizes empty `table_names`
- `get_operation` handles success and failure shapes
- `wait_for_operation` polls until succeeded, raises on failed, raises `TimeoutError` when never terminal

Verified: full SDK unit suite — 425 passed, 2 skipped.
JIRA: CQ-2320
risk: low
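The `wait_for_operation` contract described above amounts to a bounded polling loop. A minimal dependency-free sketch — the real service polls `get_operation` on the API client; here a status callable is injected instead, and the exception class is a stand-in for `CatalogAILakeOperationError`:

```python
import time
from typing import Callable


class OperationFailedError(RuntimeError):
    """Stand-in for CatalogAILakeOperationError."""


def wait_for_operation(
    get_status: Callable[[], str],
    timeout_s: float = 300.0,
    poll_s: float = 2.0,
) -> str:
    """Block until the operation reaches a terminal status.

    Raises OperationFailedError on "failed" and TimeoutError when the
    deadline elapses without a terminal status.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == "succeeded":
            return status
        if status == "failed":
            raise OperationFailedError("AI Lake operation failed")
        if time.monotonic() >= deadline:
            raise TimeoutError("operation did not reach a terminal status")
        time.sleep(poll_s)
```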
Add a regression guard for `CatalogWorkspaceContentService.put_declarative_ldm` covering aggregate-aware LDM shapes — extends the round-trip verification to the next link in the chain (the call to `_layout_api.set_logical_model`).

The historical risk is silent field loss between the SDK class and the api-client model: an `attrs` default collapsing an explicit `[]` into "missing", or a missing field in `attribute_map` after a regen. The three tests pin down what must reach the wire:
- AUXILIARY dataset round-trips with its synthetic identity attribute and references to dim datasets, never gaining `precedence`/`sql`/`dataSourceTableId` on the way out.
- Pre-aggregation dataset keeps `precedence`, `dataSourceTableId`, and both flavors of `aggregatedFacts.sourceFactReference` — SUM-of-fact AND APPROXIMATE_COUNT-of-attribute (the HLL load-bearing shape).
- Synthesized dim keeps its `sql.statement` and the attribute's `sourceColumn` mapping.

The fixture is shared with the prior round-trip test, so the two stay aligned on what an agg-aware shape looks like. No SDK code change needed — `put_declarative_ldm` already handles these shapes correctly thanks to the recent dataset-class additions; this test ensures it stays that way.
JIRA: CQ-2320
risk: low
YAML ↔ declarative round-trip tests for the three new aggregate-aware shapes the WASM convertor handles end-to-end on 11.35.0a2:
- AUXILIARY datasets (no physical mapping; synthetic identity attrs).
- NORMAL pre-aggregation datasets with `aggregated_facts` — both the vanilla SUM-of-fact path and APPROXIMATE_COUNT-of-attribute (HLL synopses targeting the AUX identity attribute; requires `reference.type == "attribute"` per gdc-nas CQ-2147).
- NORMAL synthesized dim datasets backed by a `sql:` block.

These guard the SDK side of the AAC convertor pipeline; the heavy lifting lives in `gooddata-code-convertors`.
risk: low
Add typed `set_hll_type` / `get_hll_type` methods on
`CatalogOrganizationService` for the new `hyperLogLogType` org setting
introduced by gdc-nas (`HLL_TYPE("hyperLogLogType",
SettingConfiguration.HyperLogLogType)`).
The setting controls which HLL function family calcique uses when
generating SQL over HLL synopses:
- `"Native"` (default) emits StarRocks-native `HLL_*` functions —
works when the platform (or PIPE pipeline) builds synopses
StarRocks-side.
- `"Presto"` emits Presto-compatible HLL functions, required when
synopses arrive from an upstream Presto pipeline (the binary layout
and hash family differ between the two). Requires the StarRocks
deployment to carry the Presto HLL UDFs.
Surface:
- `HLLType` Literal alias (`"Native" | "Presto"`) re-exported from
`gooddata_sdk` for consumer annotations.
- `HLL_TYPE_SETTING_ID` (`"hyperLogLogType"`) and
`HLL_TYPE_SETTING_TYPE` (`"HLL_TYPE"`) constants for callers that
prefer to drive the generic `CatalogOrganizationSetting.init(...)`
path directly.
- `service.set_hll_type(value)` is idempotent — tries update first,
falls back to create when the setting doesn't exist yet.
- `service.get_hll_type()` returns `HLLType | None`; defensively
returns `None` for unrecognized stored values.
Tests: 7 unit tests with mocked `entities_api` — no live stack needed.
Cover create-on-missing, update-on-existing, both `"Native"` and
`"Presto"` reads, absent-setting case, and the public `HLLType` alias.
Verified: full SDK unit suite — 438 passed, 2 skipped, 1 xfailed.
JIRA: CQ-2320
risk: low
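The create-or-update and defensive-read behavior can be illustrated with a small in-memory stand-in. The real service goes through the mocked `entities_api`; the class and its dict storage here are invented, while the setting id, type constant, and `HLLType` values come from the commit above:

```python
from typing import Literal, Optional

HLLType = Literal["Native", "Presto"]
HLL_TYPE_SETTING_ID = "hyperLogLogType"
HLL_TYPE_SETTING_TYPE = "HLL_TYPE"


class HLLSettingStore:
    """In-memory stand-in for the organization-settings API."""

    def __init__(self) -> None:
        self._settings: dict = {}

    def set_hll_type(self, value: HLLType) -> None:
        # Idempotent: a plain upsert stands in for the real service's
        # "try update, fall back to create" sequence.
        self._settings[HLL_TYPE_SETTING_ID] = value

    def get_hll_type(self) -> Optional[HLLType]:
        stored = self._settings.get(HLL_TYPE_SETTING_ID)
        # Defensive read: absent or unrecognized values come back as None.
        return stored if stored in ("Native", "Presto") else None
```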
Public docs for the new surfaces this branch introduces:
- `data/ai-lake/`: new section with `_index.md` and method pages for `analyze_statistics`, `get_operation`, and `wait_for_operation`. Documents the long-running-operation contract callers see when they drive AI Lake actions from the SDK (today: ANALYZE TABLE for pre-aggregation tables; the surface will grow as more actions are wrapped in typed helpers).
- `administration/organization/set_hll_type.md` and `get_hll_type.md`, with the organization `_index.md` sidebar updated to link them.
- AFM compute-model smoke tests for `APPROXIMATE_COUNT` aggregation in `SimpleMetric`.
risk: low
Contributor:
Code review: No issues found. Checked for bugs and CLAUDE.md compliance.
hkad98
approved these changes
May 7, 2026
Summary
Brings the Python SDK in line with the gdc-nas + gdc-ui aggregate-aware LDM
work so HyperLogLog (HLL) workloads round-trip end-to-end.
Highlights:
- Regenerated api-client picks up `SourceReferenceIdentifier` (polymorphic `fact`|`attribute` reference) and the new AUXILIARY / pre-aggregation / AI-Lake surfaces. Makefile hardened with NUL-byte stripping, a jq cycle-break for the openapi-generator `allOf` cycles, and a Python post-processor under `scripts/postprocess_api_client.py` so future regenerations "just work".
- `CatalogFactIdentifier` now writes `SourceReferenceIdentifier`, accepting `FACT` or `ATTRIBUTE` targets — the AFM-side counterpart of HLL `APPROXIMATE_COUNT` `aggregated_facts`.
- New `Literal["NORMAL", "AUXILIARY"] | None` `type` field on datasets and relaxed `source_column` optionality on `Attribute`/`Fact`/`Label`, so AUXILIARY datasets (which carry no physical mapping) round-trip cleanly. Documented Option 3 (factory constructors) as a future refactoring TODO.
- `catalog_ai_lake.analyze_statistics`/`get_operation`/`wait_for_operation` with a typed `CatalogAILakeOperation` handle (`OperationStatus = Literal["pending", "succeeded", "failed"]`).
- `set_hll_type`/`get_hll_type` typed helpers (`HLLType = Literal["Native", "Presto"]`) over the generic organization-setting machinery, with a create-or-update fallback.
- `gooddata-code-convertors>=11.35.0a2` — that prerelease carries the fix for APPROXIMATE_COUNT-of-attribute targets emitting `reference.type="attribute"` (was `"fact"` in 11.33.x/11.34.x).
- New tests: put-declarative-LDM, AI-Lake service, HLL_TYPE setting, and AFM APPROXIMATE_COUNT smoke. Suite goes from baseline → 443 passed, 2 skipped, 0 xfailed.
- Docs for `set_hll_type`/`get_hll_type` under `docs/content/en/latest/administration/organization/`.
Test plan
- `cd packages/gooddata-sdk && uv run --no-sync pytest tests/` — 443 passed, 2 skipped (1m suite).
- `uv run --no-sync pytest tests/catalog/unit_tests/test_aac_agg_aware.py -v` — 4 passed, including the previously-xfailed APPROXIMATE_COUNT-of-attribute round-trip, now passing naturally on 11.35.0a2.
- `make _api-client-generate` green on a clean tree.
JIRA: CQ-2320
risk: low