feat: introduce Result Service using Lakekeeper as REST catalog for Iceberg - catalog migration #4272
mengw15 wants to merge 14 commits into apache:main
Signed-off-by: Meng Wang <125719918+mengw15@users.noreply.github.com>
bobbai00 left a comment
Left some comments.
I closed your original PR. Please describe the milestones (your PR's plan) in the issue and update the description of the current PR.
    cached_property==1.5.2
    psutil==5.9.0
    tzlocal==2.1
    s3fs==2025.9.0
The latest version is 2026.2.0. Can you try using the latest version?
These three libraries have version compatibility constraints, and they also need to stay compatible with boto3. If we try to update them, some other libraries may also need to be updated accordingly.
    tzlocal==2.1
    s3fs==2025.9.0
    aiobotocore==2.25.1
    botocore==1.40.53
Ditto for these two libraries.
These three libraries have version compatibility constraints, and they also need to stay compatible with boto3. If we try to update them, some other libraries may also need to be updated accordingly.
    if (buffer.nonEmpty) {
      // Create a unique file path using the writer's identifier and the filename index
      val filepath = Paths.get(table.location()).resolve(s"${writerIdentifier}_${filenameIdx}")
      // Handle S3 URIs (s3://) differently from local file paths to preserve URI format
This logic is very ad hoc. Can you avoid the `if` condition on the file path's prefix?
Try to use one universal logic for the file path.
Simplified to use string concatenation for all URI schemes. Would suggest testing on Windows as well, since we've had path-related issues on Windows before.
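A minimal sketch of what such a scheme-agnostic join could look like, assuming the table location is already a valid URI or path string. The object and method names (`DataFilePath`, `resolve`) are hypothetical and not taken from the PR; only `writerIdentifier` and `filenameIdx` come from the snippet above. The motivation is that `java.nio.file.Paths` interprets its input as a file-system path, so it may normalize or reject the `//` in `s3://`, whereas plain concatenation leaves the scheme untouched.

```scala
object DataFilePath {
  // Join a table location and a file name with plain string concatenation.
  // The location is never parsed as a file-system path, so any URI scheme
  // (s3://, file://, or a bare local path) passes through unchanged.
  def resolve(tableLocation: String, fileName: String): String = {
    // Drop trailing slashes so the result has exactly one separator.
    val base = tableLocation.replaceAll("/+$", "")
    s"$base/$fileName"
  }
}
```

In the writer this might be called as `DataFilePath.resolve(table.location(), s"${writerIdentifier}_${filenameIdx}")`. The sketch always uses `/` as the separator, which is an assumption that should be checked against local Windows locations.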
      TableProperties.COMMIT_MIN_RETRY_WAIT_MS -> StorageConfig.icebergTableCommitMinRetryWaitMs.toString
    )

    val namespace = Namespace.of(tableNamespace)
What is the purpose of this check?
This ensures the namespace exists before creating a table. REST catalogs (like Lakekeeper) require the namespace to be explicitly created first, unlike the Postgres JDBC catalog which auto-creates it.
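The check described above could be sketched roughly as follows. This is an illustration rather than the PR's actual code: `NamespaceSupport` and `ensureNamespace` are hypothetical names, while `Catalog`, `Namespace`, `SupportsNamespaces`, and `AlreadyExistsException` are Iceberg's own types. It assumes `iceberg-core` is on the classpath.

```scala
import org.apache.iceberg.catalog.{Catalog, Namespace, SupportsNamespaces}
import org.apache.iceberg.exceptions.AlreadyExistsException

object NamespaceSupport {
  // Create the namespace if the catalog supports namespaces and it is missing.
  def ensureNamespace(catalog: Catalog, name: String): Unit =
    catalog match {
      case ns: SupportsNamespaces =>
        val namespace = Namespace.of(name)
        if (!ns.namespaceExists(namespace)) {
          try ns.createNamespace(namespace)
          catch {
            // Another writer may have created it between the check and the call.
            case _: AlreadyExistsException => ()
          }
        }
      case _ =>
        // Assumption: catalogs without namespace support need no preparation.
        ()
    }
}
```

Calling `NamespaceSupport.ensureNamespace(catalog, tableNamespace)` before `createTable` would make table creation behave the same way for both catalog types. Catching `AlreadyExistsException` keeps the helper safe when several writers start at once.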
What changes were proposed in this PR?
This is PR 1 of a decomposed series from #4242, focusing on the core Iceberg catalog migration to support Lakekeeper as a REST catalog.
Scala changes:
- `IcebergUtil.scala`: added `createRestCatalog()` for REST catalog connections with S3FileIO (MinIO), and namespace auto-creation for all catalog types
- `IcebergCatalogInstance.scala`: updated singleton to support REST catalog type selection
- `IcebergTableWriter.scala`: updated for REST catalog compatibility
- `StorageConfig.scala` / `EnvironmentalVariable.scala`: added REST catalog configuration (URI, warehouse name, region, S3 bucket) and environment variable support
- `storage.conf`: added REST catalog config section (default remains `postgres` for backward compatibility)
- `build.sbt`: added `iceberg-aws`, AWS SDK dependencies, and Netty version override for Arrow compatibility
- `PythonWorkflowWorker.scala` / `ComputingUnitManagingResource.scala`: propagate REST catalog config to Python workers and computing units
Python changes:
- `iceberg_catalog_instance.py` / `iceberg_utils.py`: added REST catalog support via PyIceberg
- `storage_config.py`: added REST catalog configuration parsing
- `texera_run_python_worker.py`: accept REST catalog config from Scala side
- `requirements.txt`: upgraded PyIceberg (0.8.1 → 0.9.0), added s3fs/aiobotocore for S3 access
Database:
- `texera_lakekeeper.sql`: schema for Lakekeeper's backing database
Note: This PR keeps `postgres` as the default catalog type in `storage.conf`. Switching to REST catalog will be enabled in subsequent deployment PRs.
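For readers unfamiliar with Iceberg REST catalogs, the following is a rough sketch of how a connection like the one `createRestCatalog()` sets up might be built. It is not the PR's implementation. `RestCatalogSketch`, `RestCatalogSettings`, and its fields are hypothetical names; the property keys are Iceberg's documented ones. It assumes Scala 2.13 with `iceberg-core` and `iceberg-aws` on the classpath, and that the object store is MinIO reached through path-style addressing. Credentials are omitted on purpose.

```scala
import org.apache.iceberg.CatalogProperties
import org.apache.iceberg.rest.RESTCatalog
import scala.jdk.CollectionConverters._

object RestCatalogSketch {
  // Hypothetical settings holder; the PR reads these values from StorageConfig.
  final case class RestCatalogSettings(
      uri: String,        // REST catalog endpoint, e.g. a Lakekeeper URL
      warehouse: String,  // warehouse name registered in the catalog
      region: String,     // region passed to the AWS client
      s3Endpoint: String  // MinIO (or other S3-compatible) endpoint
  )

  // Build the property map that Iceberg expects for a REST catalog with S3FileIO.
  def properties(s: RestCatalogSettings): Map[String, String] = Map(
    CatalogProperties.URI -> s.uri,
    CatalogProperties.WAREHOUSE_LOCATION -> s.warehouse,
    CatalogProperties.FILE_IO_IMPL -> "org.apache.iceberg.aws.s3.S3FileIO",
    "client.region" -> s.region,
    "s3.endpoint" -> s.s3Endpoint,
    "s3.path-style-access" -> "true" // assumption: MinIO without virtual-host buckets
  )

  // Initialize the catalog; this contacts the REST server, so it needs one running.
  def connect(name: String, s: RestCatalogSettings): RESTCatalog = {
    val catalog = new RESTCatalog()
    catalog.initialize(name, properties(s).asJava)
    catalog
  }
}
```

Keeping the property map separate from `connect` makes the configuration easy to inspect without a running catalog server.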
Any related issues, documentation, discussions?
Part of #4126. Subsequent PRs will cover:
How was this PR tested?
Manual
Was this PR authored or co-authored using generative AI tooling?
co-authored with Claude