
feat: introduce Result Service using Lakekeeper as REST catalog for Iceberg - catalog migration #4272

Open
mengw15 wants to merge 14 commits into apache:main from mengw15:Lakekeeper-catalog-migration

Conversation

mengw15 (Contributor) commented Mar 9, 2026

What changes were proposed in this PR?

This is PR 1 of a decomposed series from #4242, focusing on the core Iceberg catalog migration to support Lakekeeper as a
REST catalog.

Scala changes:

  • IcebergUtil.scala: added createRestCatalog() for REST catalog connections with S3FileIO (MinIO), and namespace auto-creation for all catalog types
  • IcebergCatalogInstance.scala: updated singleton to support REST catalog type selection
  • IcebergTableWriter.scala: updated for REST catalog compatibility
  • StorageConfig.scala / EnvironmentalVariable.scala: added REST catalog configuration (URI, warehouse name, region, S3
    bucket) and environment variable support
  • storage.conf: added REST catalog config section (default remains postgres for backward compatibility)
  • build.sbt: added iceberg-aws, AWS SDK dependencies, and Netty version override for Arrow compatibility
  • PythonWorkflowWorker.scala / ComputingUnitManagingResource.scala: propagate REST catalog config to Python workers and
    computing units

Python changes:

  • iceberg_catalog_instance.py / iceberg_utils.py: added REST catalog support via PyIceberg
  • storage_config.py: added REST catalog configuration parsing
  • texera_run_python_worker.py: accept REST catalog config from Scala side
  • requirements.txt: upgraded PyIceberg (0.8.1 → 0.9.0), added s3fs/aiobotocore for S3 access
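
The PyIceberg-side configuration described above can be sketched as follows. The helper name and all concrete values (endpoints, warehouse name, credentials) are illustrative assumptions, not the PR's actual code; the property keys follow PyIceberg's REST catalog conventions.

```python
# Sketch: map flat storage-config values onto PyIceberg REST catalog
# properties for a Lakekeeper catalog backed by MinIO (S3 API).
def build_rest_catalog_properties(
    uri: str,
    warehouse: str,
    s3_endpoint: str,
    s3_access_key: str,
    s3_secret_key: str,
    s3_region: str,
) -> dict:
    """Build the property dict PyIceberg expects for a REST catalog."""
    return {
        "uri": uri,                       # Lakekeeper's REST endpoint
        "warehouse": warehouse,           # warehouse registered in Lakekeeper
        "s3.endpoint": s3_endpoint,       # MinIO endpoint used by the S3 FileIO
        "s3.access-key-id": s3_access_key,
        "s3.secret-access-key": s3_secret_key,
        "s3.region": s3_region,
    }

props = build_rest_catalog_properties(
    uri="http://localhost:8181/catalog",
    warehouse="texera-warehouse",
    s3_endpoint="http://localhost:9000",
    s3_access_key="minio",
    s3_secret_key="minio-secret",
    s3_region="us-east-1",
)
# With a running Lakekeeper instance, the catalog would then be loaded via:
#   from pyiceberg.catalog import load_catalog
#   catalog = load_catalog("rest", **props)
```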

Database:

  • texera_lakekeeper.sql: schema for Lakekeeper's backing database

Note: This PR keeps postgres as the default catalog type in storage.conf. Switching to REST catalog will be enabled
in subsequent deployment PRs.
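
The PR description lists the new REST catalog settings (URI, warehouse name, region, S3 bucket) but not the file itself. A plausible shape of the new storage.conf section, with postgres kept as the default, might look like the following; every key name and value here is an illustrative assumption:

```hocon
storage {
  iceberg-catalog {
    # "postgres" remains the default; "rest" would enable Lakekeeper
    type = "postgres"
    rest {
      uri = "http://lakekeeper:8181/catalog"
      warehouse = "texera-warehouse"
      region = "us-east-1"
      s3-bucket = "texera-iceberg"
    }
  }
}
```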

Any related issues, documentation, discussions?

Part of #4126. Subsequent PRs will cover:

  • Lakekeeper bootstrap script
  • Single-node deployment
  • Kubernetes deployment
  • CI integration

How was this PR tested?

Manual

Was this PR authored or co-authored using generative AI tooling?

co-authored with Claude

@bobbai00 bobbai00 self-requested a review March 17, 2026 22:06
bobbai00 (Contributor) left a comment


Left some comments.

I closed your original PR. Please describe the milestones (your PR's plan) in the issue and update the description of the current PR.

cached_property==1.5.2
psutil==5.9.0
tzlocal==2.1
s3fs==2025.9.0
Contributor

The latest version is 2026.2.0. Can you try using the latest version?

Contributor Author

These three libraries have version compatibility constraints, and they also need to stay compatible with boto3. If we try to update them, some other libraries may also need to be updated accordingly.

tzlocal==2.1
s3fs==2025.9.0
aiobotocore==2.25.1
botocore==1.40.53
Contributor

Ditto for these two libraries.

Contributor Author

These three libraries have version compatibility constraints, and they also need to stay compatible with boto3. If we try to update them, some other libraries may also need to be updated accordingly.

if (buffer.nonEmpty) {
// Create a unique file path using the writer's identifier and the filename index
val filepath = Paths.get(table.location()).resolve(s"${writerIdentifier}_${filenameIdx}")
// Handle S3 URIs (s3://) differently from local file paths to preserve URI format
Contributor

This logic is very ad hoc. Can you avoid the if condition on the file path's prefix?

Contributor

Try to use a universal logic for the file path.

Contributor Author

Simplified to use string concatenation for all URI schemes. Would suggest testing on Windows as well, since we've had path-related issues on Windows before.
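
The fix described above (Scala in the actual PR) can be sketched in Python: building the data-file path by plain string concatenation keeps an "s3://" scheme intact, whereas a filesystem path API would typically collapse the double slash. The function name is illustrative, not the PR's actual code.

```python
# Join a table location and a file name without mangling URI schemes.
# Works uniformly for s3:// URIs and plain local paths.
def resolve_data_file(table_location: str, file_name: str) -> str:
    """Append file_name to table_location with exactly one separator."""
    return table_location.rstrip("/") + "/" + file_name

assert resolve_data_file("s3://bucket/warehouse/tbl", "w1_0") == "s3://bucket/warehouse/tbl/w1_0"
assert resolve_data_file("/tmp/warehouse/tbl/", "w1_0") == "/tmp/warehouse/tbl/w1_0"
```

Note that forward slashes are used unconditionally, which is one reason the reviewer's suggestion to verify this on Windows (where native paths use backslashes) is worth following up.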

TableProperties.COMMIT_MIN_RETRY_WAIT_MS -> StorageConfig.icebergTableCommitMinRetryWaitMs.toString
)

val namespace = Namespace.of(tableNamespace)
Contributor

What is the purpose of this check?

Contributor Author

This ensures the namespace exists before creating a table. REST catalogs (like Lakekeeper) require the namespace to be explicitly created first, unlike the Postgres JDBC catalog which auto-creates it.
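
The idempotent "create the namespace if it does not already exist" step can be sketched as follows. The catalog here is a minimal stand-in interface, not PyIceberg or the PR's actual Scala code.

```python
# Sketch: ensure a namespace exists before table creation, as REST
# catalogs such as Lakekeeper require. Safe to call repeatedly.
class NamespaceAlreadyExistsError(Exception):
    pass

def ensure_namespace(catalog, namespace: str) -> None:
    """Create the namespace, tolerating the case where it already exists."""
    try:
        catalog.create_namespace(namespace)
    except NamespaceAlreadyExistsError:
        pass  # already present: nothing to do

class _StubCatalog:
    """Tiny in-memory catalog used only to demonstrate the helper."""
    def __init__(self):
        self.namespaces = set()
    def create_namespace(self, ns):
        if ns in self.namespaces:
            raise NamespaceAlreadyExistsError(ns)
        self.namespaces.add(ns)

cat = _StubCatalog()
ensure_namespace(cat, "results")
ensure_namespace(cat, "results")  # second call is a no-op
```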

@mengw15 mengw15 requested a review from bobbai00 March 18, 2026 00:02

Labels

common, ddl-change (Changes to the TexeraDB DDL), dependencies (Pull requests that update a dependency file), engine, python, service
