[core] Introduce vector-store for data-evolution table#7240

Open
ColdL wants to merge 2 commits into apache:master from ColdL:add-vectortype-dataevolution
ColdL commented Feb 9, 2026

Purpose

Linked issue: update #7011

This PR optimizes the storage layout for vector workloads in Data Evolution tables by storing vector columns, and optionally their associated columns, in a separately configured file format.

For example, scalar columns can be stored in Parquet, while vector columns and columns that may need point lookups during vector search can be stored in a format such as Lance.

1. Configuration

This PR introduces three new configuration options:

  • vector-field: the names of the columns to store separately
  • vector.file.format: the file format used for those separately stored columns
  • vector.target-file-size: the target file size at which a separately stored file rolls over
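
As a rough illustration, the three options could be passed like any other table options. The sketch below is plain Java with no Paimon dependency. Only the three option keys come from this PR; the class and method names, the column names `embedding` and `doc_id`, the comma-separated list syntax, and the `128 mb` value are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Paimon: collects the three option keys
// proposed in this PR into a plain table-options map.
final class VectorStoreOptionsSketch {

    static Map<String, String> vectorStoreOptions(
            String fields, String fileFormat, String targetFileSize) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("vector-field", fields);                     // columns stored separately
        options.put("vector.file.format", fileFormat);           // format of the separate files
        options.put("vector.target-file-size", targetFileSize);  // rolling threshold
        return options;
    }

    public static void main(String[] args) {
        // Column names, list syntax, and size are illustrative assumptions.
        Map<String, String> options =
                vectorStoreOptions("embedding,doc_id", "lance", "128 mb");
        options.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```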

2. Storage Layout

When this feature is enabled, the columns listed in vector-field are stored separately, in the file format given by vector.file.format. These files are marked by a .vector-store. segment in the data file path.

File Path Pattern: data-xxx-{count}.vector-store.{file-format}

This design serves two purposes:

  • (1) The .vector-store. segment identifies these as separately stored column groups
  • (2) The trailing .{file-format} follows current conventions, using the file format as the suffix

Note: .vector. may be a better marker than .vector-store.; if that is confirmed, I will update the PR accordingly. Please see the discussion below for details.

The final storage layout might be:

  • data-xxx-0.parquet
  • data-xxx-1.blob
  • data-xxx-2.vector-store.lance
  • data-xxx-3.vector-store.lance
  • data-xxx-4.vector-store.lance

These vector-store files are associated with regular columns through Row-tracking / Data Evolution.
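
To make the naming pattern concrete, here is a minimal sketch of how a file name following data-xxx-{count}.vector-store.{file-format} might be built and recognized. It is plain Java and not Paimon code; the class and method names are hypothetical, and the rule that the format suffix contains no further dots is an assumption of the sketch.

```java
import java.util.Optional;

// Hypothetical helper, not Paimon code: builds and recognizes names of the
// proposed form "<prefix>-<count>.vector-store.<file-format>".
final class VectorStoreFileNameSketch {

    private static final String MARKER = ".vector-store.";

    /** Returns the format suffix if the name carries the vector-store marker. */
    static Optional<String> vectorStoreFormat(String fileName) {
        int idx = fileName.lastIndexOf(MARKER);
        if (idx <= 0) {
            return Optional.empty(); // marker missing, or nothing before it
        }
        String format = fileName.substring(idx + MARKER.length());
        // Assumption: the format is a single suffix without dots.
        if (format.isEmpty() || format.contains(".")) {
            return Optional.empty();
        }
        return Optional.of(format);
    }

    /** Builds a name following the pattern described in the PR. */
    static String vectorStoreFileName(String prefix, int count, String format) {
        return prefix + "-" + count + MARKER + format;
    }

    public static void main(String[] args) {
        String[] names = {
            "data-xxx-0.parquet", "data-xxx-1.blob", "data-xxx-2.vector-store.lance"
        };
        for (String name : names) {
            System.out.println(
                    name + " -> " + vectorStoreFormat(name).orElse("(not a vector-store file)"));
        }
    }
}
```

Keeping the file format as the final suffix means existing code that picks a reader from the extension would still see `lance` here, while the marker segment separates these files from regular data files.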

Tests

API and Format

Documentation

@ColdL ColdL marked this pull request as draft February 9, 2026 08:13
@ColdL ColdL marked this pull request as ready for review February 9, 2026 08:13
@ColdL ColdL marked this pull request as draft February 9, 2026 08:13
@ColdL ColdL force-pushed the add-vectortype-dataevolution branch 3 times, most recently from 0ef7695 to 5e012b4 Compare February 12, 2026 09:17
@ColdL ColdL marked this pull request as ready for review February 12, 2026 10:03
@ColdL ColdL force-pushed the add-vectortype-dataevolution branch from 5e012b4 to 4976c76 Compare February 25, 2026 02:37
.noDefaultValue()
.withDescription("Specify the vector store fields.");

public static final ConfigOption<MemorySize> VECTOR_STORE_TARGET_FILE_SIZE =
Contributor

How about these names? @JingsongLi

Contributor Author

Fixed. Now the config names in code are consistent with the public configuration keys.

@leaves12138 leaves12138 changed the title add vector-store with data evolution [core] Introduce vector-store for data-evolution table Feb 26, 2026
@ColdL ColdL force-pushed the add-vectortype-dataevolution branch 2 times, most recently from 637e7d6 to 549c2b3 Compare February 26, 2026 10:08
leaves12138 left a comment

LGTM
Thanks @ColdL, can you rebase onto the latest master to resolve the conflict?

@ColdL ColdL force-pushed the add-vectortype-dataevolution branch 2 times, most recently from fe6d1c1 to 900cb36 Compare February 27, 2026 03:43
@ColdL ColdL force-pushed the add-vectortype-dataevolution branch from 900cb36 to 1bbc24b Compare February 27, 2026 03:48
@JingsongLi

I think the PR statement should clearly state a few things:

  1. What configuration determines the separate storage of Vector? We need to design this configuration.
  2. How to name the Vector file, perhaps it would be better to end with .vector.

@JingsongLi

You can also create a separate doc in paimon/docs.

ColdL commented Feb 27, 2026

I think the PR statement should clearly state a few things:

  1. What configuration determines the separate storage of Vector? We need to design this configuration.
  2. How to name the Vector file, perhaps it would be better to end with .vector.

@JingsongLi Thanks for the review! I've updated the PR description. After confirmation, I will continue to update and add the corresponding docs.

JingsongLi commented Feb 27, 2026

@ColdL How about:

  1. vector.file.format => none by default; if configured, vectors are stored separately in files of that format.
  2. For the file name, just use xxx.vector.lance, consistent with blob files.
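
One way to read this suggestion is sketched below: separate vector storage is switched on only when vector.file.format is set, and the file name uses a .vector. segment. This is plain Java and not Paimon code. The class and method names are hypothetical, and treating an unset or blank option as "disabled" is an assumption about what "by default none" means.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the suggestion above, not Paimon code.
final class VectorFormatToggleSketch {

    /** Separate vector storage is on only when vector.file.format is configured. */
    static Optional<String> vectorFileFormat(Map<String, String> options) {
        String format = options.get("vector.file.format");
        // Assumption: an unset or blank value means the feature is disabled.
        if (format == null || format.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(format.trim());
    }

    /** Builds a "<base>.vector.<format>" name, for example xxx.vector.lance. */
    static String vectorFileName(String base, String format) {
        return base + ".vector." + format;
    }

    public static void main(String[] args) {
        Map<String, String> options =
                Collections.singletonMap("vector.file.format", "lance");
        vectorFileFormat(options)
                .ifPresent(f -> System.out.println(vectorFileName("xxx", f)));
    }
}
```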
