Skip to content

feat(dao): add operator_port_cache table#5967

Open
Xiao-zhen-Liu wants to merge 2 commits into
apache:mainfrom
Xiao-zhen-Liu:cache-table
Open

feat(dao): add operator_port_cache table#5967
Xiao-zhen-Liu wants to merge 2 commits into
apache:mainfrom
Xiao-zhen-Liu:cache-table

Conversation

@Xiao-zhen-Liu

@Xiao-zhen-Liu Xiao-zhen-Liu commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

Adds the operator_port_cache table that records a materialized output port
result so it can be reused across executions. It is keyed by
(workflow_id, global_port_id, cache_key) and stores the JSON the cache key was
computed from, the result location, an optional tuple count and source execution
id, and a database-managed updated_at. The foreign key to workflow(wid) is
ON DELETE CASCADE. The stored JSON (cache_key_json) lets a lookup confirm a
hash match by comparing the full JSON, so a hash collision never reuses the wrong
result.

The change is additive: a new table in sql/texera_ddl.sql (fresh installs) plus
a Liquibase migration sql/updates/26.sql registered in sql/changelog.xml
(existing deployments). No code reads or writes the table yet; the cache read/write
logic and its tests land with the cache service that uses it, following the
convention of testing a table through its consumer (as feedback is tested via
FeedbackResourceSpec).

Any related issues, documentation, discussions?

Closes #5969. Part of the storage foundation #5882 (umbrella #5881). Design discussion: #5880.

How was this PR tested?

Verified the schema directly against Postgres: the migration applies cleanly, the
columns and primary key (workflow_id, global_port_id, cache_key) are correct,
the foreign key's delete rule is CASCADE, the schema file and the migration
define identical columns/keys, and changelog.xml is well-formed and registers
26.sql. The generated jOOQ classes build from the table. The table's runtime
behavior is exercised by the cache service tests in the follow-up PR.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8 (Claude Code)

Adds the operator_port_cache table (texera_ddl.sql + Liquibase migration sql/updates/26.sql), keyed by (workflow_id, global_port_id, cache_key) with ON DELETE CASCADE to workflow. The cache read/write logic and its tests land with the cache service that uses it. Part of apache#5882.
@github-actions github-actions Bot added the ddl-change Changes to the TexeraDB DDL label Jun 28, 2026
@github-actions

Copy link
Copy Markdown
Contributor

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

  • Contributors with relevant context: @aicam
    You can notify them by mentioning @aicam in a comment.

@codecov-commenter

codecov-commenter commented Jun 28, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.63%. Comparing base (a24d1d1) to head (8a6554e).
⚠️ Report is 27 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5967      +/-   ##
============================================
+ Coverage     56.28%   56.63%   +0.35%     
- Complexity     2992     3023      +31     
============================================
  Files          1120     1121       +1     
  Lines         43217    43288      +71     
  Branches       4662     4667       +5     
============================================
+ Hits          24326    24518     +192     
+ Misses        17472    17330     -142     
- Partials       1419     1440      +21     
Flag Coverage Δ *Carryforward flag
access-control-service 70.00% <ø> (ø)
agent-service 44.95% <ø> (ø) Carriedforward from f1d85fb
amber 58.59% <ø> (+0.78%) ⬆️
computing-unit-managing-service 0.00% <ø> (ø)
config-service 52.30% <ø> (+0.74%) ⬆️
file-service 62.81% <ø> (+3.79%) ⬆️
frontend 49.33% <ø> (ø) Carriedforward from f1d85fb
notebook-migration-service 78.57% <ø> (ø)
pyamber 90.20% <ø> (ø) Carriedforward from f1d85fb
python 90.76% <ø> (ø) Carriedforward from f1d85fb
workflow-compiling-service 55.14% <ø> (ø)

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Xiao-zhen-Liu

Copy link
Copy Markdown
Contributor Author

@carloea2 could you help review this one when you get a chance? Thanks!

@github-actions

github-actions Bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

⚠️ Benchmark changes need a look

🟢 7 better · 🔴 3 worse · ⚪ 5 noise (<±5%) · 0 without baseline

Compared against main a24d1d1 benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

config throughput MB/s latency max Δ latest / 7d
🔴 bs=10 sw=10 sl=64 426 0.26 23,160/30,286/30,286 us 🔴 +11.6% / 🔴 +101.0%
🟢 bs=100 sw=10 sl=64 950 0.58 105,644/119,331/119,331 us 🟢 -10.0% / 🔴 +9.6%
🔴 bs=1000 sw=10 sl=64 1,085 0.662 918,400/1,018,610/1,018,610 us 🔴 +6.7% / 🟢 -8.3%
Baseline details

Latest main a24d1d1 from same runner

config metric PR latest main 7d avg Δ latest Δ 7d
bs=10 sw=10 sl=64 throughput 426 tuples/sec 447 tuples/sec 770.82 tuples/sec -4.7% -44.7%
bs=10 sw=10 sl=64 MB/s 0.26 MB/s 0.273 MB/s 0.47 MB/s -4.8% -44.7%
bs=10 sw=10 sl=64 p50 23,160 us 20,758 us 12,723 us +11.6% +82.0%
bs=10 sw=10 sl=64 p95 30,286 us 32,226 us 15,070 us -6.0% +101.0%
bs=10 sw=10 sl=64 p99 30,286 us 32,226 us 18,429 us -6.0% +64.3%
bs=100 sw=10 sl=64 throughput 950 tuples/sec 885 tuples/sec 973.75 tuples/sec +7.3% -2.4%
bs=100 sw=10 sl=64 MB/s 0.58 MB/s 0.54 MB/s 0.594 MB/s +7.4% -2.4%
bs=100 sw=10 sl=64 p50 105,644 us 111,819 us 102,519 us -5.5% +3.0%
bs=100 sw=10 sl=64 p95 119,331 us 132,558 us 108,855 us -10.0% +9.6%
bs=100 sw=10 sl=64 p99 119,331 us 132,558 us 117,788 us -10.0% +1.3%
bs=1000 sw=10 sl=64 throughput 1,085 tuples/sec 1,115 tuples/sec 1,004 tuples/sec -2.7% +8.1%
bs=1000 sw=10 sl=64 MB/s 0.662 MB/s 0.681 MB/s 0.613 MB/s -2.8% +8.0%
bs=1000 sw=10 sl=64 p50 918,400 us 890,152 us 1,001,930 us +3.2% -8.3%
bs=1000 sw=10 sl=64 p95 1,018,610 us 954,686 us 1,042,923 us +6.7% -2.3%
bs=1000 sw=10 sl=64 p99 1,018,610 us 954,686 us 1,074,893 us +6.7% -5.2%
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,469.09,200,128000,426,0.260,23159.89,30286.23,30286.23
1,100,10,64,20,2105.44,2000,1280000,950,0.580,105644.20,119331.22,119331.22
2,1000,10,64,20,18426.53,20000,12800000,1085,0.662,918399.53,1018610.21,1018610.21

Comment thread sql/updates/26.sql
Comment on lines +37 to +38
cache_key CHAR(64) NOT NULL,
cache_key_json TEXT NOT NULL,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have two parts?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two columns on purpose: cache_key is the SHA-256 hash we look up by (and part of the PK), and cache_key_json is the JSON it was computed from. On a hash hit we compare the stored JSON to confirm the match, so a hash collision can never reuse the wrong result.

Comment thread sql/updates/26.sql Outdated
global_port_id VARCHAR(200) NOT NULL,
cache_key CHAR(64) NOT NULL,
cache_key_json TEXT NOT NULL,
result_uri TEXT NOT NULL,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please change to storage_uri. result indicates direction.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, renamed to storage_uri.

Comment thread sql/updates/26.sql
cache_key CHAR(64) NOT NULL,
cache_key_json TEXT NOT NULL,
result_uri TEXT NOT NULL,
tuple_count BIGINT,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to store statistics separately? I think iceberg catalog has this info already, is this duplicate?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd keep tuple_count. A cache row is immutable (a changed computation is a new row, never an update), so it can't drift from the Iceberg result. It's populated at materialization and read by the coordinator during execution — the coordinator already queries this table for the cache lookup, so a cached region's output stats come from the same row with no extra read. Reading it from getCount() instead is a catalog/storage round-trip per cached port on the coordinator, which is the work the cache is meant to avoid.

Comment thread sql/texera_ddl.sql
tuple_count BIGINT,
source_execution_id BIGINT,
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (workflow_id, global_port_id, cache_key),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how come execution id is not part of primary key?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional — the cache is reused across executions, keyed by cache_key (the upstream computation hash). Putting execution_id in the PK would create a row per execution and prevent reuse. source_execution_id is provenance metadata. The per-execution tables key by execution because they store per-execution results — a different concern.

@Yicong-Huang

Copy link
Copy Markdown
Contributor

@Xiao-zhen-Liu please link issue properly

Address review: result implies a direction, storage_uri is clearer. tuple_count is kept (immutable per row, populated at materialization, read by the coordinator alongside the cache lookup so cached-region stats need no extra storage round-trip).
@Xiao-zhen-Liu

Copy link
Copy Markdown
Contributor Author

Thanks @Yicong-Huang — replies inline. Renamed result_uri -> storage_uri. Two I kept, with reasoning inline: tuple_count (a cache row is immutable so it can't drift, and the coordinator reads it alongside the cache lookup so cached-region stats need no extra storage round-trip) and the PK without execution_id (the cache is reused across executions, keyed by cache_key). Re-requesting your review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ddl-change Changes to the TexeraDB DDL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add the operator_port_cache table

3 participants