feat(dao): add operator_port_cache table by Xiao-zhen-Liu · Pull Request #5967 · apache/texera

Xiao-zhen-Liu · 2026-06-28T17:29:59Z

What changes were proposed in this PR?

Adds the operator_port_cache table that records a materialized output port
result so it can be reused across executions. It is keyed by
(workflow_id, global_port_id, cache_key) and stores the JSON the cache key was
computed from, the result location, an optional tuple count and source execution
id, and a database-managed updated_at. The foreign key to workflow(wid) is
ON DELETE CASCADE. The stored JSON (cache_key_json) lets a lookup confirm a
hash match by comparing the full JSON, so a hash collision never reuses the wrong
result.

The change is additive: a new table in sql/texera_ddl.sql (fresh installs) plus
a Liquibase migration sql/updates/26.sql registered in sql/changelog.xml
(existing deployments). No code reads or writes the table yet; the cache read/write
logic and its tests land with the cache service that uses it, following the
convention of testing a table through its consumer (as feedback is tested via
FeedbackResourceSpec).

Any related issues, documentation, discussions?

Closes #5969. Part of the storage foundation #5882 (umbrella #5881). Design discussion: #5880.

How was this PR tested?

Verified the schema directly against Postgres: the migration applies cleanly, the
columns and primary key (workflow_id, global_port_id, cache_key) are correct,
the foreign key's delete rule is CASCADE, the schema file and the migration
define identical columns/keys, and changelog.xml is well-formed and registers
26.sql. The generated jOOQ classes build from the table. The table's runtime
behavior is exercised by the cache service tests in the follow-up PR.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8 (Claude Code)

Adds the operator_port_cache table (texera_ddl.sql + Liquibase migration sql/updates/26.sql), keyed by (workflow_id, global_port_id, cache_key) with ON DELETE CASCADE to workflow. The cache read/write logic and its tests land with the cache service that uses it. Part of apache#5882.

github-actions · 2026-06-28T17:30:16Z

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

Contributors with relevant context: @aicam
You can notify them by mentioning @aicam in a comment.

codecov-commenter · 2026-06-28T17:32:29Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.63%. Comparing base (a24d1d1) to head (8a6554e).
⚠️ Report is 27 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #5967      +/-   ##
============================================
+ Coverage     56.28%   56.63%   +0.35%     
- Complexity     2992     3023      +31     
============================================
  Files          1120     1121       +1     
  Lines         43217    43288      +71     
  Branches       4662     4667       +5     
============================================
+ Hits          24326    24518     +192     
+ Misses        17472    17330     -142     
- Partials       1419     1440      +21

Flag	Coverage Δ		*Carryforward flag
access-control-service	`70.00% <ø> (ø)`
agent-service	`44.95% <ø> (ø)`		Carriedforward from f1d85fb
amber	`58.59% <ø> (+0.78%)`	⬆️
computing-unit-managing-service	`0.00% <ø> (ø)`
config-service	`52.30% <ø> (+0.74%)`	⬆️
file-service	`62.81% <ø> (+3.79%)`	⬆️
frontend	`49.33% <ø> (ø)`		Carriedforward from f1d85fb
notebook-migration-service	`78.57% <ø> (ø)`
pyamber	`90.20% <ø> (ø)`		Carriedforward from f1d85fb
python	`90.76% <ø> (ø)`		Carriedforward from f1d85fb
workflow-compiling-service	`55.14% <ø> (ø)`

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Xiao-zhen-Liu · 2026-06-28T17:35:10Z

@carloea2 could you help review this one when you get a chance? Thanks!

github-actions · 2026-06-28T17:35:43Z

⚠️ Benchmark changes need a look

🟢 7 better · 🔴 3 worse · ⚪ 5 noise (<±5%) · 0 without baseline

Compared against main a24d1d1 benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

	config	throughput	MB/s	latency	max Δ latest / 7d
🔴	bs=10 sw=10 sl=64	426	0.26	23,160/30,286/30,286 us	🔴 +11.6% / 🔴 +101.0%
🟢	bs=100 sw=10 sl=64	950	0.58	105,644/119,331/119,331 us	🟢 -10.0% / 🔴 +9.6%
🔴	bs=1000 sw=10 sl=64	1,085	0.662	918,400/1,018,610/1,018,610 us	🔴 +6.7% / 🟢 -8.3%

Baseline details

Latest main a24d1d1 from same runner

config	metric	PR	latest main	7d avg	Δ latest	Δ 7d
bs=10 sw=10 sl=64	throughput	426 tuples/sec	447 tuples/sec	770.82 tuples/sec	-4.7%	-44.7%
bs=10 sw=10 sl=64	MB/s	0.26 MB/s	0.273 MB/s	0.47 MB/s	-4.8%	-44.7%
bs=10 sw=10 sl=64	p50	23,160 us	20,758 us	12,723 us	+11.6%	+82.0%
bs=10 sw=10 sl=64	p95	30,286 us	32,226 us	15,070 us	-6.0%	+101.0%
bs=10 sw=10 sl=64	p99	30,286 us	32,226 us	18,429 us	-6.0%	+64.3%
bs=100 sw=10 sl=64	throughput	950 tuples/sec	885 tuples/sec	973.75 tuples/sec	+7.3%	-2.4%
bs=100 sw=10 sl=64	MB/s	0.58 MB/s	0.54 MB/s	0.594 MB/s	+7.4%	-2.4%
bs=100 sw=10 sl=64	p50	105,644 us	111,819 us	102,519 us	-5.5%	+3.0%
bs=100 sw=10 sl=64	p95	119,331 us	132,558 us	108,855 us	-10.0%	+9.6%
bs=100 sw=10 sl=64	p99	119,331 us	132,558 us	117,788 us	-10.0%	+1.3%
bs=1000 sw=10 sl=64	throughput	1,085 tuples/sec	1,115 tuples/sec	1,004 tuples/sec	-2.7%	+8.1%
bs=1000 sw=10 sl=64	MB/s	0.662 MB/s	0.681 MB/s	0.613 MB/s	-2.8%	+8.0%
bs=1000 sw=10 sl=64	p50	918,400 us	890,152 us	1,001,930 us	+3.2%	-8.3%
bs=1000 sw=10 sl=64	p95	1,018,610 us	954,686 us	1,042,923 us	+6.7%	-2.3%
bs=1000 sw=10 sl=64	p99	1,018,610 us	954,686 us	1,074,893 us	+6.7%	-5.2%

Raw CSV

config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,469.09,200,128000,426,0.260,23159.89,30286.23,30286.23
1,100,10,64,20,2105.44,2000,1280000,950,0.580,105644.20,119331.22,119331.22
2,1000,10,64,20,18426.53,20000,12800000,1085,0.662,918399.53,1018610.21,1018610.21

Yicong-Huang · 2026-06-28T18:43:10Z

+    cache_key           CHAR(64) NOT NULL,
+    cache_key_json      TEXT NOT NULL,


why do we have two parts?

Two columns on purpose: cache_key is the SHA-256 hash we look up by (and part of the PK), and cache_key_json is the JSON it was computed from. On a hash hit we compare the stored JSON to confirm the match, so a hash collision can never reuse the wrong result.

Yicong-Huang · 2026-06-28T18:43:40Z

+    global_port_id      VARCHAR(200) NOT NULL,
+    cache_key           CHAR(64) NOT NULL,
+    cache_key_json      TEXT NOT NULL,
+    result_uri          TEXT NOT NULL,


please change to storage_uri. result indicates direction.

Done, renamed to storage_uri.

Yicong-Huang · 2026-06-28T18:45:24Z

+    cache_key           CHAR(64) NOT NULL,
+    cache_key_json      TEXT NOT NULL,
+    result_uri          TEXT NOT NULL,
+    tuple_count         BIGINT,


do we want to store statistics separately? I think iceberg catalog has this info already, is this duplicate?

I'd keep tuple_count. A cache row is immutable (a changed computation is a new row, never an update), so it can't drift from the Iceberg result. It's populated at materialization and read by the coordinator during execution — the coordinator already queries this table for the cache lookup, so a cached region's output stats come from the same row with no extra read. Reading it from getCount() instead is a catalog/storage round-trip per cached port on the coordinator, which is the work the cache is meant to avoid.

Yicong-Huang · 2026-06-28T18:46:40Z

+    tuple_count         BIGINT,
+    source_execution_id BIGINT,
+    updated_at          TIMESTAMPTZ NOT NULL DEFAULT now(),
+    PRIMARY KEY (workflow_id, global_port_id, cache_key),


how come execution id is not part of primary key?

Intentional — the cache is reused across executions, keyed by cache_key (the upstream computation hash). Putting execution_id in the PK would create a row per execution and prevent reuse. source_execution_id is provenance metadata. The per-execution tables key by execution because they store per-execution results — a different concern.

Yicong-Huang · 2026-06-29T00:31:26Z

@Xiao-zhen-Liu please link issue properly

Address review: result implies a direction, storage_uri is clearer. tuple_count is kept (immutable per row, populated at materialization, read by the coordinator alongside the cache lookup so cached-region stats need no extra storage round-trip).

Xiao-zhen-Liu · 2026-06-29T17:05:59Z

Thanks @Yicong-Huang — replies inline. Renamed result_uri -> storage_uri. Two I kept, with reasoning inline: tuple_count (a cache row is immutable so it can't drift, and the coordinator reads it alongside the cache lookup so cached-region stats need no extra storage round-trip) and the PK without execution_id (the cache is reused across executions, keyed by cache_key). Re-requesting your review.

github-actions Bot assigned Xiao-zhen-Liu Jun 28, 2026

github-actions Bot added the ddl-change Changes to the TexeraDB DDL label Jun 28, 2026

Xiao-zhen-Liu mentioned this pull request Jun 28, 2026

Add operator output port cache storage (table and cache key) #5882

Open

6 tasks

Xiao-zhen-Liu requested a review from Yicong-Huang June 28, 2026 17:35

Xiao-zhen-Liu mentioned this pull request Jun 28, 2026

Add the operator_port_cache table #5969

Open

6 tasks

Yicong-Huang requested changes Jun 28, 2026

View reviewed changes

Xiao-zhen-Liu requested a review from Yicong-Huang June 29, 2026 17:06

Uh oh!

Conversation

Xiao-zhen-Liu commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this PR?

Any related issues, documentation, discussions?

How was this PR tested?

Was this PR authored or co-authored using generative AI tooling?

Uh oh!

github-actions Bot commented Jun 28, 2026

Automated Reviewer Suggestions

Uh oh!

codecov-commenter commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Xiao-zhen-Liu commented Jun 28, 2026

Uh oh!

github-actions Bot commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ Benchmark changes need a look

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Yicong-Huang commented Jun 29, 2026

Uh oh!

Xiao-zhen-Liu commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Xiao-zhen-Liu commented Jun 28, 2026 •

edited

Loading

codecov-commenter commented Jun 28, 2026 •

edited

Loading

github-actions Bot commented Jun 28, 2026 •

edited

Loading