[feature](cloud) Add table-level event-driven warm up #63832
Open
bobhan1 wants to merge 5 commits into
Force-pushed from 65920e0 to b67c9f7
Contributor, Author
run buildall
Contributor
TPC-H: Total hot run time: 31875 ms
Contributor
TPC-DS: Total hot run time: 172324 ms
Contributor
FE Regression Coverage Report: Increment line coverage
bobhan1 added a commit to bobhan1/doris that referenced this pull request (May 29, 2026)
### What problem does this PR solve?

Issue Number: None

Related PR: apache#63832

Problem Summary: The table-level warm-up change adds a table_id argument before sync_wait_timeout_ms in CloudWarmUpManager::warm_up_rowset. After rebasing onto the latest master, the existing CloudWarmUpManagerTest calls still used the old two-argument form, so the positive-timeout test passed 1000 as table_id and left sync_wait_timeout_ms at its default -1. That made the test take the async non-positive-timeout branch, so the before-wait sync point was never reached and the spurious notify assertion failed. Update the test calls to pass table_id and sync_wait_timeout_ms explicitly.

### Release note

None

### Check List (For Author)

- Test:
    - Unit Test: ./run-be-ut.sh --run --filter=CloudWarmUpManagerTest.* -j100
- Behavior changed: No.
- Does this need documentation: No.
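The failure mode described in this commit message can be reproduced in isolation. The sketch below is a hypothetical stand-in, not the Doris code: `WarmUpCall`, the `rowset_id` string parameter, and the returned struct are assumptions made for illustration. Only the parameter order (table_id before sync_wait_timeout_ms) and the default of -1 for the timeout come from the message above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical record of how a call was interpreted (illustration only).
struct WarmUpCall {
    int64_t table_id;
    int64_t sync_wait_timeout_ms;
    bool waits_synchronously;
};

// Assumed shape of the new signature: table_id sits before the timeout,
// and the timeout keeps its default of -1.
inline WarmUpCall warm_up_rowset(const std::string& rowset_id, int64_t table_id,
                                 int64_t sync_wait_timeout_ms = -1) {
    (void)rowset_id;  // unused in this sketch
    // A non-positive timeout selects the asynchronous branch.
    return WarmUpCall{table_id, sync_wait_timeout_ms, sync_wait_timeout_ms > 0};
}

inline void check_stale_and_fixed_calls() {
    // Old two-argument call: 1000 was meant as the timeout, but it now binds
    // to table_id, and the timeout silently falls back to -1.
    WarmUpCall stale = warm_up_rowset("rs_1", 1000);
    assert(stale.table_id == 1000 && !stale.waits_synchronously);

    // Fixed call: both values are passed explicitly.
    WarmUpCall fixed = warm_up_rowset("rs_1", 42, 1000);
    assert(fixed.table_id == 42 && fixed.waits_synchronously);
}
```

Because both parameters are integers, the stale call still compiles without a warning, which is why the regression only surfaced as a failing sync-point assertion.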
Contributor, Author
run buildall
Contributor
TPC-H: Total hot run time: 31958 ms
Contributor
TPC-DS: Total hot run time: 172417 ms
bobhan1 added a commit to bobhan1/doris that referenced this pull request (May 29, 2026)
### What problem does this PR solve?

Issue Number: None

Related PR: apache#63832

Problem Summary: The table-level warm-up table filter performance tests used tight wall-clock thresholds for the 200K and 500K wildcard match-all cases. CI machines can run these scale tests slightly slower than local runs even though the matching implementation remains efficient. Relax the 200K threshold from 1s to 1.5s and the 500K threshold from 2s to 3s while keeping the existing functional assertions and smaller or more selective performance checks.

### Release note

None

### Check List (For Author)

- Test:
    - Unit Test: ./run-fe-ut.sh --run org.apache.doris.cloud.CacheHotspotManagerTableFilterTest
- Behavior changed: No.
- Does this need documentation: No.
Contributor, Author
run buildall
Contributor
Cloud UT Coverage Report: Increment line coverage; Increment coverage report
Contributor
FE UT Coverage Report: Increment line coverage
bobhan1 added a commit to bobhan1/doris that referenced this pull request (May 29, 2026)
### What problem does this PR solve?

Issue Number: None

Related PR: apache#63832

Problem Summary: The table-level warm-up table filter performance test for 200K tables with 15 include/exclude rules still used a tight 2s wall-clock threshold. CI can exceed that threshold under load while the matcher remains functionally correct. Relax the threshold to 3s and keep the matched-table assertion unchanged.

### Release note

None

### Check List (For Author)

- Test:
    - Unit Test: ./run-fe-ut.sh --run org.apache.doris.cloud.CacheHotspotManagerTableFilterTest
- Behavior changed: No.
- Does this need documentation: No.
Contributor, Author
run buildall
gavinchou previously approved these changes (May 29, 2026)
Issue Number: None
Related PR: None
Problem Summary: Add table-level event-driven warm-up support for cloud warm-up jobs. The change extends WARM UP ... ON TABLES parsing and validation, persists normalized include and exclude table filters, resolves matching table ids dynamically, prevents conflicting cluster-level and table-level load-event jobs, propagates table ids through BE warm-up requests, records per-job source and target warm-up progress metrics, and exposes compact and detailed SyncStats through SHOW WARM UP JOB and FE metrics. Virtual compute group rebuilds cancel existing table-level load-event jobs before recreating managed cluster-level jobs.
Support table-level event-driven cloud warm-up with ON TABLES filters and warm-up sync statistics.
- Test:
- Unit Test: ./run-fe-ut.sh --run org.apache.doris.cloud.OnTablesFilterTest,org.apache.doris.cloud.CloudWarmUpJobTableFilterTest,org.apache.doris.cloud.CacheHotspotManagerTableFilterTest,org.apache.doris.cloud.WarmUpStatsTest,org.apache.doris.cloud.WarmUpClusterOnTablesParseTest,org.apache.doris.cloud.catalog.CloudInstanceStatusCheckerTest,org.apache.doris.metric.MetricsTest#testCloudWarmUpSyncJobMetricsReadStatsDirectlyFromJob+testEventDrivenCloudWarmUpSyncJobTriggerGapMetric
- Unit Test: ./run-be-ut.sh --run --filter=CloudWarmUpManagerFilterTest.*:MBvarWindowedAdderTest.* -j100
- Manual test: build-support/check-format.sh
- Manual test: ./build.sh --be --fe --cloud -j100
- Manual test: docker build -f docker/runtime/doris-compose/Dockerfile -t bh-cluster-2 .
- Manual test: ./run-regression-test.sh --clean --compile
- Regression test: env -u HTTP_PROXY -u HTTPS_PROXY -u http_proxy -u https_proxy -u ALL_PROXY -u all_proxy ./run-regression-test.sh --run -d regression-test/suites/cloud_p0/cache/multi_cluster/warm_up/on_tables -runMode=cloud -image bh-cluster-2 -dockerSuiteParallel 1 (18/19 passed; test_warm_up_event_on_tables_overlap_and_mv failed because of a duplicate MV column name in the test SQL, before the test was fixed)
- Regression test: env -u HTTP_PROXY -u HTTPS_PROXY -u http_proxy -u https_proxy -u ALL_PROXY -u all_proxy ./run-regression-test.sh --run -d regression-test/suites/cloud_p0/cache/multi_cluster/warm_up/on_tables -s test_warm_up_event_on_tables_overlap_and_mv -runMode=cloud -image bh-cluster-2 -dockerSuiteParallel 1
- Behavior changed: Yes. WARM UP supports ON TABLES filters for event-driven load warm-up and SHOW WARM UP JOB exposes table filter, matched tables, and sync stats.
- Does this need documentation: Yes. Documentation for the new ON TABLES syntax and metrics should be added separately.
Force-pushed from a67fe97 to 44f6b85
gavinchou reviewed (May 29, 2026)
static constexpr int WINDOW_30M = 1800;
static constexpr int WINDOW_1H = 3600;

MBvarWindowedAdder g_warmup_ed_finish_segment_num("warmup_ed_finish_segment_num", {"job_id"},
Contributor
Are there any memory issues if there are many jobs?
How does bvar implement "windows"? Does it record a sample of the adder every second?
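To make the memory question concrete, here is a minimal sketch of one common way to build a windowed adder. It assumes a background sampler that calls `take_sample()` once per second, which is how the brpc bvar documentation describes window sampling as far as I know. `WindowedAdderSketch` and all of its members are hypothetical; this is not bvar or `MBvarWindowedAdder` code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical windowed counter (illustration only).
class WindowedAdderSketch {
public:
    // window_seconds + 1 slots: the extra slot holds the baseline sample.
    explicit WindowedAdderSketch(size_t window_seconds)
            : _samples(window_seconds + 1, 0) {}

    void add(int64_t delta) { _total += delta; }

    // Assumed to be called once per second by a sampler thread.
    // Stores the cumulative total and overwrites the oldest slot when full.
    void take_sample() {
        _head = (_head + 1) % _samples.size();
        _samples[_head] = _total;
        if (_count < _samples.size()) ++_count;
    }

    // Amount added within the window: newest sample minus oldest retained one.
    int64_t window_value() const {
        size_t oldest = (_head + _samples.size() - (_count - 1)) % _samples.size();
        return _samples[_head] - _samples[oldest];
    }

    // Fixed per-series footprint, independent of how often add() is called.
    size_t retained_samples() const { return _samples.size(); }

private:
    std::vector<int64_t> _samples;  // ring buffer of cumulative totals
    size_t _head = 0;               // index of the newest sample
    size_t _count = 1;              // slot 0 starts as the zero baseline
    int64_t _total = 0;
};

inline void check_window_sketch() {
    WindowedAdderSketch w(3);
    w.add(5);
    w.take_sample();
    assert(w.window_value() == 5);
}
```

Under this assumption, memory scales with the number of `job_id` label values times the window length in seconds: a `WINDOW_1H` series would retain 3601 samples (about 28 KB of 8-byte values) regardless of write volume.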
gavinchou reviewed (May 29, 2026)
failure_msg.append(failures[i].reason);
}

return Status::Error(code,
Contributor
TPC-H: Total hot run time: 31398 ms
Contributor
TPC-H: Total hot run time: 31974 ms
Contributor
TPC-DS: Total hot run time: 172895 ms
Contributor
TPC-DS: Total hot run time: 171939 ms
Contributor
BE UT Coverage Report: Increment line coverage; Increment coverage report
What problem does this PR solve?
Issue Number: None
Problem Summary:
This PR adds table-level event-driven cloud warm-up support and improves active incremental warm-up progress observability.
Before this change, event-driven warm-up was only controlled at compute-group granularity. Once a load-event warm-up job was enabled for a source and target compute group pair, all source-side table writes could trigger warm-up to the target compute group. That is inefficient for workloads where only selected core tables, high-frequency query tables, or selected async materialized views need to stay warm.
This PR lets users define the warm-up scope with `ON TABLES` when creating an event-driven load warm-up job. FE persists the normalized table filter in the warm-up job, resolves matched table ids dynamically, sends the table ids to BE, and lets BE filter warm-up rowsets by table id.
User-visible behavior:
- `WARM UP ... ON TABLES` supports table-level event-driven warm-up.
- `INCLUDE` and `EXCLUDE` rules.
- `*` and `?` wildcards, for example `db.table`, `db.*`, `*.orders_*`, and `log_db.log_?`.
- `INCLUDE` defines the candidate warm-up scope, and `EXCLUDE` removes tables from that included scope.
- `SHOW WARM UP JOB` exposes the table-level job type, table filter, matched tables, and SyncStats.
- `SHOW WARM UP JOB` list output keeps compact SyncStats, while single-job lookup keeps detailed windowed SyncStats.
Example:
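To illustrate the INCLUDE/EXCLUDE and wildcard semantics listed above, here is a minimal C++ sketch. It is not the FE matcher from this PR: `glob_match` and `TableFilterSketch` are hypothetical names, and the sketch assumes patterns are matched against the whole qualified name `db.table` as one string, so `*` may also span the dot. The real implementation may treat the database and table parts separately.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Iterative glob match: '*' matches any run of characters (possibly empty),
// '?' matches exactly one character. Backtracks to the most recent '*'.
inline bool glob_match(const std::string& pattern, const std::string& text) {
    size_t p = 0, t = 0;
    size_t star = std::string::npos;  // position of the last '*' seen
    size_t retry = 0;                 // text position to retry from
    while (t < text.size()) {
        if (p < pattern.size() && pattern[p] == '*') {
            star = p++;
            retry = t;
        } else if (p < pattern.size() && (pattern[p] == '?' || pattern[p] == text[t])) {
            ++p;
            ++t;
        } else if (star != std::string::npos) {
            p = star + 1;   // let the last '*' absorb one more character
            t = ++retry;
        } else {
            return false;
        }
    }
    while (p < pattern.size() && pattern[p] == '*') ++p;  // trailing stars
    return p == pattern.size();
}

// Hypothetical filter: INCLUDE builds the candidate set, EXCLUDE removes from it.
struct TableFilterSketch {
    std::vector<std::string> include;
    std::vector<std::string> exclude;

    bool matches(const std::string& qualified_name) const {
        bool included = false;
        for (const auto& pat : include) {
            if (glob_match(pat, qualified_name)) { included = true; break; }
        }
        if (!included) return false;
        for (const auto& pat : exclude) {
            if (glob_match(pat, qualified_name)) return false;
        }
        return true;
    }
};

inline void check_filter_sketch() {
    TableFilterSketch f;
    f.include = {"db.*"};
    f.exclude = {"db.tmp_*"};
    assert(f.matches("db.orders") && !f.matches("db.tmp_x"));
}
```

A table excluded by any EXCLUDE pattern stays out even if several INCLUDE patterns match it, which mirrors the "removes tables from that included scope" rule stated above.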
Conflict and virtual compute group behavior:
Warm-up progress observation:
- `/api/warmup_event_driven_stats`.
- `/metrics` exposes per-job active warm-up metadata, synchronized size, and trigger gap metrics for cloud event-driven warm-up jobs.
Release note
Support table-level event-driven cloud warm-up with `ON TABLES` filters and per-job warm-up sync statistics.
Check List (For Author)
- Test
- Behavior changed: `WARM UP` supports table-level `ON TABLES` filters for event-driven load warm-up, and warm-up job output/metrics expose table filter, matched tables, SyncStats, and trigger-gap information.
- Does this need documentation?
Check List (For Reviewer who merge this PR)