fix(test): stabilize flaky ReconfigurationIntegrationSpec pause race#5915
fix(test): stabilize flaky ReconfigurationIntegrationSpec pause race#5915Ma77Ball wants to merge 4 commits into
Conversation
… a pause is generated leading to a failed ci
Automated Reviewer SuggestionsBased on the
|
|
| config | throughput | MB/s | latency | max Δ latest / 7d | |
|---|---|---|---|---|---|
| 🔴 | bs=10 sw=10 sl=64 | 367 | 0.224 | 25,388/38,533/38,533 us | 🔴 +14.9% / 🔴 +155.7% |
| 🔴 | bs=100 sw=10 sl=64 | 815 | 0.498 | 120,047/146,191/146,191 us | 🔴 +7.6% / 🔴 +34.3% |
| ⚪ | bs=1000 sw=10 sl=64 | 918 | 0.56 | 1,092,413/1,130,014/1,130,014 us | ⚪ within ±5% / 🔴 +9.0% |
Baseline details
Latest main 7a9730b from same runner
| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
|---|---|---|---|---|---|---|
| bs=10 sw=10 sl=64 | throughput | 367 tuples/sec | 425 tuples/sec | 770.82 tuples/sec | -13.6% | -52.4% |
| bs=10 sw=10 sl=64 | MB/s | 0.224 MB/s | 0.259 MB/s | 0.47 MB/s | -13.5% | -52.4% |
| bs=10 sw=10 sl=64 | p50 | 25,388 us | 22,369 us | 12,723 us | +13.5% | +99.5% |
| bs=10 sw=10 sl=64 | p95 | 38,533 us | 33,531 us | 15,070 us | +14.9% | +155.7% |
| bs=10 sw=10 sl=64 | p99 | 38,533 us | 33,531 us | 18,429 us | +14.9% | +109.1% |
| bs=100 sw=10 sl=64 | throughput | 815 tuples/sec | 836 tuples/sec | 973.75 tuples/sec | -2.5% | -16.3% |
| bs=100 sw=10 sl=64 | MB/s | 0.498 MB/s | 0.51 MB/s | 0.594 MB/s | -2.4% | -16.2% |
| bs=100 sw=10 sl=64 | p50 | 120,047 us | 117,881 us | 102,519 us | +1.8% | +17.1% |
| bs=100 sw=10 sl=64 | p95 | 146,191 us | 135,825 us | 108,855 us | +7.6% | +34.3% |
| bs=100 sw=10 sl=64 | p99 | 146,191 us | 135,825 us | 117,788 us | +7.6% | +24.1% |
| bs=1000 sw=10 sl=64 | throughput | 918 tuples/sec | 931 tuples/sec | 1,004 tuples/sec | -1.4% | -8.6% |
| bs=1000 sw=10 sl=64 | MB/s | 0.56 MB/s | 0.568 MB/s | 0.613 MB/s | -1.4% | -8.6% |
| bs=1000 sw=10 sl=64 | p50 | 1,092,413 us | 1,066,102 us | 1,001,930 us | +2.5% | +9.0% |
| bs=1000 sw=10 sl=64 | p95 | 1,130,014 us | 1,116,531 us | 1,042,923 us | +1.2% | +8.4% |
| bs=1000 sw=10 sl=64 | p99 | 1,130,014 us | 1,116,531 us | 1,074,893 us | +1.2% | +5.1% |
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,545.40,200,128000,367,0.224,25387.73,38533.42,38533.42
1,100,10,64,20,2453.33,2000,1280000,815,0.498,120046.93,146191.26,146191.26
2,1000,10,64,20,21784.77,20000,12800000,918,0.560,1092413.26,1130013.72,1130013.72
❌ 2 Tests Failed:
View the top 2 failed test(s) by shortest run time
To view more test analytics, go to the Test Analytics Dashboard |
|
/request-review @aglinxinyuan |
|
that's not a final fix right? if medium dataset is also processed fast enough, the same issue could happen? |
There was a problem hiding this comment.
Pull request overview
This PR aims to stabilize ReconfigurationIntegrationSpec by ensuring the CSV-backed workflows run long enough for pauseWorkflow to take effect, avoiding a race where workflows can complete before the pause is applied.
Changes:
- Introduces a helper (
boundedCsvSource) to create a CSV scan operator descriptor based on the medium CSV source. - Switches the two CSV-sourced reconfiguration tests to use the new helper instead of
smallCsvScanOpDesc.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| private def boundedCsvSource() = { | ||
| val src = TestOperators.mediumCsvScanOpDesc() | ||
| src.limit = Some(10000) | ||
| src | ||
| } |
What changes were proposed in this PR?
ReconfigurationIntegrationSpecfromsmallCsvScanOpDesc(100 rows) tomediumCsvScanOpDesc(100k rows), so the workflow is still running whenpauseWorkflowlands.PauseHandler.pauseWorkflowemitted COMPLETED instead of PAUSED andTestUtils.shouldReconfigures 10spausedReachedawait timed out.ReconfigurationSpecandPauseSpec, which already usemediumCsvScanOpDescfor the same pause/reconfigure/resume flow.Any related issues, documentation, discussions?
Closes: #5913
How was this PR tested?
sbt "WorkflowExecutionService/Test/compile"compiles cleanly with the change.amber-integrationstack (Python UDF + Postgres/MinIO/Lakekeeper) that CI provisions; could not run locally. Reviewer: run theamber-integrationjob (orsbt "WorkflowExecutionService/testOnly *ReconfigurationIntegrationSpec"in that environment) and confirm all 3 tests pass across repeated runs.workflow pausedlog line for every test run (the flaky run showed fewer pauses than test runs).Was this PR authored or co-authored using generative AI tooling?
Co-authored with Claude Opus 4.8 in compliance with ASF