Skip to content

Improve sqllogicteset speed by creating only a single large file rather than 2#20586

Draft
Tim-53 wants to merge 4 commits intoapache:mainfrom
Tim-53:perf/reuse-parquet-file-push-down-filter-regression
Draft

Improve sqllogicteset speed by creating only a single large file rather than 2#20586
Tim-53 wants to merge 4 commits intoapache:mainfrom
Tim-53:perf/reuse-parquet-file-push-down-filter-regression

Conversation

@Tim-53
Copy link
Contributor

@Tim-53 Tim-53 commented Feb 26, 2026

Draft as it builds on #20576

Which issue does this PR close?

Rationale for this change

Execution time of the test is dominated by the time writing the parquet files. By reusing the file we can gain around 30% improvement on the execution time here.

What changes are included in this PR?

Building on #20576 we reuse the needed parquet file for the test instead of recreating it.

Are these changes tested?

Ran the test with following results:

Baseline (2 files) Optimized (1 file)
Min 33.000s 22.653s
Max 37.662s 25.489s
Avg 34.427s 24.092s

One open question: does the correctness of this regression test rely on having two physically separate files? The race condition in #17197 was in the execution layer — both scans would still be independent DataSourceExec nodes with independent readers, so I believe the behavior is preserved. But if there's any concern, we could use system cp to copy the file and register two physical files while still only paying the generate_series cost once.

Are there any user-facing changes?

kosiew and others added 4 commits February 26, 2026 13:50
- Implement tests for push down filters in outer joins, ensuring filters are applied correctly based on join conditions.
- Introduce tests for push down filters with Parquet files, including scenarios with limits and dynamic filters.
- Add regression tests to address specific issues related to filter pushdown, ensuring stability and correctness.
- Include tests for unnest operations with filters, verifying that filters are pushed down appropriately based on the context.
@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Feb 26, 2026
@alamb alamb changed the title Perf/reuse parquet file push down filter regression Improve sqllogicteset speed by creating only a single large file rather than 2 Feb 27, 2026
@alamb
Copy link
Contributor

alamb commented Feb 27, 2026

Thank you 🙏

I left a note on

Asking the original authors if they could double check

}
}

// trigger ci test
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be removed? Also just in case it's helpful: git commit -m "ci" --allow-empty --no-verify

Copy link
Contributor

@alamb alamb Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I think this is left over from #20566 -- when this PR gets rebased it should be removed)

Copy link
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the test needs two physically distinct files. As long as it's two different execution nodes that should be good enough!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants