refactor: make file-statistics cache keys schema-aware#23201
refactor: make file-statistics cache keys schema-aware#23201Phoenix500526 wants to merge 5 commits into
Conversation
44b33de to
a13e269
Compare
|
I am going to run the |
|
run bechmark wide_schema env:
DATAFUSION_RUNTIME_FILE_STATISTICS_CACHE_LIMIT: 0 |
|
run bechmark wide_schema |
|
run bechmark wide_schema baseline:
env:
DATAFUSION_RUNTIME_FILE_STATISTICS_CACHE_LIMIT: 0 |
This is cool, i did not know this. |
|
run benchmark wide_schema env:
DATAFUSION_RUNTIME_FILE_STATISTICS_CACHE_LIMIT: 0 |
|
run benchmark wide_schema baseline:
env:
DATAFUSION_RUNTIME_FILE_STATISTICS_CACHE_LIMIT: 0 |
|
run benchmark wide_schema |
Yep the idea here is to run baseline w/o cache and this branch w/ cache. Orthogonal to this PR but I want to see how it looks like. Unfortunately I had a typo in benchmark, sorry for the noise. |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing issue/23072 (a13e269) to ff677c4 (merge-base) diff using: wide_schema File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing issue/23072 (a13e269) to ff677c4 (merge-base) diff using: wide_schema File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing issue/23072 (a13e269) to ff677c4 (merge-base) diff using: wide_schema File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagewide_schema — base (merge-base)
wide_schema — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagewide_schema — base (merge-base)
wide_schema — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagewide_schema — base (merge-base)
wide_schema — branch
File an issue against this benchmark runner |
|
Unfortunately we have regressions in the benchmarks: |
|
run benchmark wide_schema baseline:
env:
DATAFUSION_RUNTIME_FILE_STATISTICS_CACHE_LIMIT: 0 |
|
run benchmark wide_schema |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing issue/23072 (9618f88) to ff677c4 (merge-base) diff using: wide_schema File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing issue/23072 (9618f88) to ff677c4 (merge-base) diff using: wide_schema File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagewide_schema — base (merge-base)
wide_schema — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagewide_schema — base (merge-base)
wide_schema — branch
File an issue against this benchmark runner |
File statistics are computed against a specific `file_schema`, but the file-statistics cache was keyed only by table and path. Reading the same path under a different schema could reuse statistics whose `column_statistics` no longer line up, panicking during statistics projection. apache#22950 worked around this by bypassing the cache entirely for anonymous explicit-schema reads, at the cost of losing cache reuse for them. Introduce a `SchemaFingerprint` (per-column name, type and nullability, derived from `file_schema`) and a `FileStatisticsCacheKey { table, path, schema }`, and key the file-statistics cache on it. Different schemas now get distinct entries (no stale cross-schema reuse) while a repeated read of the same schema reuses its entry, so the apache#22950 bypass is removed and anonymous explicit-schema reads cache safely again. - The fingerprint excludes field/schema metadata (cannot affect statistics) and partition columns (their statistics are computed separately). - Table-drop invalidation is unchanged: drop_table_entries matches on CacheKey::table_ref(), which still returns the table, so all schema variants for a table are removed together. - The list-files cache continues to key on TableScopedPath. Closes apache#23072. Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
…apSize Add a `DFHeapSize` impl for 3-tuples (mirroring the existing 2-tuple one) so `Vec<(String, DataType, bool)>` accounts for its heap automatically, letting `SchemaFingerprint::heap_size` delegate to it instead of computing the size by hand. Also update the `test_statistics_cache` unit test to key on `FileStatisticsCacheKey` so it matches the real file-statistics cache. Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
`do_collect_statistics_and_ordering` rebuilt the `SchemaFingerprint` for every file, deep-cloning all column names and types — O(files x schema width) of redundant work, since `file_schema` is constant for a table. Compute the fingerprint once in `ListingTable::try_new` and store it as `Arc<SchemaFingerprint>`; `FileStatisticsCacheKey.schema` now holds the `Arc`, so building a key per file is an O(1) refcount bump instead of a deep clone. `Arc`'s `Eq`/`Hash` compare the inner value, so cache keying remains by schema contents. Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
|
run benchmark clickbench_partitioned |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing issue/23072 (9618f88) to ff677c4 (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
Since it's an api-change, I think it makes sense to add an entry to the upgrade guide in https://github.com/apache/datafusion/blob/main/docs/source/library-user-guide/upgrading/55.0.0.md. |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
run benchmark clickbench_partitioned |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing issue/23072 (9618f88) to ff677c4 (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
It looks like there are regressions in the |
| // fingerprint: reads of the same path under a different schema get | ||
| // their own entry rather than reusing incompatible column statistics. | ||
| // The fingerprint is precomputed once per table (see `try_new`). | ||
| schema: Arc::clone(&self.file_schema_fingerprint), |
There was a problem hiding this comment.
Precalculating the hash for the fingerprint could be a solution to fix the regression. Right now we calculate the hash for the fingerprint for each entry which is expensive.
There was a problem hiding this comment.
Thanks for the comment. I've precalcuated hash for the fingerprint. It seems that only whitelisted users can trigger benchmark. Could you help trigger one?
Hashing a FileStatisticsCacheKey on every cache lookup previously digested the entire file schema (O(schema width)). Store a fixed-seed hash of the fingerprint columns, computed once in from_schema, and feed only that u64 into the map hasher. PartialEq still compares the columns exactly, so a hash collision can never make two different schemas share a cache entry. Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
9618f88 to
a767586
Compare
|
run benchmark wide_schema baseline:
env:
DATAFUSION_RUNTIME_FILE_STATISTICS_CACHE_LIMIT: 0 |
|
Hi @Phoenix500526, thanks for the request (#23201 (comment)). Only whitelisted users can trigger benchmarks. Allowed users: Dandandan, Fokko, Jefffrey, Omega359, Rachelint, adriangb, alamb, asubiotto, brunal, buraksenn, cetra3, codephage2020, coderfender, comphead, erenavsarogullari, etseidl, friendlymatthew, gabotechs, geoffreyclaude, grtlr, haohuaijin, jonathanc-n, kevinjqliu, klion26, kosiew, kumarUjjawal, kunalsinghdadhwal, liamzwbao, mbutrovich, mkleen, mzabaluev, neilconway, rluvaton, sdf-jkl, timsaucer, xudong963, zhuqi-lucas. File an issue against this benchmark runner |
|
run benchmark clickbench_partitioned |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing issue/23072 (a767586) to d58e0c6 (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
run benchmark clickbench_partitioned |
|
run benchmark wide_schema baseline:
env:
DATAFUSION_RUNTIME_FILE_STATISTICS_CACHE_LIMIT: 0 |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing issue/23072 (a767586) to d58e0c6 (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing issue/23072 (a767586) to d58e0c6 (merge-base) diff using: wide_schema File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagewide_schema — base (merge-base)
wide_schema — branch
File an issue against this benchmark runner |
…uide The file-statistics cache key changed from TableScopedPath to a schema-aware FileStatisticsCacheKey. Add an upgrade-guide entry covering the type-alias change and how to migrate custom cache implementations and direct get/put callers. Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
Added |
|
Looks like all regressions are fixed now. TBH this is quite a complicated solution for a problem which would not exist if we simply avoid caching in this case. |
|
@kosiew Do you maybe have time for a second opinion? |
There was a problem hiding this comment.
@Phoenix500526
Thanks for the fix. I think the schema-aware cache key is the right direction, but I think the implementation can be simplified a bit before this lands.
| /// nullability, in order. It deliberately excludes field/schema metadata, which | ||
| /// cannot affect statistics — including it would needlessly fragment the cache. | ||
| #[derive(Clone, Debug)] | ||
| pub struct SchemaFingerprint { |
There was a problem hiding this comment.
The schema-aware key looks correct, and I think this fixes the bug. That said, the implementation feels a bit more involved than this cache path needs.
Could we simplify SchemaFingerprint to a small derived newtype over something like Vec<(String, DataType, bool)>, or an equivalent representation, and rely on derived Hash and Eq? The precomputed hash plus custom PartialEq collision handling adds some cleverness that feels hard to justify here unless profiling shows schema-key hashing is material.
There was a problem hiding this comment.
If you have a look at the benchmarks results clickbench_partitioned and wide_schema you will find that without this optimization there are real regressions.
| fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize { | ||
| self.path.as_ref().heap_size(ctx) | ||
| + self.table.heap_size(ctx) | ||
| + self.schema.as_ref().heap_size(ctx) |
There was a problem hiding this comment.
FileStatisticsCacheKey::heap_size appears to deep-count the shared SchemaFingerprint for every cached file key. Since ListingTable now shares one Arc<SchemaFingerprint> across all files for the same table and schema, this could overstate cache memory for wide schemas with many files and lead to earlier eviction than necessary.
Could we count only the incremental cache-owned cost here, or add a small test that documents the intended accounting tradeoff?
Which issue does this PR close?
Rationale for this change
File statistics are computed against a specific
file_schema(theircolumn_statisticsare positional, one per column), but the file-statisticscache was keyed only by table and path. Reading the same path under a different
schema could therefore reuse statistics whose columns no longer line up,
panicking during statistics projection.
#22950 worked around this by bypassing the file-statistics cache entirely for
anonymous explicit-schema reads — correct, but it gave up cache reuse for them
(every such read recomputes statistics). #23072 asks to make the cache itself
schema-aware so those reads can reuse the cache safely instead of skipping it.
What changes are included in this PR?
SchemaFingerprint— the per-column(name, data_type, nullable)of afile_schema, in order — andFileStatisticsCacheKey { table, path, schema },and key the file-statistics cache on it (
FileStatisticsCacheis nowdyn Cache<FileStatisticsCacheKey, CachedFileMetadata>).ListingTable::do_collect_statistics_and_orderingbuilds the key with thefile_schemafingerprint and uses the shared cache directly. The fix: isolate anonymous file statistics cache #22950bypass (
statistics_cachehelper /schema_source-based skip) is removed:different schemas now land in distinct entries (no stale cross-schema reuse),
while a repeated read of the same schema reuses its entry.
affect statistics, and including it would needlessly fragment the cache) and
partition columns (partition statistics are computed separately, outside this
cache).
drop_table_entriesmatches onCacheKey::table_ref(), which still returns the table, so all schema variantsfor a dropped table are removed together.
TableScopedPath.Are these changes tested?
Yes.
(
anonymous_parquet_stats_cache_with_explicit_wider_schema): the widerexplicit-schema read now lands in its own cache entry (2 entries, was 1 under
the bypass) with correct statistics and no panic, and a repeated read of that
schema is served from the cache (a cache hit, no new entry).
SchemaFingerprint: it distinguishes nullability andfield order, and ignores field/schema metadata.
cargo testfor thefile_statisticsintegration module and thedatafusion-executioncache tests (includingdrop_table_entries) pass, alongwith
cargo fmt --allandcargo clippy --all-targets --all-features -- -D warningsfor the touched crates.Are there any user-facing changes?
No change to query results, physical plans, or the serialized (proto) wire
format; file statistics are computed exactly as before.
One public API change (please add the
api changelabel): theFileStatisticsCachetype alias now usesFileStatisticsCacheKeyinstead ofTableScopedPathas its key. Code that constructed keys for this cache directlymust switch to
FileStatisticsCacheKey.SchemaFingerprintandFileStatisticsCacheKeyare newly public;TableScopedPathremains (still usedby the list-files cache).
cargo-semver-checkswill flag the key-type change,which is expected.