Tracking Issue: Pluggable Compressor #7216
This is a tracking issue for extracting a pluggable compression framework (vortex-compressor) from vortex-btrblocks.
Related: #6872
Motivation
The vortex-btrblocks compressor currently depends on every encoding crate in the workspace, and extension types (Vector, UUID, Tensor, JSON) have no mechanism for type-specific compression. We need to invert the dependency graph so that encoding crates can implement compression schemes without circular dependencies.
Theoretically, there was a way to hack pluggability into the existing compressor without a complete rewrite, but it would not provide the expressiveness needed to support extension types and encodings as first-class citizens.
Design
- A unified `Scheme` trait replaces the old type-specific `IntegerScheme`/`FloatScheme`/`StringScheme` traits and the `IntCode`/`FloatCode`/`StringCode` enums. Schemes are identified by an opaque `SchemeId` (obtained only via `SchemeExt::id()`). The old `Compressor`/`CompressorExt`/`CanonicalCompressor` traits and `IntCompressor`/`FloatCompressor`/`StringCompressor` structs are replaced by a `CascadingCompressor` that selects from a vec of `&'static dyn Scheme`.
- The new `vortex-compressor` crate contains the framework (trait definitions, cascading compressor, stats, sampling) with zero encoding dependencies (other than the built-in ones from `vortex-array`).
- An `ArrayAndStats` bundle replaces the old pattern of passing arrays and stats caches separately. Stats are generated lazily on first access via typed methods (`integer_stats()`, `float_stats()`, `string_stats()`). Each scheme declares any expensive required stats via `stats_options()` (specifically, distinct values and their frequencies via a hash map), and the compressor merges all eligible schemes' options before generating stats, so expensive computations only run when needed.
- `vortex-btrblocks` remains the batteries-included writer, except it now depends on `vortex-compressor` for the compression logic and registers all the encoding-specific schemes (BitPacking, FoR, ALP, FSST, etc.) hosted in `encodings/`.
- Declarative exclusion rules replace manual `new_excludes` vectors. Schemes declare `descendant_exclusions` (push) and `ancestor_exclusions` (pull) to prevent incompatible combinations in the cascade chain. The compressor enforces these automatically, along with self-exclusion (no scheme appears twice in a chain). We do this specifically to avoid a dependency cycle.
- `compress_child` encapsulates cascade budget tracking. Schemes call `compressor.compress_child(array, &ctx, self.id(), child_index)` instead of manually building contexts and calling `compress_canonical`. If the cascade budget is exhausted, the child is returned as-is.
- Decimal and temporal compression are converted from hardcoded `compress_canonical` branches into `Scheme` implementations (`DecimalScheme`, `TemporalScheme`), registered in `ALL_SCHEMES` just like any other scheme.
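A minimal sketch of the shape described above. `Scheme`, `SchemeId`, and `CascadingCompressor` are the names from this issue, but everything else here (the `Toy` scheme, the `Array` stand-in, the exact `select` signature) is invented for illustration and does not match the real crate:

```rust
/// Opaque scheme identifier (the real one is obtained via `SchemeExt::id()`).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct SchemeId(u64);

/// Stand-in for a Vortex array in this sketch.
struct Array(Vec<i64>);

/// Unified trait replacing the old IntegerScheme/FloatScheme/StringScheme split.
trait Scheme: Sync {
    fn id(&self) -> SchemeId;
    /// Schemes that must not appear *below* this one in a cascade (push).
    fn descendant_exclusions(&self) -> &'static [SchemeId] {
        &[]
    }
    /// Schemes that must not appear *above* this one in a cascade (pull).
    fn ancestor_exclusions(&self) -> &'static [SchemeId] {
        &[]
    }
    /// Estimated compression ratio on a sample (bigger is better).
    fn estimate(&self, array: &Array) -> f64;
}

struct CascadingCompressor {
    schemes: Vec<&'static dyn Scheme>,
}

impl CascadingCompressor {
    /// Pick the best eligible scheme given the ancestors already applied,
    /// enforcing self-exclusion plus the declarative exclusion rules.
    fn select(
        &self,
        array: &Array,
        ancestors: &[&'static dyn Scheme],
    ) -> Option<&'static dyn Scheme> {
        self.schemes
            .iter()
            .copied()
            // self-exclusion: no scheme appears twice in a chain
            .filter(|s| !ancestors.iter().any(|a| a.id() == s.id()))
            // push (ancestor forbids descendant) and pull (candidate forbids ancestor)
            .filter(|s| {
                !ancestors.iter().any(|a| {
                    a.descendant_exclusions().contains(&s.id())
                        || s.ancestor_exclusions().contains(&a.id())
                })
            })
            .max_by(|x, y| x.estimate(array).total_cmp(&y.estimate(array)))
    }
}

/// A trivially configurable scheme for the sketch.
struct Toy {
    id: u64,
    ratio: f64,
    no_descendants: &'static [SchemeId],
}

impl Scheme for Toy {
    fn id(&self) -> SchemeId {
        SchemeId(self.id)
    }
    fn descendant_exclusions(&self) -> &'static [SchemeId] {
        self.no_descendants
    }
    fn estimate(&self, _array: &Array) -> f64 {
        self.ratio
    }
}

static A: Toy = Toy { id: 1, ratio: 3.0, no_descendants: &[SchemeId(2)] };
static B: Toy = Toy { id: 2, ratio: 2.0, no_descendants: &[] };

fn main() {
    let compressor = CascadingCompressor { schemes: vec![&A, &B] };
    let array = Array(vec![1, 2, 3]);

    // Top level: A wins on estimated ratio.
    assert_eq!(compressor.select(&array, &[]).unwrap().id(), SchemeId(1));
    // Below A: A is self-excluded and B is excluded as A's descendant.
    assert!(compressor.select(&array, &[&A]).is_none());
}
```

The point of the sketch is the dependency inversion: the compressor only sees `&'static dyn Scheme`, so encoding crates can register schemes without the framework depending on them.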
Note that essentially none of the scheme logic was changed, so the estimation and compression logic is mostly identical to before; the main changes are to the framework around the schemes.
Observations
The current mechanism for stopping a cascade is not very smart: there are 3 levels, and then it simply stops. The old compressor did this purely to keep the search bounded.
With this new compressor and its exclusion system (where the scheme that was just applied is always excluded from future cascades), the entire search space is now bounded, and it is in fact a tree (not even a DAG). If there are cases where we actually want to apply the same scheme to children (say, for some reason I don't understand, dict applied to run-end runs applied to dict codes, though I doubt this is good), we can get around it by creating a new leaf scheme (again, this sounds bad anyway).
So we can be a lot smarter about how we search, because the search space is known when the compressor is constructed. Instead of a hardcoded level cap, we could make decisions based on how expensive a compression scheme is to run. There is a lot of potential for improvement here.
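To see why the space is a finite tree, consider a toy enumeration: once a scheme is applied, it is excluded from its own subtree, so every chain is a sequence of distinct schemes and all chains can be counted at construction time. A sketch (the scheme names are hypothetical placeholders, not the real registry):

```rust
/// Count every possible cascade chain; each chain is one node of the
/// exclusion-bounded search tree. Removing the applied scheme from the
/// candidate set models self-exclusion, which is what makes this finite.
fn count_chains(candidates: &[&str]) -> usize {
    let mut total = 0;
    for i in 0..candidates.len() {
        let mut remaining = candidates.to_vec();
        remaining.remove(i); // self-exclusion for the whole subtree
        total += 1 + count_chains(&remaining);
    }
    total
}

fn main() {
    // With 3 schemes there are 3 + 3*2 + 3*2*1 = 15 chains, so an adaptive
    // strategy could budget effort per chain instead of using a fixed depth cap.
    let total = count_chains(&["dict", "for", "bitpack"]);
    assert_eq!(total, 15);
    println!("search tree has {total} chains");
}
```

Push/pull exclusions only prune this tree further, so the count above is an upper bound that an adaptive search strategy could compute up front.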
Steps / History
- Big refactor: Pluggable Compressor #7018
- Smarter scheme selection; `expected_compression_ratio` should return better info: Rework `Scheme` estimation in compressor #7230
- Add a `ResolvedEstimate` type for delayed estimate computation
- Be smart about delayed estimates
- Hardcode `ConstantScheme` logic into the compressor?
- Replace the hardcoded `MAX_CASCADE = 3` with adaptive cascading that leverages the exclusion-bounded search tree
- Fine-grained stats collection (not all schemes need all stats)
Unresolved Questions
- How much of the `vortex-btrblocks` public API should we preserve via re-exports?
- Should `MAX_CASCADE` be replaced with an adaptive search strategy now that the exclusion system bounds the search space?