Tracking Issue: Pluggable Compressor #7216

@connortsui20

Pluggable Compressor

This is a tracking issue for extracting a pluggable compression framework (vortex-compressor) from vortex-btrblocks.

Related: #6872

Motivation

The vortex-btrblocks compressor currently depends on every encoding crate in the workspace, and extension types (Vector, UUID, Tensor, JSON) have no mechanism for type-specific compression. We need to invert the dependency graph so that encoding crates can implement compression schemes without circular dependencies.

In theory, pluggability could have been hacked into the existing compressor without a complete rewrite, but that approach would not have provided the expressiveness needed to fully support extension types and encodings as first-class citizens.

Design

  • Unified Scheme trait replaces the old type-specific IntegerScheme/FloatScheme/StringScheme traits and IntCode/FloatCode/StringCode enums. Schemes are identified by an opaque SchemeId (obtained only via SchemeExt::id()). The old Compressor/CompressorExt/CanonicalCompressor traits and IntCompressor/FloatCompressor/StringCompressor structs are replaced by a CascadingCompressor that selects from a Vec of &'static dyn Scheme.
  • The new vortex-compressor crate contains the framework (trait definitions, cascading compressor, stats, sampling) with zero encoding dependencies (other than built-in ones from vortex-array).
  • ArrayAndStats bundle replaces the old pattern of passing arrays and stats caches separately. Stats are generated lazily on first access via typed methods (integer_stats(), float_stats(), string_stats()). Each scheme declares any expensive required stats via stats_options() (specifically, distinct values and their frequencies via a hash map), and the compressor merges all eligible schemes' options before generating stats so that expensive computations only run when needed.
  • vortex-btrblocks remains the batteries-included writer, except now it depends on vortex-compressor for the compression logic and registers all encoding-specific schemes (BitPacking, FoR, ALP, FSST, etc.) that we host in encodings/.
  • Declarative exclusion rules replace manual new_excludes vectors. Schemes declare descendant_exclusions (push) and ancestor_exclusions (pull) to prevent incompatible combinations in the cascade chain. The compressor enforces these automatically, along with self-exclusion (no scheme appears twice in a chain), specifically to avoid cycles in the cascade.
  • compress_child encapsulates cascade budget tracking. Schemes call compressor.compress_child(array, &ctx, self.id(), child_index) instead of manually building contexts and calling compress_canonical. If the cascade budget is exhausted, the child is returned as-is.
  • Decimal and temporal compression converted from hardcoded compress_canonical branches into Scheme implementations (DecimalScheme, TemporalScheme), registered in ALL_SCHEMES just like any other scheme.
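
The exclusion machinery above can be sketched in a few lines of Rust. This is a hedged illustration only: `SchemeId` as a string wrapper, the `eligible` helper, and the `Dict`/`RunEnd` stand-ins are all hypothetical, not the actual vortex-compressor API.

```rust
use std::collections::HashSet;

// Hypothetical stand-in: the real SchemeId is opaque and framework-issued.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct SchemeId(&'static str);

trait Scheme {
    fn id(&self) -> SchemeId;
    /// Schemes this one forbids anywhere below it in the cascade (push).
    fn descendant_exclusions(&self) -> &[SchemeId] { &[] }
    /// Schemes this one refuses to appear below (pull).
    fn ancestor_exclusions(&self) -> &[SchemeId] { &[] }
}

struct Dict;
impl Scheme for Dict {
    fn id(&self) -> SchemeId { SchemeId("dict") }
    // Illustrative: never dictionary-encode the outputs of dict itself.
    fn descendant_exclusions(&self) -> &[SchemeId] { &[SchemeId("dict")] }
}

struct RunEnd;
impl Scheme for RunEnd {
    fn id(&self) -> SchemeId { SchemeId("run_end") }
}

/// Schemes still eligible for a child, given the ancestor chain and the
/// merged push exclusions; self-exclusion is enforced unconditionally.
fn eligible(
    all: &[&dyn Scheme],
    ancestors: &[SchemeId],
    pushed: &HashSet<SchemeId>,
) -> Vec<SchemeId> {
    all.iter()
        .filter(|s| !ancestors.contains(&s.id())) // self-exclusion
        .filter(|s| !pushed.contains(&s.id()))    // descendant (push)
        .filter(|s| !s.ancestor_exclusions().iter().any(|a| ancestors.contains(a)))
        .map(|s| s.id())
        .collect()
}

fn main() {
    // After applying Dict, its push exclusion plus self-exclusion leave only RunEnd.
    let pushed: HashSet<SchemeId> = Dict.descendant_exclusions().iter().copied().collect();
    println!("{:?}", eligible(&[&Dict, &RunEnd], &[Dict.id()], &pushed));
}
```

Because every applied scheme excludes at least itself for its own subtree, each step strictly shrinks the eligible set along a chain.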

Note that essentially none of the scheme logic changed: the estimation and compression logic is largely identical to before. The changes are confined to the framework around the schemes.
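
The lazy-stats side of the ArrayAndStats bundle can be sketched with a `OnceCell`. A minimal sketch under stated assumptions: the `Array` stand-in, the stat fields, and the struct layout are hypothetical; only the `integer_stats()` method name and the compute-on-first-access behavior come from the description above.

```rust
use std::cell::OnceCell;

// Hypothetical stand-in for a real vortex array.
struct Array {
    values: Vec<i64>,
}

#[derive(Debug, PartialEq)]
struct IntegerStats {
    min: i64,
    max: i64,
}

// Bundles an array with its lazily computed stats cache.
struct ArrayAndStats {
    array: Array,
    integer_stats: OnceCell<IntegerStats>,
}

impl ArrayAndStats {
    fn new(array: Array) -> Self {
        Self { array, integer_stats: OnceCell::new() }
    }

    /// Computes integer stats on first access; later calls hit the cache,
    /// so schemes that never ask for these stats pay nothing.
    fn integer_stats(&self) -> &IntegerStats {
        self.integer_stats.get_or_init(|| IntegerStats {
            min: self.array.values.iter().copied().min().unwrap_or(0),
            max: self.array.values.iter().copied().max().unwrap_or(0),
        })
    }
}

fn main() {
    let a = ArrayAndStats::new(Array { values: vec![3, 1, 7] });
    println!("{:?}", a.integer_stats()); // computed here, cached afterwards
}
```

Expensive stats (such as the distinct-value hash map) would follow the same pattern, gated additionally by the merged `stats_options()` of the eligible schemes.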

Observations

The current mechanism for stopping the cascade is not very smart: it cuts off after a hardcoded 3 levels. The old compressor used this limit simply to keep the search bounded.

With the new compressor and its exclusion system (a scheme is always excluded from the cascades below the point where it was applied), the entire search space is bounded, and it is in fact a tree (not even a DAG). If there is ever a case where we genuinely want to apply the same scheme to a descendant (say, for some reason I don't understand, dict applied to run-end runs applied to dict codes, though I doubt that this is good), we can work around it by creating a new leaf scheme (though again, this sounds bad anyway).

So we can be a lot smarter about deciding how to search, because the search space is known when the compressor is constructed. Instead of a hardcoded depth limit, we could make decisions based on how expensive each compression scheme is to run. There is a lot of potential for improvement here.
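
To make the "bounded tree" claim concrete, here is a small sketch (names are illustrative, not part of vortex) that enumerates cascade chains under self-exclusion alone: since no chain may repeat a scheme, a depth-first walk over the chains always terminates, and the whole tree can be sized up front when the compressor is constructed.

```rust
/// Counts all cascade chains (including the empty chain) that can be formed
/// from `all` without repeating a scheme, mimicking self-exclusion.
/// Push/pull exclusions would only prune this tree further.
fn count_chains<'a>(all: &[&'a str], chain: &mut Vec<&'a str>) -> usize {
    let mut n = 1; // the current chain is itself one valid outcome
    for &s in all {
        if !chain.contains(&s) {
            chain.push(s);
            n += count_chains(all, chain);
            chain.pop();
        }
    }
    n
}

fn main() {
    // With 3 schemes: 1 empty + 3 length-1 + 6 length-2 + 6 length-3 chains.
    println!("{}", count_chains(&["a", "b", "c"], &mut Vec::new()));
}
```

A real implementation could weight this enumeration by per-scheme cost estimates instead of just counting, which is the "be smarter than a hardcoded level" direction suggested above.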

Steps / History

  • Big refactor: Pluggable Compressor #7018
  • Smarter scheme selection, expected_compression_ratio should return better info: Rework Scheme estimation in compressor #7230
  • Add a ResolvedEstimate type for delayed estimate computation
  • Be smart about delayed estimates
  • Hardcode ConstantScheme logic into the compressor?
  • Replace hardcoded MAX_CASCADE = 3 with adaptive cascading that leverages the exclusion-bounded search tree
  • Fine-grained stats collection (not all schemes need all stats)

Unresolved Questions

  • How much of the vortex-btrblocks public API should we preserve via re-exports?
  • Should MAX_CASCADE be replaced with an adaptive search strategy now that the exclusion system bounds the search space?
