Tracking Issue: Pluggable Compressor #7216

@connortsui20

Pluggable Compressor

This is a tracking issue for extracting a pluggable compression framework (vortex-compressor) from vortex-btrblocks.

Related: #6872

Motivation

The vortex-btrblocks compressor currently depends on every encoding crate in the workspace, and extension types (Vector, UUID, Tensor, JSON) have no mechanism for type-specific compression. We need to invert the dependency graph so that encoding crates can implement compression schemes without circular dependencies.

In theory, pluggability could have been hacked into the existing compressor without a complete rewrite, but that approach would not have provided the expressiveness needed to fully support extension types and encodings as first-class citizens.

Design

  • Unified Scheme trait replaces the old type-specific IntegerScheme/FloatScheme/StringScheme traits and IntCode/FloatCode/StringCode enums. Schemes are identified by an opaque SchemeId (obtained only via SchemeExt::id()). The old Compressor/CompressorExt/CanonicalCompressor traits and IntCompressor/FloatCompressor/StringCompressor structs are replaced by a CascadingCompressor that selects from a Vec of &'static dyn Scheme.
  • The new vortex-compressor crate contains the framework (trait definitions, cascading compressor, stats, sampling) with zero encoding dependencies (other than built-in ones from vortex-array).
  • ArrayAndStats bundle replaces the old pattern of passing arrays and stats caches separately. Stats are generated lazily on first access via typed methods (integer_stats(), float_stats(), string_stats()). Each scheme declares any expensive required stats via stats_options() (specifically, distinct values and their frequencies via a hash map), and the compressor merges all eligible schemes' options before generating stats so that expensive computations only run when needed.
  • vortex-btrblocks remains the batteries-included writer, except now it depends on vortex-compressor for the compression logic and registers all encoding-specific schemes (BitPacking, FoR, ALP, FSST, etc.) that we host in encodings/.
  • Declarative exclusion rules replace manual new_excludes vectors. Schemes declare descendant_exclusions (push) and ancestor_exclusions (pull) to prevent incompatible combinations in the cascade chain. The compressor enforces these automatically, along with self-exclusion (no scheme appears twice in a chain), specifically to avoid cycles in the cascade.
  • compress_child encapsulates cascade budget tracking. Schemes call compressor.compress_child(array, &ctx, self.id(), child_index) instead of manually building contexts and calling compress_canonical. If the cascade budget is exhausted, the child is returned as-is.
  • Decimal and temporal compression converted from hardcoded compress_canonical branches into Scheme implementations (DecimalScheme, TemporalScheme), registered in ALL_SCHEMES just like any other scheme.
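
The exclusion machinery above can be sketched in a few lines of Rust. This is a hedged illustration only: `SchemeId` as a string wrapper, the `eligible` helper, and the `Dict`/`RunEnd` stand-ins are all hypothetical, not the actual vortex-compressor API.

```rust
use std::collections::HashSet;

// Hypothetical stand-in: the real SchemeId is opaque and framework-issued.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct SchemeId(&'static str);

trait Scheme {
    fn id(&self) -> SchemeId;
    /// Schemes this one forbids anywhere below it in the cascade (push).
    fn descendant_exclusions(&self) -> &[SchemeId] { &[] }
    /// Schemes this one refuses to appear below (pull).
    fn ancestor_exclusions(&self) -> &[SchemeId] { &[] }
}

struct Dict;
impl Scheme for Dict {
    fn id(&self) -> SchemeId { SchemeId("dict") }
    // Illustrative: never dictionary-encode the outputs of dict itself.
    fn descendant_exclusions(&self) -> &[SchemeId] { &[SchemeId("dict")] }
}

struct RunEnd;
impl Scheme for RunEnd {
    fn id(&self) -> SchemeId { SchemeId("run_end") }
}

/// Schemes still eligible for a child, given the ancestor chain and the
/// merged push exclusions; self-exclusion is enforced unconditionally.
fn eligible(
    all: &[&dyn Scheme],
    ancestors: &[SchemeId],
    pushed: &HashSet<SchemeId>,
) -> Vec<SchemeId> {
    all.iter()
        .filter(|s| !ancestors.contains(&s.id())) // self-exclusion
        .filter(|s| !pushed.contains(&s.id()))    // descendant (push)
        .filter(|s| !s.ancestor_exclusions().iter().any(|a| ancestors.contains(a)))
        .map(|s| s.id())
        .collect()
}

fn main() {
    // After applying Dict, its push exclusion plus self-exclusion leave only RunEnd.
    let pushed: HashSet<SchemeId> = Dict.descendant_exclusions().iter().copied().collect();
    println!("{:?}", eligible(&[&Dict, &RunEnd], &[Dict.id()], &pushed));
}
```

Because every applied scheme excludes at least itself for its own subtree, each step strictly shrinks the eligible set along a chain.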

Note that essentially none of the scheme logic changed: the estimation and compression logic is largely identical to before. The changes are confined to the framework around the schemes.
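
The lazy-stats side of the ArrayAndStats bundle can be sketched with a `OnceCell`. A minimal sketch under stated assumptions: the `Array` stand-in, the stat fields, and the struct layout are hypothetical; only the `integer_stats()` method name and the compute-on-first-access behavior come from the description above.

```rust
use std::cell::OnceCell;

// Hypothetical stand-in for a real vortex array.
struct Array {
    values: Vec<i64>,
}

#[derive(Debug, PartialEq)]
struct IntegerStats {
    min: i64,
    max: i64,
}

// Bundles an array with its lazily computed stats cache.
struct ArrayAndStats {
    array: Array,
    integer_stats: OnceCell<IntegerStats>,
}

impl ArrayAndStats {
    fn new(array: Array) -> Self {
        Self { array, integer_stats: OnceCell::new() }
    }

    /// Computes integer stats on first access; later calls hit the cache,
    /// so schemes that never ask for these stats pay nothing.
    fn integer_stats(&self) -> &IntegerStats {
        self.integer_stats.get_or_init(|| IntegerStats {
            min: self.array.values.iter().copied().min().unwrap_or(0),
            max: self.array.values.iter().copied().max().unwrap_or(0),
        })
    }
}

fn main() {
    let a = ArrayAndStats::new(Array { values: vec![3, 1, 7] });
    println!("{:?}", a.integer_stats()); // computed here, cached afterwards
}
```

Expensive stats (such as the distinct-value hash map) would follow the same pattern, gated additionally by the merged `stats_options()` of the eligible schemes.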

Observations

The current mechanism for stopping the cascade is not very smart: it cuts off after a hardcoded 3 levels. The old compressor used this limit simply to keep the search bounded.

With the new compressor and its exclusion system (a scheme is always excluded from the cascades below the point where it was applied), the entire search space is bounded, and it is in fact a tree (not even a DAG). If there is ever a case where we genuinely want to apply the same scheme to a descendant (say, for some reason I don't understand, dict applied to run-end runs applied to dict codes, though I doubt that this is good), we can work around it by creating a new leaf scheme (though again, this sounds bad anyway).

So we can be a lot smarter about deciding how to search, because the search space is known when the compressor is constructed. Instead of a hardcoded depth limit, we could make decisions based on how expensive each compression scheme is to run. There is a lot of potential for improvement here.
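
To make the "bounded tree" claim concrete, here is a small sketch (names are illustrative, not part of vortex) that enumerates cascade chains under self-exclusion alone: since no chain may repeat a scheme, a depth-first walk over the chains always terminates, and the whole tree can be sized up front when the compressor is constructed.

```rust
/// Counts all cascade chains (including the empty chain) that can be formed
/// from `all` without repeating a scheme, mimicking self-exclusion.
/// Push/pull exclusions would only prune this tree further.
fn count_chains<'a>(all: &[&'a str], chain: &mut Vec<&'a str>) -> usize {
    let mut n = 1; // the current chain is itself one valid outcome
    for &s in all {
        if !chain.contains(&s) {
            chain.push(s);
            n += count_chains(all, chain);
            chain.pop();
        }
    }
    n
}

fn main() {
    // With 3 schemes: 1 empty + 3 length-1 + 6 length-2 + 6 length-3 chains.
    println!("{}", count_chains(&["a", "b", "c"], &mut Vec::new()));
}
```

A real implementation could weight this enumeration by per-scheme cost estimates instead of just counting, which is the "be smarter than a hardcoded level" direction suggested above.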

Steps / History

  • Big refactor: Pluggable Compressor #7018
  • Smarter scheme selection, expected_compression_ratio should return better info: Rework Scheme estimation in compressor #7230
  • Add a ResolvedEstimate type for delayed estimate computation
  • Be smart about delayed estimates
  • Hardcode ConstantScheme logic into the compressor?
  • Replace hardcoded MAX_CASCADE = 3 with adaptive cascading that leverages the exclusion-bounded search tree
  • Fine-grained stats collection (not all schemes need all stats)

Unresolved Questions

  • How much of the vortex-btrblocks public API should we preserve via re-exports?
  • Should MAX_CASCADE be replaced with an adaptive search strategy now that the exclusion system bounds the search space?
