Add between pushdown kernel for DecimalByteParts #8097
DecimalByteParts already pushed `compare` against a constant down to its numeric MSP child. This adds the symmetric `between` kernel so that bounded-range predicates are evaluated directly on the compact MSP representation instead of canonicalizing to a wide DecimalArray. Both bounds are converted to the MSP's physical integer type and the comparison is delegated to the MSP's own `between`. When a bound falls outside the MSP's integer range the kernel falls back to the canonical decimal `between`, which already handles the overflow directions. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
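As a rough illustration of the bound conversion and fallback described above, here is a minimal sketch assuming an `i32` MSP held as a plain slice. The function name `between_i32_msp` and the `Vec<bool>` result are hypothetical simplifications for illustration, not the Vortex API, and nulls are ignored.

```rust
/// Hypothetical sketch: inclusive `between` over a narrow i32 MSP.
/// The bounds arrive as unscaled i128 decimal values.
fn between_i32_msp(msp: &[i32], lower: i128, upper: i128) -> Vec<bool> {
    match (i32::try_from(lower), i32::try_from(upper)) {
        // Both bounds fit the MSP's physical type: compare at native width.
        (Ok(lo), Ok(hi)) => msp.iter().map(|&v| lo <= v && v <= hi).collect(),
        // A bound is out of range: fall back to a wide comparison,
        // standing in for the canonical decimal `between`.
        _ => msp
            .iter()
            .map(|&v| {
                let wide = i128::from(v);
                lower <= wide && wide <= upper
            })
            .collect(),
    }
}
```

The fallback arm is deliberately the simplest correct behaviour; the real kernel delegates to the canonical path rather than widening inline.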
Merging this PR will not alter performance
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| 🆕 | Simulation | vortex_byteparts_i32[131072] | N/A | 361.4 µs | N/A |
| 🆕 | Simulation | vortex_byteparts_twolimb[131072] | N/A | 1.4 ms | N/A |
| 🆕 | Simulation | vortex_byteparts_twolimb[65536] | N/A | 726 µs | N/A |
| 🆕 | Simulation | vortex_canonical_i128[131072] | N/A | 2.9 ms | N/A |
| 🆕 | Simulation | vortex_canonical_i128[65536] | N/A | 768.7 µs | N/A |
| 🆕 | Simulation | vortex_byteparts_i32[65536] | N/A | 210.9 µs | N/A |
| 🆕 | Simulation | vortex_byteparts_twolimb[65536] | N/A | 1.1 ms | N/A |
| 🆕 | Simulation | arrow_decimal128[131072] | N/A | 1.9 ms | N/A |
| 🆕 | Simulation | vortex_byteparts_i32[65536] | N/A | 207.1 µs | N/A |
| 🆕 | Simulation | vortex_byteparts_twolimb[131072] | N/A | 2.2 ms | N/A |
| 🆕 | Simulation | vortex_canonical_i128[131072] | N/A | 1.5 ms | N/A |
| 🆕 | Simulation | vortex_byteparts_i32[131072] | N/A | 378.2 µs | N/A |
| 🆕 | Simulation | vortex_byteparts_i64[131072] | N/A | 597.1 µs | N/A |
| 🆕 | Simulation | arrow_decimal128[65536] | N/A | 981.2 µs | N/A |
| 🆕 | Simulation | vortex_byteparts_i64[65536] | N/A | 316.5 µs | N/A |
| 🆕 | Simulation | arrow_decimal128[131072] | N/A | 1.4 ms | N/A |
| 🆕 | Simulation | vortex_byteparts_i64[65536] | N/A | 327.2 µs | N/A |
| 🆕 | Simulation | vortex_byteparts_i64[131072] | N/A | 591.7 µs | N/A |
| 🆕 | Simulation | arrow_decimal128[65536] | N/A | 717.5 µs | N/A |
| 🆕 | Simulation | vortex_canonical_i128[65536] | N/A | 1.5 ms | N/A |
Comparing claude/decimal-numeric-comparison-6W0Mt (990d2f6) with develop (6ddc4d5)
Benchmarks `between` over DecimalByteParts (i32/i64 MSP pushdown), the canonical i128 DecimalArray, and arrow-rs Decimal128. Demonstrates that pushing the comparison down to a narrow MSP beats arrow-rs (~2.4-2.7x at 1M rows), since arrow has no decimal storage narrower than 128 bits. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Drop the 1M-row case (arrow/canonical exceeded 1ms there) and use 64k/128k rows, keeping all four engines under a 1ms-per-op budget while preserving the cross-engine comparison. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…ByteParts Implements the previously-scaffolded second limb: a decimal i128 can now be stored as a signed i64 high limb (msp, carrying validity) plus a non-nullable u64 low limb, reconstructed as (msp as i128) << 64 | low. Wires the lower limb through the constructor, validation, serde, canonicalization, scalar_at, and slice/filter/take/mask/cast; compare and is_constant defer to the canonical path for two-limb arrays. Adds a lexicographic two-limb `between` pushdown that compares the limbs with native-width integer ops. Correctness is verified against the canonical i128 implementation across strictness modes and nulls, plus two-limb consistency cases. Benchmark note: the pushdown composes generic array ops (~11 passes with intermediate Bool allocations) and is currently slower than both arrow-rs Decimal128 (~2.7x) and the canonical i128 path on wide values. Beating arrow requires a fused single-pass SIMD kernel rather than composed expressions; the benchmark (decimal_between, wide cases) documents the current state. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
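The limb layout in this commit message can be sketched in a few lines. The helper names `split_limbs` and `reconstruct` are hypothetical; only the formula `(msp as i128) << 64 | low` comes from the message itself.

```rust
/// Split an i128 into a signed high limb and an unsigned low limb.
fn split_limbs(value: i128) -> (i64, u64) {
    // The arithmetic shift keeps the sign in the high limb;
    // the `as u64` cast truncates to the low 64 bits.
    ((value >> 64) as i64, value as u64)
}

/// Rebuild the i128 as `(msp as i128) << 64 | low`.
fn reconstruct(high: i64, low: u64) -> i128 {
    // `high` is sign-extended, `low` is zero-extended, so the OR is lossless.
    ((high as i128) << 64) | (low as i128)
}
```

Because the high limb alone carries the sign, validity can live on it while the low limb stays a non-nullable `u64`, which matches the layout described above.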
The previous two-limb `between` composed ~11 generic array ops, each a full pass allocating an intermediate Bool array, which was slower than both arrow and the canonical i128 path. This replaces it with a single fused loop that materializes the two limbs once and computes the lexicographic comparison per row with native-width (i64/u64) integer ops via BitBuffer::collect_bool, using branch-free bitwise combines so the body vectorizes. On wide i128 values this cuts two-limb `between` from ~605us to ~267us at 131k rows, making it ~1.8x faster than the canonical i128 path. Correctness is unchanged and still verified against the canonical implementation. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
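A minimal sketch of the fused lexicographic comparison, assuming the limbs are already materialized as plain slices. The name `between_lexicographic` is hypothetical, and a `Vec<bool>` stands in for the packed output that the commit builds with `BitBuffer::collect_bool`.

```rust
/// Hypothetical sketch: inclusive two-limb `between` in a single pass.
fn between_lexicographic(high: &[i64], low: &[u64], lower: i128, upper: i128) -> Vec<bool> {
    assert_eq!(high.len(), low.len());
    // Split each bound once, outside the loop.
    let (lh, ll) = ((lower >> 64) as i64, lower as u64);
    let (uh, ul) = ((upper >> 64) as i64, upper as u64);
    high.iter()
        .zip(low)
        .map(|(&h, &l)| {
            // Non-short-circuiting `|` and `&` keep the body branch-free.
            let ge_lower = (h > lh) | ((h == lh) & (l >= ll));
            let le_upper = (h < uh) | ((h == uh) & (l <= ul));
            ge_lower & le_upper
        })
        .collect()
}
```

Each row costs six native-width compares plus the bitwise combine, with no intermediate arrays between the two bound checks.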
The hand-rolled lexicographic comparison (6 native compares + bitwise combine per row) did more per-element work than a single hardware i128 compare, leaving two-limb between slower than arrow. Reconstructing the i128 from its signed-high / unsigned-low limbs and comparing it directly is ~2 instructions per bound, and the single fused pass reads each limb once where arrow reads its i128 array twice (one pass each for gt_eq/lt_eq). At the benchmarked cache-resident sizes this makes two-limb between faster than arrow's Decimal128 (~1.1-1.2x) and ~1.9x faster than the canonical i128 path, while staying under the 1ms-per-op budget. Correctness is unchanged and still verified against the canonical implementation. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
The two-limb representation previously deferred all `compare` operators to the canonical i128 path. Reuse the fused reconstruct approach from `between`: a single pass rebuilds each i128 from its (signed high, unsigned low) limbs and applies the comparison, reading each limb once. This specializes all six operators (eq/neq/lt/lte/gt/gte), not just lt, since they share one loop. Factor the shared limb materialization, i128 reconstruct, and i128 comparator helpers into a `two_limb` module used by both `between` and `compare`. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Benchmarks single-sided `x < threshold` across the i32/i64 MSP pushdown, the two-limb i128 kernel, the canonical i128 path, and arrow-rs cmp::lt. Unlike between, lt is a single pass for every engine, which isolates the cost of the two-limb reconstruct against a contiguous i128 read. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Isolates the kernel cost of wide-i128 `lt` with no Vortex expression overhead. arrow's i128 has no vector compare on any x86, so it is inherently scalar (raw_i128_scalar tracks arrow). The (i64 high, u64 low) limbs are native SIMD widths: an AVX-512 kernel compares 8 lanes with vpcmpq/vpcmpuq straight to a __mmask8, which is exactly the packed-bit output, beating arrow ~1.4-2.1x at the measured sizes. The AVX-512 output is cross-checked against the scalar reference before timing. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
The two-limb compare/between kernels now dispatch to an AVX-512 path when the host supports it, falling back to the scalar i128-reconstruct path otherwise. The AVX-512 kernel compares 8 lanes of the signed-high i64 and unsigned-low u64 limbs with vpcmpq/vpcmpuq, combining them lexicographically into a __mmask8 that is written directly as one byte of the output bitmap, with no serial bit-packing. arrow's i128 has no vector-compare form on x86, so this is a comparison the limb representation can vectorize and arrow cannot. Dispatch follows the existing take/avx2 pattern (is_x86_feature_detected gate). Operators are monomorphized via const generics so the hot loop carries no operator branch. Correctness is validated against the canonical implementation including a 99-element (non-multiple-of-8) two-limb case that exercises both the vectorized main loop and the scalar tail. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
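The overall shape of this kernel (operator fixed by a const generic, eight lanes per output byte, scalar tail) can be sketched portably. This is not the AVX-512 code: the inner eight-lane loop merely stands in for the `vpcmpq`/`vpcmpuq` mask, and the operator constants, `cmp_limbs`, and `compare_packed` are hypothetical names.

```rust
// Hypothetical operator tags, used as const generic arguments.
const EQ: u8 = 0;
const NEQ: u8 = 1;
const LT: u8 = 2;
const LTE: u8 = 3;
const GT: u8 = 4;
const GTE: u8 = 5;

/// Lexicographic limb comparison; `OP` is resolved at compile time,
/// so each instantiation carries no runtime operator branch.
#[inline(always)]
fn cmp_limbs<const OP: u8>(h: i64, l: u64, bh: i64, bl: u64) -> bool {
    let lt = (h < bh) | ((h == bh) & (l < bl));
    let eq = (h == bh) & (l == bl);
    match OP {
        EQ => eq,
        NEQ => !eq,
        LT => lt,
        LTE => lt | eq,
        GT => !(lt | eq),
        _ => !lt, // GTE
    }
}

/// Compare every row against `bound`, returning an LSB-first packed bitmap.
fn compare_packed<const OP: u8>(high: &[i64], low: &[u64], bound: i128) -> Vec<u8> {
    assert_eq!(high.len(), low.len());
    let (bh, bl) = ((bound >> 64) as i64, bound as u64);
    let n = high.len();
    let mut out = vec![0u8; (n + 7) / 8];
    let full = n / 8 * 8;
    // Main loop: eight lanes produce exactly one output byte.
    for base in (0..full).step_by(8) {
        let mut byte = 0u8;
        for lane in 0..8 {
            let hit = cmp_limbs::<OP>(high[base + lane], low[base + lane], bh, bl);
            byte |= (hit as u8) << lane;
        }
        out[base / 8] = byte;
    }
    // Scalar tail for lengths that are not a multiple of eight.
    for i in full..n {
        let hit = cmp_limbs::<OP>(high[i], low[i], bh, bl);
        out[i / 8] |= (hit as u8) << (i % 8);
    }
    out
}
```

A call such as `compare_packed::<LT>(&high, &low, bound)` picks the operator at compile time. A 99-element input exercises twelve full groups plus a three-row tail, the same split the commit's non-multiple-of-8 test targets.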
Benchmarks: PolarSignals Profiling
Vortex (geomean): 0.956x ➖
datafusion / vortex-file-compressed (0.956x ➖, 3↑ 0↓)
File Sizes: PolarSignals Profiling
No file size changes detected.
Benchmarks: FineWeb NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.026x ➖, 0↑ 1↓)
datafusion / vortex-compact (1.019x ➖, 0↑ 1↓)
datafusion / parquet (1.024x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.035x ➖, 0↑ 1↓)
duckdb / vortex-compact (0.997x ➖, 1↑ 1↓)
duckdb / parquet (1.060x ➖, 0↑ 2↓)
File Sizes: FineWeb NVMe
No file size changes detected.
Benchmarks: TPC-H SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.004x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.987x ➖, 0↑ 0↓)
datafusion / parquet (0.992x ➖, 1↑ 1↓)
datafusion / arrow (1.009x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (1.007x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.001x ➖, 0↑ 0↓)
duckdb / parquet (1.002x ➖, 1↑ 0↓)
duckdb / duckdb (0.996x ➖, 0↑ 0↓)
File Sizes: TPC-H SF=1 on NVME
No file size changes detected.
Benchmarks: TPC-DS SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.023x ➖, 1↑ 2↓)
datafusion / vortex-compact (1.019x ➖, 0↑ 5↓)
datafusion / parquet (0.995x ➖, 2↑ 1↓)
duckdb / vortex-file-compressed (1.012x ➖, 1↑ 5↓)
duckdb / vortex-compact (0.974x ➖, 5↑ 0↓)
duckdb / parquet (1.020x ➖, 0↑ 5↓)
duckdb / duckdb (0.971x ➖, 0↑ 1↓)
File Sizes: TPC-DS SF=1 on NVME
No file size changes detected.
Benchmarks: FineWeb S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (0.944x ➖, 1↑ 0↓)
datafusion / vortex-compact (1.024x ➖, 0↑ 0↓)
datafusion / parquet (0.931x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.951x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.051x ➖, 0↑ 1↓)
duckdb / parquet (0.933x ➖, 0↑ 0↓)
Benchmarks: Statistical and Population Genetics
Verdict: No clear signal (low confidence)
duckdb / vortex-file-compressed (0.965x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.976x ➖, 0↑ 0↓)
duckdb / parquet (0.986x ➖, 0↑ 0↓)
File Sizes: Statistical and Population Genetics
No file size changes detected.
Benchmarks: TPC-H SF=10 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.952x ➖, 1↑ 0↓)
datafusion / vortex-compact (0.952x ➖, 1↑ 0↓)
datafusion / parquet (0.963x ➖, 0↑ 0↓)
datafusion / arrow (0.961x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (0.987x ➖, 1↑ 0↓)
duckdb / vortex-compact (1.082x ➖, 0↑ 11↓)
duckdb / parquet (1.002x ➖, 0↑ 0↓)
duckdb / duckdb (1.012x ➖, 0↑ 0↓)
File Sizes: TPC-H SF=10 on NVME
No file size changes detected.
Audit of the added benches:
- Remove limb_simd: it duplicated the AVX-512 kernel that now lives in two_limb.rs (drift risk), benchmarked a lexicographic-scalar kernel that ships nowhere, and its 1<<20 case exceeded the 1ms-per-op budget. The productized path is now measured by decimal_lt/decimal_between.
- Drop the `_wide` arrow and canonical baselines from both files: an i128 comparison's cost is independent of values and precision/scale, so they were provably identical to their narrow counterparts (measured 51.4 vs 51.4us). One arrow + one canonical i128 baseline now serves both narrow and wide.
Every remaining case stays well under 1ms. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Benchmarks: Clickbench on NVME
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (0.919x ➖, 11↑ 0↓)
datafusion / parquet (0.956x ➖, 8↑ 0↓)
duckdb / vortex-file-compressed (0.941x ➖, 15↑ 4↓)
duckdb / parquet (0.944x ➖, 13↑ 0↓)
duckdb / duckdb (1.017x ➖, 6↑ 6↓)
File Sizes: Clickbench on NVME
File Size Changes (1 files changed, -0.0% overall, 0↑ 1↓)
Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (0.835x ➖, 4↑ 1↓)
datafusion / vortex-compact (0.857x ➖, 2↑ 0↓)
datafusion / parquet (0.755x ➖, 7↑ 0↓)
duckdb / vortex-file-compressed (0.982x ➖, 0↑ 1↓)
duckdb / vortex-compact (0.942x ➖, 0↑ 0↓)
duckdb / parquet (0.982x ➖, 0↑ 0↓)
Benchmarks: Random Access
Vortex (geomean): 1.045x ➖
unknown / unknown (1.030x ➖, 0↑ 4↓)
Benchmarks: Compression
Vortex (geomean): 0.989x ➖
unknown / unknown (0.981x ➖, 2↑ 0↓)
Benchmarks: TPC-H SF=10 on S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (0.862x ➖, 3↑ 0↓)
datafusion / vortex-compact (0.799x ➖, 5↑ 0↓)
datafusion / parquet (0.928x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (0.964x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.954x ➖, 0↑ 0↓)
duckdb / parquet (0.917x ➖, 1↑ 0↓)
Both kernels duplicated the same materialize-limbs / combine-validity / wrap-as-BoolArray boilerplate. Factor it into `two_limb::eval`, which takes a closure mapping the high/low limb slices to a packed BitBuffer, so `compare` and `between` reduce to extracting the bound(s) and calling it. Drop the now-internal `materialize_limbs`/`reconstruct` from the crate-visible surface. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
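The refactor's shape can be sketched as follows, assuming limbs and validity are plain slices. This `eval` is a hypothetical stand-in for `two_limb::eval`: its signature, the `Vec<Option<bool>>` result (in place of a BoolArray with validity), and the `lt_two_limb`/`between_two_limb` wrappers are illustrative assumptions only.

```rust
/// Rebuild an i128 from its signed-high / unsigned-low limbs.
fn rebuild(high: i64, low: u64) -> i128 {
    ((high as i128) << 64) | (low as i128)
}

/// Hypothetical shared driver: run `kernel` over the limb slices,
/// then mask the result with validity (None marks a null row).
fn eval<F>(high: &[i64], low: &[u64], validity: Option<&[bool]>, kernel: F) -> Vec<Option<bool>>
where
    F: FnOnce(&[i64], &[u64]) -> Vec<bool>,
{
    assert_eq!(high.len(), low.len());
    let bits = kernel(high, low);
    bits.into_iter()
        .enumerate()
        .map(|(i, b)| match validity {
            Some(valid) if !valid[i] => None,
            _ => Some(b),
        })
        .collect()
}

/// `compare` reduces to extracting the bound and calling `eval`.
fn lt_two_limb(high: &[i64], low: &[u64], validity: Option<&[bool]>, bound: i128) -> Vec<Option<bool>> {
    eval(high, low, validity, |h, l| {
        h.iter().zip(l).map(|(&h, &l)| rebuild(h, l) < bound).collect()
    })
}

/// `between` differs only in the closure body.
fn between_two_limb(
    high: &[i64],
    low: &[u64],
    validity: Option<&[bool]>,
    lower: i128,
    upper: i128,
) -> Vec<Option<bool>> {
    eval(high, low, validity, |h, l| {
        h.iter()
            .zip(l)
            .map(|(&h, &l)| {
                let v = rebuild(h, l);
                (lower <= v) & (v <= upper)
            })
            .collect()
    })
}
```

Keeping the validity handling in one place means a new predicate only has to supply the per-row closure.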
The mod/compare/between test modules each duplicated the same i128 limb-split construction. Hoist it to a single `#[cfg(test)] two_limb_array` in two_limb.rs; the two thin local adapters just add `.into_array()` / a default validity. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>