Skip to content

feat: Count and Mean aggregates#7201

Draft
blaginin wants to merge 5 commits intodevelopfrom
db/count-and-mean
Draft

feat: Count and Mean aggregates#7201
blaginin wants to merge 5 commits intodevelopfrom
db/count-and-mean

Conversation

@blaginin
Copy link
Copy Markdown
Member

@blaginin blaginin commented Mar 29, 2026

No description provided.

Signed-off-by: blaginin <dima@spiraldb.com>
Co-authored-by: Claude <noreply@anthropic.com>
@blaginin blaginin self-assigned this Mar 29, 2026
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: blaginin <dima@spiraldb.com>
/// Return the count of non-null elements in an array.
///
/// See [`Count`] for details.
pub fn count(array: &ArrayRef, ctx: &mut ExecutionCtx) -> VortexResult<u64> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is counting things in an array, should it return a usize?

false
}

fn accumulate(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to register a generic aggregate kernel to reduce count-non-null to be Array.validity().sum(), then we avoid decompressing all the data.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

575672a

added generic kernels first, but then it's hard to express "I never want Count::accumulate to be called" - so switched to try_accumulate instead

}

fn partial_struct_dtype() -> DType {
DType::Struct(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you static LazyLock this entire dtype inside this function?

@gatesn gatesn added the changelog/feature A new feature label Mar 30, 2026
Copy link
Copy Markdown
Contributor

@joseph-isaacs joseph-isaacs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be decimal for decimals?

Comment on lines +53 to +56
pub struct MeanPartial {
sum: f64,
count: u64,
}
Copy link
Copy Markdown
Contributor

@joseph-isaacs joseph-isaacs Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will compute with a pretty large error?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$$\mu_n = \sum_{i=0}^n x_i/n \iff \mu_{n+1} = \mu_n + \frac{x_{n+1} - \mu_n}{1+n}$$

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need only hold the partial mean and count

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can incrementally push the next value into this partial, but you cannot combine two partials afaik?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For combining you do. $$\mu_{AB} = \frac{n_A \mu_A + n_B \mu_B}{n_A + n_B}$$

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DF sum is T and ours is f64

Copy link
Copy Markdown
Contributor

@joseph-isaacs joseph-isaacs Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error comes from numbers of different scales being combined, not from number of OPS.

Since we are storing these on disk we cannot "just change it later".

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DF only supports AVG for floats, decimals, and durations. And coerces all floats to f64.

Copy link
Copy Markdown
Contributor

@joseph-isaacs joseph-isaacs Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

| Input Type   | DuckDB | DataFusion | Velox      |
|--------------|--------|------------|------------|
| Integer types| DOUBLE | Float64    | DOUBLE     |
| FLOAT/REAL   | DOUBLE | Float64    | REAL       |
| DOUBLE       | DOUBLE | Float64    | DOUBLE     |
| DECIMAL(p,s) | DOUBLE | Float64    | DECIMAL(p,s)|

Copy link
Copy Markdown
Member Author

@blaginin blaginin Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought about this more - i think it should decimal for decimals? - happy to change

blaginin added 3 commits April 2, 2026 14:24
Signed-off-by: blaginin <github@blaginin.me>
Signed-off-by: blaginin <github@blaginin.me>
@blaginin blaginin changed the title Count and Mean aggregates feat: Count and Mean aggregates Apr 2, 2026
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq bot commented Apr 2, 2026

Merging this PR will degrade performance by 21.7%

⚡ 1 improved benchmark
❌ 2 regressed benchmarks
✅ 1119 untouched benchmarks
⏩ 1530 skipped benchmarks1

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation chunked_bool_canonical_into[(1000, 10)] 72.9 µs 65.9 µs +10.49%
Simulation varbinview_zip_block_mask 2.9 ms 3.8 ms -21.7%
Simulation varbinview_zip_fragmented_mask 6 ms 7 ms -14.51%

Comparing db/count-and-mean (575672a) with develop (38ab5af)

Open in CodSpeed

Footnotes

  1. 1530 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/feature A new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants