Skip to content

Conversation

@Jefffrey
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

Main value add here is ensure UDAFs encode their actual accepted types in their signature instead of internally casting to the actual types they support from a wider signature. Also doing some driveby refactoring of removing manual Debug impls.

What changes are included in this PR?

See rationale.

Are these changes tested?

Existing tests.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the functions Changes to functions implementation label Jan 10, 2026
approx_percentile_cont: ApproxPercentileCont,
}

impl Debug for ApproxPercentileContWithWeight {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only advantage I see to these manual impls is sometimes they add the name field; I personally don't see those as useful enough to need a separate impl so opting for derive to remove code where possible.

Comment on lines +82 to +85
signature: Signature::exact(
vec![DataType::Float64, DataType::Float64],
Volatility::Immutable,
),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way type coercion handles casting for us; we can now remove the code that does casting internally in the accumulators for us

}

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
if !arg_types[0].is_numeric() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be more confident in our signature code to guard this for us, to promote consistency across our UDFs

arr2.next()
} else {
None
for (value1, value2) in values1.iter().zip(values2) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Little driveby refactor to make iteration cleaner

let means1 = downcast_value!(states[1], Float64Array);
let means2 = downcast_value!(states[2], Float64Array);
let cs = downcast_value!(states[3], Float64Array);
let counts = as_uint64_array(&states[0])?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started using as_float64_array above so decided to make the whole file consistent in which way it handles downcasting

@Jefffrey Jefffrey marked this pull request as ready for review January 10, 2026 08:08
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Jefffrey -- this looks like a really nice cleanup to me

if value1.is_none() || value2.is_none() {
continue;
}
let values1 = as_float64_array(&values[0])?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is certainly much nicer


let value1 = unwrap_or_internal_err!(value1);
let value2 = unwrap_or_internal_err!(value2);
for (value1, value2) in values1.iter().zip(values2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much nicer

) -> Result<()> {
assert_eq!(values.len(), 3, "two arguments to merge_batch");
// first batch is counts, second is partial means, third is partial m2s
let partial_counts = downcast_value!(values[0], UInt64Array);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should deprecate downcast value and direct people to the as_* methods ? It seems strange to have two ways to do something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have definitely found it a little confusing all the ways we can do downcasting, for example:

While they all do have slightly different behaviours (with how they treat erroring and such), it would help to be consistent 🤔

We could make downcast_value! deprecated and eventually try to make it private to common

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the biggest difference as I recall is taht ones in arrow panic, and the ones in DataFusion retun errors

@Jefffrey Jefffrey added this pull request to the merge queue Jan 21, 2026
Merged via the queue into apache:main with commit 10a1d4e Jan 21, 2026
28 checks passed
@Jefffrey
Copy link
Contributor Author

Thanks @alamb

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants