feat: add inner_product scalar function #21861
Merged
Jefffrey
reviewed
Apr 27, 2026
)]
#[derive(Debug, PartialEq, Eq, Hash)]
pub struct InnerProduct {
    signature: Signature,
Contributor
Should we add a dot_product alias?
Contributor
Author
Thanks @Jefffrey — added dot_product as an alias in ef9895005, with SLT coverage for both a constant-args and a multi-row-with-NULL case. Doc regen picked up the alias automatically (#### Aliases block under inner_product, plus a top-level ### dot_product Alias of stub).
Contributor
I had a thought about adding new functions:
8c05259 to
ef98950
Compare
Jefffrey
approved these changes
Apr 30, 2026
Contributor
I think once the merge conflict is fixed we should be good to merge this.
Contributor
I merged up to resolve a conflict.
Contributor
lyne7-sc
pushed a commit
to lyne7-sc/datafusion
that referenced
this pull request
May 20, 2026
## Which issue does this PR close?

Part of apache#21536: split of apache#21371 into one-function-per-PR. Third in the series after apache#21542 (cosine_distance) and apache#21861 (inner_product).

## Rationale for this change

Adds `array_normalize(array)`, the L2-normalized version of a numeric input vector. Computed as `array[i] / sqrt(sum(array[i]^2))` per element. Returns the same shape as the input (`List<Float64>` or `LargeList<Float64>`). Aliased as `list_normalize` to match the `array_X`/`list_X` convention used across the crate.

## What changes are included in this PR?

Coercion shell mirrors the merged cosine_distance/inner_product pattern:

- `coerce_types` accepts `List`/`LargeList`/`FixedSizeList` of any numeric inner type, plus bare `NULL`. After coercion the inner function only sees `List(Float64)` or `LargeList(Float64)`.
- Per-row L2 norm computed inline (no shared module), using a single `as_float64_array(list_array.values())` downcast plus `value_offsets()` slicing, with no per-row downcasts.
- Manual list builder: `Vec<f64>` for values, `Vec<O>` for offsets, `NullBuffer` for row validity.

Per-row semantics:

- NULL row → NULL output
- NULL element in list → NULL row
- Empty list → empty list (no division-by-zero hazard)
- Zero magnitude → NULL row (consistent with cosine_distance's zero-magnitude → NULL)
- Otherwise → divide each element by `sqrt(sum-of-squares)`

## Are these changes tested?

Yes. SLT covers:

- 3-4-5 right triangle, 3D vector, already-unit-axis, single non-zero component, negative components
- Bare `NULL` input, NULL element in list, zero vector, empty array
- `LargeList`, `FixedSizeList` (via coercion), `Float32` and `Int64` inner types, integer literals
- Multi-row query mixing normal / NULL row / zero-vector row / null-element row
- Plan error for non-list input
- No-args error
- Return-type assertion (`List(Float64)`)
- `list_normalize` alias coverage (constant + multi-row with NULL)

## Are there any user-facing changes?

New scalar function `array_normalize` (alias `list_normalize`), documented in `docs/source/user-guide/sql/scalar_functions.md`.
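The per-row semantics listed in this commit message can be sketched in plain Rust. This is only an illustration, not the code from the PR: the function name `normalize_row` is hypothetical, and `Option` stands in for SQL NULL at both the row and the element level instead of Arrow's validity buffers.

```rust
/// Hypothetical row-level model of `array_normalize` (not the PR's code).
/// `None` models SQL NULL for a whole row or for a single element.
fn normalize_row(row: Option<&[Option<f64>]>) -> Option<Vec<f64>> {
    // NULL row -> NULL output.
    let row = row?;

    let mut sum_sq = 0.0_f64;
    let mut values = Vec::with_capacity(row.len());
    for v in row {
        // NULL element anywhere in the list -> NULL row.
        let v = (*v)?;
        sum_sq += v * v;
        values.push(v);
    }

    // Empty list -> empty list; no division happens at all.
    if values.is_empty() {
        return Some(values);
    }

    // Zero magnitude -> NULL row, avoiding a division by zero.
    let norm = sum_sq.sqrt();
    if norm == 0.0 {
        return None;
    }

    // Otherwise divide each element by sqrt(sum-of-squares).
    Some(values.into_iter().map(|v| v / norm).collect())
}
```

Under this model the 3-4-5 triangle case from the test list normalizes `[3, 4]` to `[0.6, 0.8]`, while a zero vector or a list containing a NULL element yields a NULL row.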
This was referenced May 20, 2026
crm26
added a commit
to crm26/datafusion
that referenced
this pull request
May 22, 2026
Adds `array_add(array1, array2)` returning the element-wise sum of two numeric arrays. Aliased as `list_add`. Follows the per-function split pattern established by cosine_distance (apache#21542), inner_product (apache#21861), and array_normalize (apache#22013) per tracking issue apache#21536.

Semantics:

- NULL row in either input -> NULL row out
- NULL element at position i in either input -> NULL element at i out (per-element propagation, divergent from inner_product which nulls the whole row; chosen because output is a list, not a scalar)
- Length mismatch between rows -> exec_err
- Empty arrays -> empty array

Supports List, LargeList, and FixedSizeList inputs; numeric element types are coerced to Float64. If any input is LargeList, both sides are widened to LargeList for homogeneous runtime dispatch. Uses OffsetBufferBuilder + NullBufferBuilder per the pattern adopted in array_normalize round 1.
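A minimal sketch of the `array_add` semantics described in this commit message, written in plain Rust. The name `add_rows` is hypothetical, `Option` models SQL NULL, and a `String` error stands in for the `exec_err` the message mentions; the order in which the NULL-row and length checks run is an assumption.

```rust
/// Hypothetical row-level model of `array_add` (not the PR's code).
/// Unlike `inner_product`, a NULL element only nulls its own position.
fn add_rows(
    a: Option<&[Option<f64>]>,
    b: Option<&[Option<f64>]>,
) -> Result<Option<Vec<Option<f64>>>, String> {
    // NULL row in either input -> NULL row out.
    let (a, b) = match (a, b) {
        (Some(a), Some(b)) => (a, b),
        _ => return Ok(None),
    };

    // Length mismatch between rows -> error (stand-in for exec_err).
    if a.len() != b.len() {
        return Err(format!("length mismatch: {} vs {}", a.len(), b.len()));
    }

    // Per-element propagation: NULL at position i -> NULL at position i.
    let out = a
        .iter()
        .zip(b)
        .map(|(x, y)| match (x, y) {
            (Some(x), Some(y)) => Some(x + y),
            _ => None,
        })
        .collect();
    Ok(Some(out))
}
```

Two empty rows pass the length check and collect into an empty output list, matching the "Empty arrays -> empty array" rule.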
zhuqi-lucas
pushed a commit
to zhuqi-lucas/arrow-datafusion
that referenced
this pull request
May 26, 2026
## Which issue does this PR close?

Partial of apache#21536: `array_scale` (the list+scalar arithmetic function in the vector math series).

## Rationale for this change

Continues the per-function split requested by @alamb on apache#21536. Three sibling PRs already merged: `cosine_distance` (apache#21542), `inner_product` (apache#21861), `array_normalize` (apache#22013). `array_add` is in flight as apache#22459 by @SubhamSinghal.

Adds element-wise scalar multiplication for numeric arrays, returning a list of the same shape. Aliased as `list_scale` to match the `array_X` / `list_X` precedent in this crate.

## What changes are included in this PR?

- New scalar UDF `array_scale(array, scalar)` in `datafusion/functions-nested/src/array_scale.rs`
- Module wire-up + registration in `datafusion/functions-nested/src/lib.rs`
- SLT tests at `datafusion/sqllogictest/test_files/array_scale.slt`
- Auto-generated function docs entry in `docs/source/user-guide/sql/scalar_functions.md`

**Signature:** first arg `List/LargeList/FixedSizeList<numeric>`, second arg numeric scalar. Both coerce to `Float64`. Same list-widening rules as the binary-op siblings.

**NULL semantics:**

- NULL row in array → NULL row out
- NULL scalar → NULL row out (whole-row, because the scalar applies uniformly)
- NULL element at position `i` → NULL element at `i` out (per-element propagation)
- Empty array → empty array

**Builders:** uses `OffsetBufferBuilder` + `NullBufferBuilder` per the pattern adopted in the round-1 review of apache#22013.

## Are these changes tested?

Yes. `array_scale.slt` covers:

- Happy paths (positive, negative, zero, fractional, single-element)
- NULL propagation at all three levels (NULL row, NULL scalar, NULL element)
- All list type variants (`List`, `LargeList`, `FixedSizeList`)
- Numeric inner type coercion (Float32, Int64, integer literals)
- Multi-row queries with both constant-scalar broadcast and per-row column scalar
- Error paths (non-numeric scalar, non-list first arg, wrong arity)
- Empty array
- `list_scale` alias

## Are there any user-facing changes?

Yes: new SQL scalar function `array_scale(array, scalar)` and its alias `list_scale`. Documented in `docs/source/user-guide/sql/scalar_functions.md`.
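The three levels of NULL propagation described for `array_scale` can be modelled in a few lines of plain Rust. This is a sketch under stated assumptions, not the PR's implementation: `scale_row` is a hypothetical name and `Option` stands in for SQL NULL.

```rust
/// Hypothetical row-level model of `array_scale` (not the PR's code).
fn scale_row(row: Option<&[Option<f64>]>, scalar: Option<f64>) -> Option<Vec<Option<f64>>> {
    // NULL row -> NULL row out.
    let row = row?;
    // NULL scalar -> NULL row out, since the scalar applies to every element.
    let k = scalar?;
    // NULL element at position i -> NULL element at i; others are multiplied.
    Some(row.iter().map(|v| v.map(|x| x * k)).collect())
}
```

An empty input row maps to an empty output row because the iterator simply yields nothing.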
crm26
added a commit
to crm26/datafusion
that referenced
this pull request
May 26, 2026
Adds `array_sum(array)` returning the sum of elements in a numeric array. Aliased as `list_sum`. Part of the per-function split sequence on tracking issue apache#21536, following the pattern of the already-merged PRs in this series (cosine_distance apache#21542, inner_product apache#21861, array_normalize apache#22013, array_scale apache#22466).

Semantics:

- NULL row in array -> NULL row out
- NULL elements are skipped (SQL aggregate convention; matches PostgreSQL array_sum, DuckDB list_sum, Spark aggregate). A row whose every element is NULL yields NULL.
- Empty array -> 0.0 (additive identity, matches SQL SUM over no rows conceptually, and DuckDB list_sum([]) = 0)

Input is List/LargeList/FixedSizeList of any numeric type; elements are coerced to Float64. Output is Float64.
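The `array_sum` rules above distinguish an empty row from a row made up entirely of NULLs, which is easy to get wrong. A rough sketch in plain Rust, assuming `Option` as the NULL model and a hypothetical function name `sum_row`:

```rust
/// Hypothetical row-level model of `array_sum` (not the PR's code).
fn sum_row(row: Option<&[Option<f64>]>) -> Option<f64> {
    // NULL row -> NULL out.
    let row = row?;

    // Empty array -> 0.0 (additive identity).
    if row.is_empty() {
        return Some(0.0);
    }

    // Skip NULL elements; remember whether any non-NULL value was seen.
    let mut sum = 0.0_f64;
    let mut seen = false;
    for v in row.iter().flatten() {
        sum += v;
        seen = true;
    }

    // A non-empty row whose every element is NULL yields NULL.
    if seen { Some(sum) } else { None }
}
```

The explicit `is_empty` check is what keeps the empty row at `0.0` while the all-NULL row still falls through to NULL.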
## Which issue does this PR close?

Part of #21536: split of #21371 into one-function-per-PR.

## Rationale for this change

Adds `inner_product(array1, array2)`, the dot product of two equal-length numeric arrays, returning `Float64`. Computed as `sum(array1[i] * array2[i])`.

## What changes are included in this PR?

Mirrors the structural pattern of merged #21542 (`cosine_distance`):

- `coerce_types` for `List`/`LargeList`/`FixedSizeList` of any numeric inner type, with widening to `LargeList` when any input is `LargeList` (per the pattern from #21704, "fix: array_concat widens container variant for mixed List/LargeList inputs")
- `NULL` → `NULL`, NULL row → NULL, NULL element in list → NULL
- `as_float64_array(list_array.values())` downcast, slice by `value_offsets()`, iterate via `ScalarBuffer<f64>`

The arithmetic is the only semantic divergence from `cosine_distance`:

- `dot += a*b` (no magnitude or normalization)
- Empty arrays return `0.0` (sum of empty set), not `NULL`
- Zero vectors are not special-cased (`inner_product([0,0], [1,2])` returns `0`, which is well-defined for inner product)

## Are these changes tested?

Yes. SLT covers:

- `NULL` in either or both positions
- `LargeList` inputs
- `(List, LargeList)` in both orders
- `(FixedSizeList, FixedSizeList)` and `(FixedSizeList, LargeList)`
- `Float32` and `Int64` inner type coercion
- A case returning `0`
- Return type (`Float64`)

## Are there any user-facing changes?

New scalar function `inner_product`, documented in `docs/source/user-guide/sql/scalar_functions.md`.
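For readers who want the semantics at a glance, here is a minimal plain-Rust sketch of the behaviour this description outlines. It is not the PR's code: `inner_product_row` is a hypothetical name, `Option` models SQL NULL, and returning an error for unequal lengths is an assumption, since the description only says the inputs are equal-length.

```rust
/// Hypothetical row-level model of `inner_product` (not the PR's code).
/// `None` models SQL NULL for a whole row or for a single element.
fn inner_product_row(
    a: Option<&[Option<f64>]>,
    b: Option<&[Option<f64>]>,
) -> Result<Option<f64>, String> {
    // NULL row on either side -> NULL result.
    let (a, b) = match (a, b) {
        (Some(a), Some(b)) => (a, b),
        _ => return Ok(None),
    };

    // Assumption: unequal lengths are reported as an error.
    if a.len() != b.len() {
        return Err(format!("length mismatch: {} vs {}", a.len(), b.len()));
    }

    let mut dot = 0.0_f64;
    for (x, y) in a.iter().zip(b) {
        match (x, y) {
            // dot += a*b, with no magnitude or normalization step.
            (Some(x), Some(y)) => dot += x * y,
            // NULL element in either list -> NULL for the whole row.
            _ => return Ok(None),
        }
    }

    // Two empty lists skip the loop and return 0.0, the empty sum.
    Ok(Some(dot))
}
```

In this model `[1, 2, 3]` and `[4, 5, 6]` give `32`, and `[0, 0]` with `[1, 2]` gives `0` instead of NULL, which is the divergence from `cosine_distance` noted above.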