
feat: add inner_product scalar function #21861

Merged: alamb merged 4 commits into apache:main from crm26:feat/inner-product on May 3, 2026

@crm26 crm26 commented Apr 26, 2026

Which issue does this PR close?

Part of #21536 — split of #21371 into one-function-per-PR.

Rationale for this change

Adds inner_product(array1, array2) — the dot product of two equal-length numeric arrays, returning Float64. Computed as sum(array1[i] * array2[i]).

What changes are included in this PR?

Mirrors the structural pattern of merged #21542 (cosine_distance):

  • Same coerce_types for List/LargeList/FixedSizeList of any numeric inner type, with widening to LargeList when any input is LargeList (per the pattern in #21704, "fix: array_concat widens container variant for mixed List/LargeList inputs")
  • Same NULL semantics: bare NULL → NULL, NULL row → NULL, NULL element in list → NULL
  • Same Arrow-idiomatic implementation: single as_float64_array(list_array.values()) downcast, slice by value_offsets(), iterate via ScalarBuffer<f64>
  • No alias, no shared module — standalone, inline math
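
The flat-buffer access pattern in the third bullet can be illustrated without Arrow. The following is a minimal sketch, assuming a hypothetical helper `dot_rows` that takes plain slices in place of the Arrow `Float64Array` values and `usize` offsets in place of Arrow's `i32`/`i64` offsets; it is not the PR's code and it ignores NULL handling.

```rust
/// Hypothetical illustration (not the PR's implementation): per-row dot
/// product over a flattened list layout. All elements of a column live in
/// one `values` slice; row `i` spans `offsets[i]..offsets[i + 1]`.
fn dot_rows(
    a_values: &[f64],
    a_offsets: &[usize],
    b_values: &[f64],
    b_offsets: &[usize],
) -> Result<Vec<f64>, String> {
    if a_offsets.len() != b_offsets.len() {
        return Err("inputs have different row counts".to_string());
    }
    // n + 1 offsets describe n rows.
    let rows = a_offsets.len().saturating_sub(1);
    let mut out = Vec::with_capacity(rows);
    for i in 0..rows {
        // Slice each row out of the shared buffer; no per-row allocation.
        let a = &a_values[a_offsets[i]..a_offsets[i + 1]];
        let b = &b_values[b_offsets[i]..b_offsets[i + 1]];
        if a.len() != b.len() {
            return Err(format!("row {i}: length mismatch ({} vs {})", a.len(), b.len()));
        }
        // An empty row folds to 0.0, the sum of an empty set.
        let dot = a.iter().zip(b).fold(0.0_f64, |acc, (x, y)| acc + x * y);
        out.push(dot);
    }
    Ok(out)
}
```

With values `[1, 2, 3, 4, 5]` and `[10, 20, 1, 1, 1]` sharing offsets `[0, 2, 2, 5]`, the three rows evaluate to 50, 0 (the empty middle row), and 12.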

The arithmetic is the only semantic divergence from cosine_distance:

  • dot += a*b (no magnitude or normalization)
  • Empty arrays return 0.0 (sum of empty set), not NULL
  • No zero-magnitude special case (inner_product([0,0], [1,2]) returns 0, which is well-defined for inner product)
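
The row-level semantics described above can be sketched in plain Rust. This is an illustration only, assuming a hypothetical function `inner_product_row` that models a NULL row as `None` and a NULL element as `None` inside the slice. The order of the length check relative to the NULL-element check is an assumption, not something the PR description states.

```rust
/// Hypothetical model of one row of inner_product (not the PR's code).
/// Outer `None` is a NULL row; inner `None` is a NULL element.
fn inner_product_row(
    a: Option<&[Option<f64>]>,
    b: Option<&[Option<f64>]>,
) -> Result<Option<f64>, String> {
    // A NULL row on either side yields NULL.
    let (a, b) = match (a, b) {
        (Some(a), Some(b)) => (a, b),
        _ => return Ok(None),
    };
    // Assumption: the length check runs before elements are inspected.
    if a.len() != b.len() {
        return Err(format!("length mismatch: {} vs {}", a.len(), b.len()));
    }
    let mut dot = 0.0_f64;
    for (x, y) in a.iter().zip(b) {
        match (x, y) {
            // dot += a*b, with no magnitude or normalization step.
            (Some(x), Some(y)) => dot += x * y,
            // Any NULL element makes the whole row NULL.
            _ => return Ok(None),
        }
    }
    // Empty inputs fall through with dot == 0.0.
    Ok(Some(dot))
}
```

Under this model `[1, 2, 3]` and `[4, 5, 6]` give 32, two empty arrays give 0, and `[0, 0]` with `[1, 2]` gives 0 with no special case.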

Are these changes tested?

Yes. SLT covers:

  • Orthogonal, identical, opposite, general non-trivial vectors
  • Single zero vector, both zero vectors
  • Bare NULL in either or both positions
  • NULL element inside a list (returns NULL for that row)
  • Mismatched lengths (error)
  • LargeList inputs
  • Mixed (List, LargeList) in both orders
  • (FixedSizeList, FixedSizeList) and (FixedSizeList, LargeList)
  • Float32 and Int64 inner type coercion
  • Multi-row query with NULL row propagation
  • Empty arrays (returns 0)
  • No-args error
  • Return-type assertion (Float64)

Are there any user-facing changes?

New scalar function inner_product, documented in docs/source/user-guide/sql/scalar_functions.md.

github-actions (bot) added the documentation, sqllogictest, and functions labels on Apr 26, 2026
)]
#[derive(Debug, PartialEq, Eq, Hash)]
pub struct InnerProduct {
signature: Signature,

Should we add a dot_product alias?


Thanks @Jefffrey — added dot_product as an alias in ef9895005, with SLT coverage for both a constant-args and a multi-row-with-NULL case. Doc regen picked up the alias automatically (#### Aliases block under inner_product, plus a top-level ### dot_product Alias of stub).


alamb commented Apr 27, 2026

I had a thought about adding new functions:

crm26 force-pushed the feat/inner-product branch from 8c05259 to ef98950 on April 29, 2026 21:45

Jefffrey commented May 2, 2026

I think once merge conflict is fixed we should be good to merge this


alamb commented May 3, 2026

I merged up to resolve a conflict

@alamb alamb enabled auto-merge May 3, 2026 12:00

alamb commented May 3, 2026

Thanks @crm26 and @Jefffrey

@alamb alamb added this pull request to the merge queue May 3, 2026
Merged via the queue into apache:main with commit 9a29e33 May 3, 2026
36 checks passed
lyne7-sc pushed a commit to lyne7-sc/datafusion that referenced this pull request May 20, 2026
## Which issue does this PR close?

Part of apache#21536 — split of apache#21371 into one-function-per-PR. Third in the
series after apache#21542 (cosine_distance) and apache#21861 (inner_product).

## Rationale for this change

Adds `array_normalize(array)` — the L2-normalized version of a numeric
input vector. Computed as `array[i] / sqrt(sum(array[i]^2))` per
element. Returns the same shape as the input (`List<Float64>` or
`LargeList<Float64>`).

Aliased as `list_normalize` to match the `array_X`/`list_X` convention
used across the crate.

## What changes are included in this PR?

Coercion shell mirrors the merged cosine_distance/inner_product pattern:
- `coerce_types` accepts `List`/`LargeList`/`FixedSizeList` of any
numeric inner type, plus bare `NULL`. After coercion the inner function
only sees `List(Float64)` or `LargeList(Float64)`.
- Per-row L2 norm computed inline (no shared module), using a single
`as_float64_array(list_array.values())` downcast plus `value_offsets()`
slicing — no per-row downcasts.
- Manual list builder: `Vec<f64>` for values, `Vec<O>` for offsets,
`NullBuffer` for row validity.

Per-row semantics:
- NULL row → NULL output
- NULL element in list → NULL row
- Empty list → empty list (no division-by-zero hazard)
- Zero magnitude → NULL row (consistent with cosine_distance's
zero-magnitude → NULL)
- Otherwise → divide each element by `sqrt(sum-of-squares)`
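
As a rough illustration of the per-row rules listed above, here is a minimal sketch in plain Rust. The function name `array_normalize_row` and the `Option`-based modeling of NULLs are assumptions made for the example; the actual UDF operates on Arrow list arrays.

```rust
/// Hypothetical model of one row of array_normalize (not the PR's code).
/// Outer `None` is a NULL row; inner `None` is a NULL element.
fn array_normalize_row(row: Option<&[Option<f64>]>) -> Option<Vec<f64>> {
    // NULL row -> NULL output.
    let row = row?;
    // NULL element in list -> NULL row (collect short-circuits on None).
    let values: Vec<f64> = row.iter().copied().collect::<Option<Vec<f64>>>()?;
    // Empty list -> empty list; nothing to divide.
    if values.is_empty() {
        return Some(values);
    }
    let norm = values.iter().fold(0.0_f64, |acc, v| acc + v * v).sqrt();
    // Zero magnitude -> NULL row, avoiding a division by zero.
    if norm == 0.0 {
        return None;
    }
    Some(values.iter().map(|v| v / norm).collect())
}
```

For the 3-4-5 case, `[3, 4]` has norm 5 and normalizes to `[0.6, 0.8]`.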

## Are these changes tested?

Yes. SLT covers:
- 3-4-5 right triangle, 3D vector, already-unit-axis, single non-zero
component, negative components
- Bare `NULL` input, NULL element in list, zero vector, empty array
- `LargeList`, `FixedSizeList` (via coercion), `Float32` and `Int64`
inner types, integer literals
- Multi-row query mixing normal / NULL row / zero-vector row /
null-element row
- Plan error for non-list input
- No-args error
- Return-type assertion (`List(Float64)`)
- `list_normalize` alias coverage (constant + multi-row with NULL)

## Are there any user-facing changes?

New scalar function `array_normalize` (alias `list_normalize`),
documented in `docs/source/user-guide/sql/scalar_functions.md`.
crm26 added a commit to crm26/datafusion that referenced this pull request May 22, 2026
Adds `array_add(array1, array2)` returning the element-wise sum of two
numeric arrays. Aliased as `list_add`. Follows the per-function split
pattern established by cosine_distance (apache#21542), inner_product (apache#21861),
and array_normalize (apache#22013) per tracking issue apache#21536.

Semantics:
- NULL row in either input -> NULL row out
- NULL element at position i in either input -> NULL element at i out
  (per-element propagation, divergent from inner_product which nulls
  the whole row; chosen because output is a list, not a scalar)
- Length mismatch between rows -> exec_err
- Empty arrays -> empty array
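
The per-element propagation rule above can be sketched as follows. This is an illustration in plain Rust, assuming a hypothetical `array_add_row` that models NULLs with `Option` and returns a `String` error where the commit message mentions `exec_err`.

```rust
/// Hypothetical model of one row of array_add (not the commit's code).
/// Outer `None` is a NULL row; inner `None` is a NULL element.
fn array_add_row(
    a: Option<&[Option<f64>]>,
    b: Option<&[Option<f64>]>,
) -> Result<Option<Vec<Option<f64>>>, String> {
    // NULL row in either input -> NULL row out.
    let (a, b) = match (a, b) {
        (Some(a), Some(b)) => (a, b),
        _ => return Ok(None),
    };
    if a.len() != b.len() {
        return Err(format!("length mismatch: {} vs {}", a.len(), b.len()));
    }
    // A NULL at position i only nulls position i of the output,
    // unlike inner_product, where it nulls the whole row.
    let sum = a
        .iter()
        .zip(b)
        .map(|(x, y)| match (x, y) {
            (Some(x), Some(y)) => Some(x + y),
            _ => None,
        })
        .collect();
    Ok(Some(sum))
}
```

Adding `[1, NULL, 3]` to `[10, 20, 30]` in this model yields `[11, NULL, 33]`.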

Supports List, LargeList, and FixedSizeList inputs; numeric element
types are coerced to Float64. If any input is LargeList, both sides
are widened to LargeList for homogeneous runtime dispatch.
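
The widening rule can be pictured as a small decision over container kinds. The sketch below uses a hypothetical `ListKind` enum and `coerced_kind` function rather than Arrow's `DataType`. Mapping `FixedSizeList` to `List` when no `LargeList` is present is an assumption based on the array_normalize description, which says the inner function only sees `List(Float64)` or `LargeList(Float64)`.

```rust
/// Hypothetical stand-in for the list container variants of Arrow's DataType.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ListKind {
    List,
    LargeList,
    FixedSizeList,
}

/// Picks the single container kind all inputs are coerced to.
/// If any input is LargeList, everything widens to LargeList;
/// otherwise (assumption) List and FixedSizeList both become List.
fn coerced_kind(inputs: &[ListKind]) -> ListKind {
    if inputs.iter().any(|k| *k == ListKind::LargeList) {
        ListKind::LargeList
    } else {
        ListKind::List
    }
}
```

Coercing both sides to one kind means the runtime only needs a single dispatch branch per call instead of one per input combination.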

Uses OffsetBufferBuilder + NullBufferBuilder per the pattern adopted
in array_normalize round 1.
zhuqi-lucas pushed a commit to zhuqi-lucas/arrow-datafusion that referenced this pull request May 26, 2026
## Which issue does this PR close?

Partial of apache#21536 — `array_scale` (the list+scalar arithmetic function
in the vector math series).

## Rationale for this change

Continues the per-function split requested by @alamb on apache#21536. Three
sibling PRs already merged: `cosine_distance` (apache#21542), `inner_product`
(apache#21861), `array_normalize` (apache#22013). `array_add` is in flight as apache#22459
by @SubhamSinghal.

Adds element-wise scalar multiplication for numeric arrays, returning a
list of the same shape. Aliased as `list_scale` to match the `array_X` /
`list_X` precedent in this crate.

## What changes are included in this PR?

- New scalar UDF `array_scale(array, scalar)` in
`datafusion/functions-nested/src/array_scale.rs`
- Module wire-up + registration in
`datafusion/functions-nested/src/lib.rs`
- SLT tests at `datafusion/sqllogictest/test_files/array_scale.slt`
- Auto-generated function docs entry in
`docs/source/user-guide/sql/scalar_functions.md`

**Signature:** first arg `List/LargeList/FixedSizeList<numeric>`, second
arg numeric scalar. Both coerce to `Float64`. Same list-widening rules
as the binary-op siblings.

**NULL semantics:**
- NULL row in array → NULL row out
- NULL scalar → NULL row out (whole-row, because the scalar applies
uniformly)
- NULL element at position `i` → NULL element at `i` out
(per-element propagation)
- Empty array → empty array
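
A minimal sketch of these three NULL levels in plain Rust follows. The name `array_scale_row` and the use of `Option` for NULLs are assumptions made for illustration, not the PR's Arrow-based code.

```rust
/// Hypothetical model of one row of array_scale (not the PR's code).
/// Outer `None` is a NULL row; inner `None` is a NULL element.
fn array_scale_row(
    row: Option<&[Option<f64>]>,
    scalar: Option<f64>,
) -> Option<Vec<Option<f64>>> {
    // NULL row or NULL scalar -> NULL row out.
    let (row, s) = (row?, scalar?);
    // NULL element stays NULL at its own position; others are scaled.
    Some(row.iter().map(|v| v.map(|x| x * s)).collect())
}
```

Scaling `[1, NULL, 3]` by 2 gives `[2, NULL, 6]`, while a NULL scalar nulls the entire row.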

**Builders:** uses `OffsetBufferBuilder` + `NullBufferBuilder` per
the pattern adopted in the round-1 review of apache#22013.

## Are these changes tested?

Yes. `array_scale.slt` covers:
- Happy paths (positive, negative, zero, fractional, single-element)
- NULL propagation at all three levels (NULL row, NULL scalar, NULL
element)
- All list type variants (`List`, `LargeList`, `FixedSizeList`)
- Numeric inner type coercion (Float32, Int64, integer literals)
- Multi-row queries with both constant-scalar broadcast and per-row
column scalar
- Error paths (non-numeric scalar, non-list first arg, wrong arity)
- Empty array
- `list_scale` alias

## Are there any user-facing changes?

Yes: new SQL scalar function `array_scale(array, scalar)` and its
alias `list_scale`. Documented in
`docs/source/user-guide/sql/scalar_functions.md`.
crm26 added a commit to crm26/datafusion that referenced this pull request May 26, 2026
Adds `array_sum(array)` returning the sum of elements in a numeric array.
Aliased as `list_sum`. Part of the per-function split sequence on
tracking issue apache#21536, following the pattern of the already-merged PRs
in this series (cosine_distance apache#21542, inner_product apache#21861,
array_normalize apache#22013, array_scale apache#22466).

Semantics:
- NULL row in array -> NULL row out
- NULL elements are skipped (SQL aggregate convention; matches
  PostgreSQL array_sum, DuckDB list_sum, Spark aggregate). A row whose
  every element is NULL yields NULL.
- Empty array -> 0.0 (additive identity, matches SQL SUM over no rows
  conceptually, and DuckDB list_sum([]) = 0)

Input is List/LargeList/FixedSizeList of any numeric type; elements
are coerced to Float64. Output is Float64.
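
The skip-NULL behavior described in this commit message differs from inner_product, so a small sketch may help. It assumes a hypothetical `array_sum_row` in plain Rust with `Option` standing in for NULL; it is not the commit's code.

```rust
/// Hypothetical model of one row of array_sum (not the commit's code).
/// Outer `None` is a NULL row; inner `None` is a NULL element.
fn array_sum_row(row: Option<&[Option<f64>]>) -> Option<f64> {
    // NULL row -> NULL out.
    let row = row?;
    // Empty array -> 0.0, the additive identity.
    if row.is_empty() {
        return Some(0.0);
    }
    let mut sum = 0.0_f64;
    let mut seen_value = false;
    // `flatten` skips NULL elements and yields only the present values.
    for v in row.iter().flatten() {
        sum += v;
        seen_value = true;
    }
    // A non-empty row whose every element is NULL yields NULL.
    if seen_value { Some(sum) } else { None }
}
```

In this model `[1, NULL, 3]` sums to 4, `[NULL, NULL]` is NULL, and `[]` is 0.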
