
[SPARK-56046][SQL] Typed SPJ partition key Reducers#54884

Open
peter-toth wants to merge 6 commits into apache:master from peter-toth:SPARK-56046-typed-spj-reducers

Conversation

@peter-toth (Contributor) commented Mar 18, 2026:

What changes were proposed in this pull request?

This PR adds a new method to SPJ partition key Reducers to return the type of a reduced partition key.

Why are the changes needed?

After the SPJ refactor, some Iceberg SPJ tests that join an hours-transform-partitioned table with a days-transform-partitioned table started to fail. This is because, after the refactor, the keys of a KeyedPartitioning are InternalRowComparableWrappers, which include the type of the key, and when partition keys are reduced, the reduced keys inherit their original type.

This means that when hours-transformed keys are reduced to days, the reduced keys keep their IntegerType, while the days-transformed keys have DateType in Iceberg. This type difference causes the left and right side InternalRowComparableWrappers to be considered unequal even though their raw InternalRow key data are equal.

Before the refactor, the type of (possibly reduced) partition keys was not stored in the partitioning. When the left and right side raw keys were compared in EnsureRequirements, a common comparator was initialized with the type of the left side keys.
So in the Iceberg SPJ tests, the IntegerType keys were forced to be interpreted as DateType, or the DateType keys were forced to be interpreted as IntegerType, depending on the join order of the tables.
The reason this did not cause any issues is that the PhysicalDataType of both the DateType and IntegerType logical types is PhysicalIntegerType.
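The mismatch described above can be illustrated with a small, hypothetical sketch (the class and method names below are made up for illustration and are not the real Spark or Iceberg APIs): the raw reduced key values are numerically equal, but one side is labeled IntegerType and the other DateType, both backed by the same physical 32-bit int.

```java
// Hypothetical illustration, not actual Spark/Iceberg code: an "hours"
// partition value reduced to days matches the "days" partition value
// numerically, even though one side is labeled IntegerType and the other
// DateType -- both share the physical int representation.
public class PhysicalTypeSketch {
    // hours since epoch -> days since epoch (what an hours->days reducer does)
    static int reduceHoursToDays(int hoursSinceEpoch) {
        return Math.floorDiv(hoursSinceEpoch, 24);
    }

    public static void main(String[] args) {
        long epochMicros = 1_700_000_000_000_000L;            // some timestamp
        int hoursKey = (int) (epochMicros / 3_600_000_000L);  // hours transform
        int daysKey  = (int) (epochMicros / 86_400_000_000L); // days transform

        // The raw reduced values are equal...
        System.out.println(reduceHoursToDays(hoursKey) == daysKey); // true
        // ...but one wrapper carries IntegerType and the other DateType,
        // so a type-aware equality check treats them as different keys.
    }
}
```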

This PR:

  • Introduces a new TypedReducer with resultType() method to return the correct type of the reduced keys.
  • Properly compares the left and right side reduced key types and returns an error when they are not the same.
  • Adds a new spark.sql.sources.v2.bucketing.allowIncompatibleTransformTypes.enabled=true flag to keep the old behavior and consider the reduced key types the same if they share a common physical type.
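The TypedReducer idea from the list above can be sketched with simplified stand-in types (the DataType, SimpleType, and HoursToDaysReducer names here are placeholders for illustration, not Spark's actual classes):

```java
// Simplified sketch of the proposed shape, not the actual Spark interfaces:
// a Reducer variant that also reports the logical type of its reduced keys,
// so both join sides can be compared with matching types.
interface DataType {}                                // stand-in for Spark's DataType
enum SimpleType implements DataType { INTEGER, DATE }

interface Reducer<I, O> {
    O reduce(I value);
}

// The PR's idea: a typed variant that additionally exposes resultType().
interface TypedReducer<I, O> extends Reducer<I, O> {
    DataType resultType();
}

class HoursToDaysReducer implements TypedReducer<Integer, Integer> {
    @Override
    public Integer reduce(Integer hours) {
        return Math.floorDiv(hours, 24); // hours since epoch -> days since epoch
    }

    // Reduced keys are days, which Iceberg models as DateType.
    @Override
    public DataType resultType() {
        return SimpleType.DATE;
    }
}
```

With resultType() available, the planner can compare the left and right side reduced key types instead of silently inheriting the pre-reduction type.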

Does this PR introduce any user-facing change?

Yes, the reduced key types are now properly compared and incompatibilities are reported to users, but the legacy flag can restore the old behavior.

How was this patch tested?

Added new UTs.

Was this patch authored or co-authored using generative AI tooling?

No.

@peter-toth force-pushed the SPARK-56046-typed-spj-reducers branch from 580ca49 to fa4bce7 on March 18, 2026 13:54
@peter-toth peter-toth marked this pull request as draft March 18, 2026 16:24
object YearsFunction extends ScalarFunction[Int] with ReducibleFunction[Int, Int] {
  override def inputTypes(): Array[DataType] = Array(TimestampType)
- override def resultType(): DataType = LongType
+ override def resultType(): DataType = IntegerType
@peter-toth (Contributor Author) commented Mar 18, 2026:
I changed the test years transform to return IntegerType and the test days transform to return DateType logical types, because those two differ but have the same PhysicalIntegerType physical type.
I also made days reducible to years, which is very similar to what Iceberg can do with hours and days.

@peter-toth (Contributor Author):
cc @szehon-ho , @dongjoon-hyun

@peter-toth peter-toth marked this pull request as ready for review March 18, 2026 17:20

/**
 * Returns the {@link DataType data type} of values produced by this function.
 * It can return null to signal it doesn't change the input type.
Member:
It's a little counter-intuitive design. May I ask why we need to use null instead of returning the input type, @peter-toth?

@peter-toth (Contributor Author) commented Mar 18, 2026:
Unfortunately, in this interface we don't have access to the original transform function's result type (the type argument I is not a Spark logical type), but we need to return some default value to indicate that the reducer doesn't change the result type (whatever it is).
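The design under discussion might be sketched like this (simplified, hypothetical names, not the actual Spark interfaces): resultType() defaults to null because the interface cannot see the original transform's Spark logical type, so null stands for "result type unchanged".

```java
// Sketch of the discussed design, with made-up stand-in types:
interface DataType {}

interface Reducer<I, O> {
    O reduce(I value);

    // Returning null signals: "this reducer doesn't change the input's
    // logical type, whatever that type is." Callers would fall back to the
    // transform's own result type when they see null.
    default DataType resultType() { return null; }
}
```

A reducer that preserves the type (e.g. a bucket reducer) would simply not override the default, while a type-changing reducer would return its concrete result type.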

Member:
Could you elaborate on that in the function description explicitly, please?

@peter-toth (Contributor Author):
Added more documentation in a00c069.

@peter-toth (Contributor Author):
I was thinking about this. Does this change make it more intuitive?

.createWithDefault(false)

val V2_BUCKETING_ALLOW_INCOMPATIBLE_TRANSFORM_TYPES =
buildConf("spark.sql.sources.v2.bucketing.allowIncompatibleTransformTypes.enabled")
Member:
Do you think we can set this configuration to false for some cases in the future, @peter-toth? I'm a little confused about when it makes sense to disallow incompatible transform types.

@peter-toth (Contributor Author):
This is a good question, and I was thinking about it too. I feel we should not compare different logical types due to their different semantic meanings, but seemingly this is what we currently do in some cases, so we should probably keep the behavior for now. I think in a future Spark release we can change this config to make sure a comparison makes sense.

@szehon-ho (Member) commented Mar 19, 2026:
Yeah, I'm also thinking that if there is some dangerous discrepancy now, it is worth a behavior change to fix it.

The only consumer that I know of is Iceberg, which has an hoursToDay reducer that changes the type, and a bucketReducer (which doesn't change the type). Iceberg will need to recompile against Spark 4.2 anyway, so it's an opportunity for us to fix it there.

WDYT (with regard to the Spark release policy)?

@peter-toth (Contributor Author):
Yeah, very likely Iceberg is the only project that has implemented reducers.

If we are OK with fixing the issue in Iceberg, then we probably don't need the latest commit; we can keep resultType() in Reducer, remove its default value, and drop this config.

Member:
I'm actively testing Spark 4.2.0 integration in Iceberg. The issue was only in 4.2.0-preview3, and I can work on the Iceberg changes for the next preview release. +1 to drop this config.

@peter-toth (Contributor Author):
Ok with me.
@szehon-ho, @dongjoon-hyun, let me know your preference and I can change this PR.

@dongjoon-hyun (Member):
cc @aokolnychyi, @cloud-fan, @gengliangwang, too.

}

-object YearsFunction extends ScalarFunction[Long] {
+object YearsFunction extends ScalarFunction[Int] with ReducibleFunction[Int, Int] {
Member:
Do you think we can spin-off this one independently from this PR?

@peter-toth (Contributor Author) commented Mar 18, 2026:

Technically I could, but I don't see a strong reason to separate these test function changes from the other parts of the PR.
We could argue that a spin-off makes sense to make these functions similar to their Iceberg versions, but that's not necessarily needed for the existing generic DSv2 tests. Actually, this particular PR requires two test functions with different logical but the same physical result types, and making them similar to their Iceberg versions is just a coincidence.

@peter-toth (Contributor Author):

But just let me know if you think a spin-off still makes sense.

Member:

It's simply that I can help you more easily if you offer a spin-off PR (in terms of review speed, based on the narrowed review scope).

Anyway, it's up to you, @peter-toth. You can keep everything in a single bucket as you wish. No arguable point on that. :)

@dongjoon-hyun (Member) commented Mar 18, 2026:

Thank you for catching this and providing a fix promptly, @peter-toth.
I'll leave this to the other reviewers.

@gengliangwang (Member):

cc @szehon-ho as well

@szehon-ho (Member):

I'm taking a look, thanks.

/**
 * Returns the {@link DataType data type} of values produced by this reducer.
 *
 * As a reducer doesn't know the result {@link DataType data type} of the reduced transform
Member:

This comment is a bit confusing? Do you mean, 'if the reducer doesn't know the result...'?

Anyway, this was a mistake on my end when introducing the API, sorry about it. I think the resultType is actually pretty important to have.

One more thought: we can also clarify the class tag O to use the same language (result), and indicate that it's the physical Java type, to reduce confusion.

@peter-toth (Contributor Author):

Yeah, this sentence was confusing.

In the latest commit I extracted the new resultType() method to TypedReducer to address @dongjoon-hyun's comment about the counter-intuitive null usage (#54884 (comment)) and elaborated on the types.

@pan3793 (Member) commented Mar 19, 2026:

Properly compares the left and right side reduced key types and return an error when they are not the same.

The previous behavior of always using the left side key type is indeed problematic, but the new rule looks too strict. Is it possible to follow the behavior of join key type mismatch handling?

When a join has an EqualTo(leftKey, rightKey) condition where types differ, ImplicitTypeCoercion kicks in:

  1. Calls findTightestCommonType(left.dataType, right.dataType) to find a compatible type
  2. Wraps operands in Cast expressions to coerce both to the common type

@peter-toth (Contributor Author) commented Mar 19, 2026:

When a join has an EqualTo(leftKey, rightKey) condition where types differ, ImplicitTypeCoercion kicks in:

  1. Calls findTightestCommonType(left.dataType, right.dataType) to find a compatible type
  2. Wraps operands in Cast expressions to coerce both to the common type

I think this is a slightly different issue from type coercion, as the ReducibleFunctions on both sides know about each other when they return the Reducers. It is the Reducers' responsibility to produce comparable reduced values. The only issue now is that we don't know the type of those values.

@pan3793 (Member) commented Mar 19, 2026:

The Reducers' responsibility to produce comparable reduced values.

@peter-toth, this sounds reasonable; maybe we should emphasize that in the javadocs? The = check requires that both the value and the data type match exactly:

  • ... r(f_source(x)) = f_target(x) ...
  • ... r1(f_source(x)) = r2(f_target(x)) ...
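The first of the equalities above can be checked with a hypothetical hours/days pair (the transforms below are simplified stand-ins with made-up names, not Iceberg's implementations): a reducer r from the source transform (hours) to the target transform (days) must satisfy r(f_hours(x)) = f_days(x) for every input x.

```java
// Hypothetical property check for r(f_source(x)) = f_target(x), using
// simplified hours/days transforms over epoch microseconds.
public class ReducerProperty {
    static long fHours(long micros) { return micros / 3_600_000_000L; }  // hours since epoch
    static long fDays(long micros)  { return micros / 86_400_000_000L; } // days since epoch
    static long r(long hours)       { return Math.floorDiv(hours, 24); } // hours -> days reducer

    public static void main(String[] args) {
        // Sample a range of timestamps and verify the reducer property holds.
        for (long x = 0; x < 2_000_000_000_000_000L; x += 123_456_789_012_345L) {
            if (r(fHours(x)) != fDays(x)) {
                throw new AssertionError("mismatch at x=" + x);
            }
        }
        System.out.println("r(f_hours(x)) == f_days(x) held for all samples");
    }
}
```

Even when this value equality holds, the wrappers only compare equal if the data types also match, which is why documenting the exact-type requirement matters.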
