Skip to content

[SPARK-56739][SQL] Normalize CTE ids of orphan CTERelationRef in NormalizeCTEIds#56083

Open
shrirangmhalgi wants to merge 1 commit into
apache:masterfrom
shrirangmhalgi:SPARK-56739-normalize-orphan-cte-refs
Open

[SPARK-56739][SQL] Normalize CTE ids of orphan CTERelationRef in NormalizeCTEIds#56083
shrirangmhalgi wants to merge 1 commit into
apache:masterfrom
shrirangmhalgi:SPARK-56739-normalize-orphan-cte-refs

Conversation

@shrirangmhalgi
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Normalize CTE IDs of orphan CTERelationRef nodes in NormalizeCTEIds. Previously, only CTERelationRef nodes inside WithCTE were normalized via canonicalizeCTE. Refs that exist outside any WithCTE (orphans) kept their original IDs.

Why are the changes needed?

After InlineCTE or MergeSubplans runs, some CTERelationRef nodes can end up outside their parent WithCTE node. When NormalizeCTEIds processes the plan, these orphan refs are skipped, leaving non-normalized IDs. This breaks plan comparison and caching because the same logical plan gets different CTE IDs across sessions (since CTERelationDef uses a global monotonically increasing counter).

Does this PR introduce any user-facing change?

No. This is an internal plan normalization fix that affects plan caching correctness.

How was this patch tested?

Added NormalizeCTEIdsSuite with a test that constructs a plan with a CTERelationRef outside WithCTE and verifies all ref IDs are normalized. Without the fix, the orphan ref retains its original ID (100); with the fix, it's normalized to 0.

Was this patch authored or co-authored using generative AI tooling?

Yes.

@shrirangmhalgi shrirangmhalgi marked this pull request as draft May 24, 2026 04:25
…alizeCTEIds

CTERelationRef nodes that exist outside any WithCTE node (e.g., after InlineCTE or MergeSubplans removes the parent WithCTE) were not getting their IDs normalized. This causes plan comparison and caching to fail because the same logical plan gets different CTE IDs across sessions.

Add a case in applyInternal to normalize orphan CTERelationRefs whose cteId is already in the mapping.
@shrirangmhalgi shrirangmhalgi force-pushed the SPARK-56739-normalize-orphan-cte-refs branch from 4d36b6e to 59ebea5 Compare May 24, 2026 04:50
@shrirangmhalgi shrirangmhalgi marked this pull request as ready for review May 24, 2026 16:33
@shrirangmhalgi
Copy link
Copy Markdown
Contributor Author

@yaooqinn / @cloud-fan could you please review the following PR.

@cloud-fan
Copy link
Copy Markdown
Contributor

Does #55985 fix this issue?

@rdtr
Copy link
Copy Markdown

rdtr commented May 26, 2026

@cloud-fan I do not think #55985 fixes this. They address different symptoms of the same root cause.

#55985 fixes double-mapping of nested CTE refs: canonicalizeCTE crosses WithCTE boundaries while the shared mutable map accumulates ids from sibling scopes, causing refs to be transformed twice (e.g., 0→1→0).

This PR fixes a different issue, orphan CTERelationRef nodes that end up outside any WithCTE after InlineCTE drops the wrapper. Since NormalizeCTEIds only transforms refs inside the WithCTE case
handler via canonicalizeCTE, these orphans are never reached and keep their original ids, breaking plan caching.

Both stem from the same design issue: NormalizeCTEIds interleaves building the id mapping with applying it, and only applies it within WithCTE handlers.

One concern with the current approach: the second-pass transformDownWithSubqueries won't reach orphan refs inside CacheTableAsSelect.plan, because CacheTableAsSelect extends Command (whose children returns Nil), so plan is hidden from tree traversal. It may lead to a cache miss.

A potentially simpler unified fix would be to separate the two concerns: collect all CTE def ids first, then apply the mapping globally (2-pass):

override def apply(plan: LogicalPlan): LogicalPlan = {
    val curId = new AtomicLong()
    val cteIdToNewId = mutable.Map.empty[Long, Long]
    collectCTEIds(plan, curId, cteIdToNewId)
    if (cteIdToNewId.isEmpty) plan else applyMapping(plan, cteIdToNewId)
  }

  private def collectCTEIds(
      plan: LogicalPlan, curId: AtomicLong, map: mutable.Map[Long, Long]): Unit = {
    plan match {
      case ctas: CacheTableAsSelect => collectCTEIds(ctas.plan, curId, map)
      case WithCTE(_, cteDefs) =>
        cteDefs.foreach(d => map.getOrElseUpdate(d.id, curId.getAndIncrement()))
      case _ =>
    }
    plan.children.foreach(collectCTEIds(_, curId, map))
    plan.expressions.foreach(_.foreachUp {
      case s: SubqueryExpression => collectCTEIds(s.plan, curId, map)
      case _ =>
    })
  }

  private def applyMapping(
      plan: LogicalPlan, map: mutable.Map[Long, Long]): LogicalPlan = {
    plan.transformDownWithSubqueries {
      case ctas: CacheTableAsSelect =>
        ctas.copy(plan = applyMapping(ctas.plan, map))
      case cteDef: CTERelationDef if map.contains(cteDef.id) =>
        cteDef.copy(id = map(cteDef.id))
      case ref: CTERelationRef if map.contains(ref.cteId) =>
        ref.copy(cteId = map(ref.cteId))
      case unionLoop: UnionLoop if map.contains(unionLoop.id) =>
        unionLoop.copy(id = map(unionLoop.id))
      case unionLoopRef: UnionLoopRef if map.contains(unionLoopRef.loopId) =>
        unionLoopRef.copy(loopId = map(unionLoopRef.loopId))
    }
  }

This fixes both issues: no double-mapping (each node visited once in Pass 2), no orphan misses (Pass 2 transforms all refs globally), CTAS handled explicitly in both passes, and eliminates canonicalizeCTE entirely.

Happy to help with a PR if this direction is interesting.

import org.apache.spark.sql.types.IntegerType

class NormalizeCTEIdsSuite extends SparkFunSuite {
test("SPARK-56739: orphan CTERelationRef outside WithCTE should be normalized") {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not an end-to-end test, we can also say the bug is where orphan CTERelationRef were created, and NormalizeCTEIds does nothing wrong.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed this is a unit test. However, NormalizeCTEIds is a normalization rule - its contract is to produce deterministic IDs for any valid plan shape. The test constructs a valid post-optimization state (orphan ref after InlineCTE removes the wrapper), and the rule fails to normalize it. Whether the bug manifests as a cache miss or a crash depends on downstream consumers, but the normalization contract is broken regardless. I've rebased on top of #55985 (now merged) and the fix works cleanly as a complement to it. Happy to adjust the approach if you'd prefer fixing this at the InlineCTE level instead.

@shrirangmhalgi
Copy link
Copy Markdown
Contributor Author

Thanks for the thorough analysis @rdtr. CacheTableAsSelect concern is valid and is fixed by #55985 - the existing applyInternal recursive call for CTAS should cover it since the second pass runs at the end of each applyInternal invocation. I've rebased on top of #55985 which addresses the nested-scope issue you described.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants