[spark] Push down variant_get into Paimon shredded Variant scan #7657

Open
chenghuichen wants to merge 5 commits into apache:master from chenghuichen:spark_variant

@chenghuichen chenghuichen commented Apr 15, 2026

Purpose

Queries like SELECT variant_get(v, '$.age', 'int') FROM T on a shredded Variant column still read all sub-columns and reassemble the full binary Variant, leaving Paimon's VariantRowType / clipVariantType infrastructure unused.

This PR adds PushDownVariantExtract (Spark 4 only), a Catalyst optimizer rule that replaces VariantGet with GetStructField and sets variantProjections on PaimonScan, so only the accessed typed_value.* Parquet sub-columns are read.
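To illustrate the shape of that rewrite, here is a minimal, self-contained sketch. It is a toy model, not the PR's code: `Column`, `VariantGet`, `GetStructField`, `Alias`, and `Projection` are hypothetical stand-ins defined below rather than Spark's or Paimon's classes, and the real rule presumably also handles cases this sketch ignores (for example, a query that additionally reads the whole Variant column).

```scala
// Toy model only: every type here is a hypothetical stand-in,
// not a Spark Catalyst or Paimon class.
object VariantPushDownSketch {
  sealed trait Expr
  final case class Column(name: String) extends Expr
  // Stand-in for variant_get(child, path, targetType).
  final case class VariantGet(child: Expr, path: String, targetType: String) extends Expr
  // Stand-in for reading field `ordinal` of the struct the scan would return.
  final case class GetStructField(child: Expr, ordinal: Int) extends Expr
  final case class Alias(child: Expr, name: String) extends Expr

  // One requested extraction: which path of which column, cast to which type.
  final case class Projection(column: String, path: String, targetType: String)

  /** Rewrites each VariantGet over a plain column into a struct-field access and
    * returns the distinct projections a scan would need to produce.
    * Ordinals index into the flat projection list; a real rule would
    * track them per column. */
  def rewrite(exprs: Seq[Expr]): (Seq[Expr], Seq[Projection]) = {
    val projections = scala.collection.mutable.ArrayBuffer.empty[Projection]

    // Reuse the ordinal of an identical projection, otherwise append a new one.
    def ordinalOf(p: Projection): Int = {
      val existing = projections.indexOf(p)
      if (existing >= 0) existing
      else { projections += p; projections.length - 1 }
    }

    def go(e: Expr): Expr = e match {
      case VariantGet(c @ Column(name), path, tpe) =>
        GetStructField(c, ordinalOf(Projection(name, path, tpe)))
      case VariantGet(child, path, tpe) => VariantGet(go(child), path, tpe)
      case GetStructField(child, i)     => GetStructField(go(child), i)
      case Alias(child, name)           => Alias(go(child), name)
      case c: Column                    => c
    }

    val rewritten = exprs.map(go)
    (rewritten, projections.toSeq)
  }
}
```

In this model, `rewrite(Seq(VariantGet(Column("v"), "$.age", "int")))` yields a single `GetStructField(Column("v"), 0)` plus one `Projection("v", "$.age", "int")`, which plays the role of the `variantProjections` entry handed to the scan.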

The rule runs in the "User Provided Optimizers" batch (via experimentalMethods.extraOptimizations) to ensure it fires after V2ScanRelationPushDown has built the scan relation.
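The ordering requirement can be sketched with a toy optimizer. This is an illustration under stated assumptions, not Spark's optimizer: `Relation`, `ScanRelation`, `Project`, `buildScan`, and `pushDownVariant` are hypothetical names, and each "batch" is reduced to a single function applied once.

```scala
// Toy model only: hypothetical stand-ins for plan nodes and optimizer batches.
object RuleOrderingSketch {
  sealed trait Plan
  // A table reference before any scan has been built.
  final case class Relation(table: String) extends Plan
  // A built scan, carrying the variant paths it has been asked to project.
  final case class ScanRelation(table: String, variantProjections: Seq[String]) extends Plan
  // A projection that extracts the given variant paths from its child.
  final case class Project(variantPaths: Seq[String], child: Plan) extends Plan

  type Rule = Plan => Plan

  // Stand-in for the built-in step that turns a relation into a scan relation.
  val buildScan: Rule = {
    case Project(paths, Relation(t)) => Project(paths, ScanRelation(t, Nil))
    case other                       => other
  }

  // Stand-in for the custom rule: it can only match once a scan relation exists.
  val pushDownVariant: Rule = {
    case Project(paths, ScanRelation(t, Nil)) if paths.nonEmpty =>
      Project(paths, ScanRelation(t, paths))
    case other => other
  }

  // Applies the batches once each, in the given order.
  def optimize(plan: Plan, batches: Seq[Rule]): Plan =
    batches.foldLeft(plan)((p, rule) => rule(p))
}
```

With `Project(Seq("$.age"), Relation("T"))` as input, running `buildScan` before `pushDownVariant` leaves the scan with the `$.age` projection, while the reverse order leaves `variantProjections` empty because the custom rule finds no scan relation to match.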

Part of #4471

Note: Spark 4.0 lacks a V2-compatible variant push-down interface (SupportsPushDownVariantExtractions was introduced in 4.1), so registering a custom optimizer rule via experimentalMethods.extraOptimizations is the right fit for 4.0. For a future paimon-spark-4.1 module, a cleaner approach would be implementing SupportsPushDownVariantExtractions on PaimonScan and letting Spark's built-in V2ScanRelationPushDown handle the rewrite natively.

Tests

VariantTest.scala::VariantPushDownPlanTest (paimon-spark-4.0)
