Skip to content

[SPARK-57040][SQL] JDBC connector supports pushdown TABLESAMPLE SYSTEM#56092

Closed
pan3793 wants to merge 3 commits into
apache:masterfrom
pan3793:SPARK-57040
Closed

[SPARK-57040][SQL] JDBC connector supports pushdown TABLESAMPLE SYSTEM#56092
pan3793 wants to merge 3 commits into
apache:masterfrom
pan3793:SPARK-57040

Conversation

@pan3793
Copy link
Copy Markdown
Member

@pan3793 pan3793 commented May 24, 2026

What changes were proposed in this pull request?

This PR contains 3 parts:

  1. JdbcDialect API change - deprecate def supportsTableSample: Boolean and def getTableSample(sample: TableSampleInfo): String, introduce a def compileTableSample(sample: TableSampleInfo): Option[String] as the replacement.
    1.1. this is a correctness fix - PostgreSQL and Databricks SQL do not support withReplacement = true (actually, it seems no mainstream RDBMS or OLAP engine supports TABLESAMPLE with replacement, Spark only supports it in DataFrame API), but the current implementation ignores withReplacement and always treats it as withReplacement = false, which is incorrect.
    1.2. it's a pre-step to support TABLESAMPLE SYSTEM as some RDBMSs may only support TABLESAMPLE BERNOULLI but not TABLESAMPLE SYSTEM.

  2. Mark the old pushTableSample method as deprecated and suggest the new API, connectors are suggested to only implement the new pushTableSample that has a sampleMethod parameter.

  3. Extend the built-in JDBC connector to support pushdown TABLESAMPLE SYSTEM, for now, this is only applicable to the PostgreSQL dialect.

Why are the changes needed?

Correctness fix - TABLESAMPLE withReplacement = true should not push down to PostgreSQL and Databricks JDBC connector.

Make JdbcDialects API more flexible - RDBMS may partially support TABLESAMPLE, now they can answer whether the expression can be pushed down by testing the real TABLESAMPLE expression instead of blindly answering yes/no without knowing the input.

Extend the built-in JDBC connector feature.

Does this PR introduce any user-facing change?

Yes, see the above section for details.

How was this patch tested?

New UT/IT cases are added.

Was this patch authored or co-authored using generative AI tooling?

Contains Content Generated-by: MiMo-V2.5-Pro

*/
def withTableSample(sample: TableSampleInfo): JdbcSQLQueryBuilder = {
tableSampleClause = dialect.getTableSample(sample)
def withTableSampleClause(clause: String): JdbcSQLQueryBuilder = {
Copy link
Copy Markdown
Member Author

@pan3793 pan3793 May 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MIMA complains about it - this changes the propagation from the uncompiled TableSampleInfo to the compiled TABLESAMPLE ... clause, to avoid calling the JdbcDialect.compileTableSample again at the end.

override def getTableSample(sample: TableSampleInfo): String = {
s"TABLESAMPLE (${(sample.upperBound - sample.lowerBound) * 100}) REPEATABLE (${sample.seed})"
override def compileTableSample(sample: TableSampleInfo): Option[String] = {
if (sample.withReplacement || sample.sampleMethod == SampleMethod.System) return None
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

according to https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-qry-select-sampling, Databricks SQL does not support TABLESAMPLE SYSTEM

* Pushes down BERNOULLI (row-level) SAMPLE to the data source.
*/
boolean pushTableSample(
@Deprecated(since = "4.2.0")
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in fact, this method should be treated as deprecated after SPARK-55978 (4.2.0), connectors do not need to implement this method if they already implement the new pushTableSample method that has an additional sampleMethod parameter.

@pan3793
Copy link
Copy Markdown
Member Author

pan3793 commented May 24, 2026

@stanyao @szehon-ho @cloud-fan @huaxingao could you please take a look? this might need to be included in 4.2 (entirely or partially)

MIMA failures will be fixed after deciding on #56092 (comment)

Copy link
Copy Markdown
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prior state and problem. The dialect API exposes def supportsTableSample: Boolean and def getTableSample(sample): String. The dialect has to answer the support question without ever seeing the sample, which has two consequences:

  • PostgreSQL has historically returned supportsTableSample = true and ignored withReplacement — silently treating withReplacement = true as false. PG (and effectively every mainstream RDBMS) does not support replacement sampling, so this is a correctness bug.
  • After SPARK-55978 added SampleMethod (BERNOULLI/SYSTEM) to the scanBuilder pushdown API in 4.2.0, the dialect-side still has no way to say "I can compile BERNOULLI but not SYSTEM" — it can only blanket-answer the support question.

Design approach. Replace the yes/no pair with a single compileTableSample(sample): Option[String] that returns Some(clause) when the dialect can compile the given sample and None to refuse pushdown. This shifts the decision from "static capability" to "per-request compilability" — symmetric with the 5-arg SupportsPushDownTableSample.pushTableSample. The old methods stay (@deprecated("Use compileTableSample instead", "4.2.0")) with a default compileTableSample impl delegating to them for backward compat.

JDBCScanBuilder overrides the 5-arg pushTableSample, converts the V2 SampleMethod via a new V2ExpressionUtils.toCatalyst, calls dialect.compileTableSample, and stores the compiled clause directly. JdbcSQLQueryBuilder switches from withTableSample(TableSampleInfo) to withTableSampleClause(String) so the builder no longer needs the dialect at SQL-emit time.

PostgresDialect now rejects withReplacement = true (the cited correctness fix) and supports SYSTEM. DatabricksDialect also rejects withReplacement = true symmetrically (worth a one-line mention in the PR description, since the description currently names only PG) and continues to reject SYSTEM, per Databricks SQL not supporting it.

Key design decisions.

  • Returning Option[String] rather than throwing lets the caller transparently fall back to client-side sampling for BERNOULLI, while V2ScanRelationPushDown already surfaces UNSUPPORTED_FEATURE.TABLESAMPLE_SYSTEM for SYSTEM. Clean fit with the existing failure path.
  • Pre-compiling the clause in the scanBuilder and passing the string into JdbcSQLQueryBuilder avoids re-invoking compileTableSample at build time. The trade-off is that removing withTableSample(TableSampleInfo) is a @since 3.5.0 ABI break — see inline comment for a bridge that resolves this without an MIMA exclusion.

Implementation sketch.

  • SupportsPushDownTableSample.java — old 4-arg pushTableSample becomes @Deprecated(since = "4.2.0") with a default body that throws mustOverrideOneMethodError. Binary-safe; existing 3.x overrides keep working.
  • JdbcDialect — adds compileTableSample; deprecates supportsTableSample and getTableSample.
  • JDBCScanBuilder, JDBCScan, JDBCV1RelationFromV2Scan, JDBCRelation, JDBCRDD — field type changes from Option[TableSampleInfo] to Option[String]; one test (JdbcTaskInterruptSuite) updated for the renamed parameter. These are all in internal packages.
  • PostgresIntegrationSuite — opts into the new SYSTEM tests via supportsTableSampleSystem = true.
  • V2JDBCTest — adds supportsTableSampleSystem (default false), a SYSTEM gridTest block, and a new withReplacement test.

The most consequential finding is on the backward-compat path for third-party dialects compiled against released Spark (3.5.x): the default compileTableSample doesn't gate on sample.sampleMethod or sample.withReplacement, so a dialect that overrode the old supportsTableSample = true path will now be asked to compile SYSTEM and withReplacement samples — and silently return a BERNOULLI / replacement-ignoring clause. This re-introduces in third-party dialects the same correctness bug the PR fixes for built-in PG. See the inline comment.

Comment on lines +887 to +889
def compileTableSample(sample: TableSampleInfo): Option[String] = {
if (supportsTableSample) Some(getTableSample(sample)) else None
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default delegation to getTableSample(sample) ignores sample.sampleMethod and sample.withReplacement. A third-party dialect compiled against 3.5.x that opted in via the deprecated supportsTableSample = true / getTableSample(...) was written under the old contract (BERNOULLI only, withReplacement blindly accepted). With the PR's change, JDBCScanBuilder.pushTableSample no longer gates on dialect.supportsTableSample and always calls compileTableSample with the real sampleMethod / withReplacement. The result: such a dialect is asked to compile a SYSTEM-or-withReplacement sample and silently returns the old BERNOULLI-style string — Spark believes the sample was pushed down with the requested semantics, the connector runs the wrong semantics. This is the exact correctness bug the PR cites as motivation, just shifted onto legacy third-party dialects.

Suggested fix — gate the legacy path to the original contract:

Suggested change
def compileTableSample(sample: TableSampleInfo): Option[String] = {
if (supportsTableSample) Some(getTableSample(sample)) else None
}
def compileTableSample(sample: TableSampleInfo): Option[String] = {
if (supportsTableSample &&
!sample.withReplacement &&
sample.sampleMethod == SampleMethod.Bernoulli) {
Some(getTableSample(sample))
} else {
None
}
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense, fixed

*/
def withTableSample(sample: TableSampleInfo): JdbcSQLQueryBuilder = {
tableSampleClause = dialect.getTableSample(sample)
def withTableSampleClause(clause: String): JdbcSQLQueryBuilder = {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: your self-comment on MIMA — JdbcSQLQueryBuilder is @since 3.5.0 and is the documented extension point (Oracle / MySQL subclass it via getJdbcSQLQueryBuilder, third parties can too), so removing withTableSample(TableSampleInfo) is a real released-ABI break, not just MIMA noise to exclude. Cleanest fix is to keep it as a deprecated bridge:

@deprecated("Use withTableSampleClause(String) instead", "4.2.0")
def withTableSample(sample: TableSampleInfo): JdbcSQLQueryBuilder = {
  tableSampleClause = dialect.getTableSample(sample)
  this
}

def withTableSampleClause(clause: String): JdbcSQLQueryBuilder = {
  tableSampleClause = clause
  this
}

I'd keep the bridge body calling getTableSample rather than compileTableSample for behavior parity — the old method body was identical, so any third-party caller sees the same behavior as before.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adopted, and verified locally by runningdev/MIMA

if (partitioningEnabled) {
multiplePartitionAdditionalCheck(df1, partitionInfo)
}
assert(df6.collect().length <= 10)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

df6 is bounded by LIMIT 2 (line 781), so df6.collect().length <= 10 is trivially true and never exercises the SYSTEM-sampled row count. The two df6 references in the new SYSTEM block (this line and line 824) look like copy-paste from the BERNOULLI cases — they should reference df9 and df10 respectively. With PG TABLESAMPLE SYSTEM on a 10-row, single-block table at 50%, the result will commonly be all 10 rows or 0 rows, so consider asserting something stronger than <= 10, e.g. also verifying that the scan plan reflects column pruning for df10 by inspecting collected rows.

Copy link
Copy Markdown
Member Author

@pan3793 pan3793 May 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed the copy-paste issue - wrong variable reference.

With PG TABLESAMPLE SYSTEM on a 10-row, single-block table at 50%, the result will commonly be all 10 rows or 0 rows, so consider asserting something stronger than <= 10

it's true, but I think it's about PG implementation details, not something in the contract - it has no guarantee that a few rows will be stored in a single physical block, so I keep the <= 10 assertion.

Comment thread sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala Outdated
@cloud-fan
Copy link
Copy Markdown
Contributor

for 4.2.0 necessarily: does this PR change any unreleased API of 4.2.0?

@pan3793
Copy link
Copy Markdown
Member Author

pan3793 commented May 25, 2026

for 4.2.0 necessarily: does this PR change any unreleased API of 4.2.0?

@cloud-fan, maybe this one https://github.com/apache/spark/pull/56092/changes#r3295020608 - I mentioned it in the PR description part (2)

and as this includes a correctness fix, we'd better include it in the 4.2. if you have concerns about "it brings a new feature - PG connector supports pushdown TABLESAMPLE SYSTEM", I can split this part into a dedicated PR and target 4.3.

Copy link
Copy Markdown
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, and +1 for 4.2.0

@pan3793 pan3793 closed this in 499172a May 26, 2026
pan3793 added a commit that referenced this pull request May 26, 2026
### What changes were proposed in this pull request?

This PR contains 3 parts:

1. `JdbcDialect` API change - deprecate `def supportsTableSample: Boolean` and `def getTableSample(sample: TableSampleInfo): String`, introduce a `def compileTableSample(sample: TableSampleInfo): Option[String]` as the replacement.
  1.1. this is a correctness fix - PostgreSQL and Databricks SQL do not support `withReplacement = true` (actually, it seems no mainstream RDBMS or OLAP engine supports TABLESAMPLE with replacement, Spark only supports it in DataFrame API), but the current implementation ignores `withReplacement` and always treats it as `withReplacement = false`, which is incorrect.
  1.2. it's a pre-step to support `TABLESAMPLE SYSTEM` as some RDBMSs may only support `TABLESAMPLE BERNOULLI` but not `TABLESAMPLE SYSTEM`.

2. Mark the old `pushTableSample` method as deprecated and suggest the new API, connectors are suggested to only implement the new `pushTableSample` that has a `sampleMethod` parameter.

3. Extend the built-in JDBC connector to support pushdown `TABLESAMPLE SYSTEM`, for now, this is only applicable to the PostgreSQL dialect.

### Why are the changes needed?

Correctness fix - TABLESAMPLE `withReplacement = true` should not push down to PostgreSQL and Databricks JDBC connector.

Make `JdbcDialects` API more flexible - RDBMS may partially support TABLESAMPLE, now they can answer whether the expression can be pushed down by testing the real TABLESAMPLE expression instead of blindly answering yes/no without knowing the input.

Extend the built-in JDBC connector feature.

### Does this PR introduce _any_ user-facing change?

Yes, see the above section for details.

### How was this patch tested?

New UT/IT cases are added.

### Was this patch authored or co-authored using generative AI tooling?

Contains Content Generated-by: MiMo-V2.5-Pro

Closes #56092 from pan3793/SPARK-57040.

Authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
(cherry picked from commit 499172a)
Signed-off-by: Cheng Pan <chengpan@apache.org>
pan3793 added a commit that referenced this pull request May 26, 2026
### What changes were proposed in this pull request?

This PR contains 3 parts:

1. `JdbcDialect` API change - deprecate `def supportsTableSample: Boolean` and `def getTableSample(sample: TableSampleInfo): String`, introduce a `def compileTableSample(sample: TableSampleInfo): Option[String]` as the replacement.
  1.1. this is a correctness fix - PostgreSQL and Databricks SQL do not support `withReplacement = true` (actually, it seems no mainstream RDBMS or OLAP engine supports TABLESAMPLE with replacement, Spark only supports it in DataFrame API), but the current implementation ignores `withReplacement` and always treats it as `withReplacement = false`, which is incorrect.
  1.2. it's a pre-step to support `TABLESAMPLE SYSTEM` as some RDBMSs may only support `TABLESAMPLE BERNOULLI` but not `TABLESAMPLE SYSTEM`.

2. Mark the old `pushTableSample` method as deprecated and suggest the new API, connectors are suggested to only implement the new `pushTableSample` that has a `sampleMethod` parameter.

3. Extend the built-in JDBC connector to support pushdown `TABLESAMPLE SYSTEM`, for now, this is only applicable to the PostgreSQL dialect.

### Why are the changes needed?

Correctness fix - TABLESAMPLE `withReplacement = true` should not push down to PostgreSQL and Databricks JDBC connector.

Make `JdbcDialects` API more flexible - RDBMS may partially support TABLESAMPLE, now they can answer whether the expression can be pushed down by testing the real TABLESAMPLE expression instead of blindly answering yes/no without knowing the input.

Extend the built-in JDBC connector feature.

### Does this PR introduce _any_ user-facing change?

Yes, see the above section for details.

### How was this patch tested?

New UT/IT cases are added.

### Was this patch authored or co-authored using generative AI tooling?

Contains Content Generated-by: MiMo-V2.5-Pro

Closes #56092 from pan3793/SPARK-57040.

Authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
(cherry picked from commit 499172a)
Signed-off-by: Cheng Pan <chengpan@apache.org>
@pan3793
Copy link
Copy Markdown
Member Author

pan3793 commented May 26, 2026

thanks, merged to master/4.x/4.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants