fix(spark): parse month, day without leading zeros by deepyaman · Pull Request #7773 · tobymao/sqlglot

deepyaman · 2026-06-20T17:31:34Z

Regarding testing each dialect, I ended up testing Ibis backends, which largely supports the earlier conclusion that Spark 3+ (and Databricks) were the issues:

Verdict	Backends
✅ Works	DuckDB, Polars, BigQuery, MySQL, PostgreSQL, Trino, SingleStoreDB, Oracle, RisingWave
❌ Returns NULL silently	Impala, PySpark, Databricks
❌ Returns wrong date	Flink
— Not implemented	ClickHouse, SQLite, DataFusion, MS SQL Server, Druid, Exasol
— Not available	Materialize

Impala and Flink don't have SQLGlot dialect (although Ibis extends the Hive dialect for Impala), and Flink issues specifically run deeper. I can try to spend some time testing the backends Ibis doesn't implement this functionality for separately, if helpful.

Note: I did use Claude specifically to implement the additions in this PR, as well as to test across Ibis backends, but I have reviewed the code/implementation and think it makes sense.

Spark 3+ parses MM/dd strictly (single-digit months/days don't parse), unlike the lax %m/%d most dialects produce. Map Spark's MM/dd to a distinct canonical token (%mstrict/%dstrict) only when parsing, so the strict format roundtrips while lax %m/%d still becomes the lenient M/d. Formatting keeps the padded %m/%d -> MM/dd. The internal tokens degrade to %m/%d for every other dialect via the metaclass inverse fallback and the generic strtotime_sql. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>

Replace the per-call local import and `assert isinstance(self.dialect, Spark)` in SparkGenerator.format_time with a TYPE_CHECKING-guarded `dialect: Spark` class annotation, removing the runtime overhead while keeping mypy happy. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>

Removes the pyspark-specific StringToDate/StringToTimestamp override that mapped Python strftime tokens to lenient single-letter Spark specifiers. This is handled upstream by tobymao/sqlglot#7773, which makes Spark's StrToDate/StrToTime emit lenient `M`/`d` while keeping strict `MM`/`dd` only when round-tripping Spark's own SQL. Until that sqlglot release is available and the lower bound is bumped, the released sqlglot emits strict `MM/dd` and Spark returns NULL for single-digit month/day, so mark the single-digit `as_date` tests `notyet` for pyspark/databricks (mirroring impala). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… [revert before merge] Temporary: installs the patched sqlglot (tobymao/sqlglot#7773) in the test_pyspark job via a submodule-stripped clone + `uv add` path source, and drops the pyspark/databricks `notyet` markers so the single-digit `as_date` tests are expected to PASS. This proves the PR's generated SQL actually executes single-digit dates in strict (non-legacy) Spark, not just an identical-format inference. Revert this commit; it must not merge. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

CI on the override-free code revealed that Spark 3+ in strict (non-legacy) mode does not silently return NULL for single-digit %m/%d — it raises: SparkUpgradeException on Spark 3.5 and DateTimeException on Spark 4.0 (both subclasses of pyspark's base PySparkException, including the Spark Connect variant). Databricks raises DatabricksServerOperationError. Point the pyspark/databricks `notyet` markers at those exceptions instead of AssertionError so the single-digit `as_date` tests xfail cleanly until the sqlglot fix (tobymao/sqlglot#7773) is released and the lower bound is bumped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The strict->lax canonical fallback was applied only to INVERSE_TIME_MAPPING, so BigQuery's `FORMAT '...'` clause (which uses INVERSE_FORMAT_MAPPING) leaked the internal token, e.g. Spark `TO_DATE(x, 'MM/dd/yyyy')` -> BigQuery `... FORMAT 'MMstrict/DDstrict/YYYY'`. Apply the same fallback to INVERSE_FORMAT_MAPPING via a shared helper so it degrades to 'MM/DD/YYYY'. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>

…LAUDE Reuse dialect.STRICT_PARSE_TIME_EXPRESSIONS in SparkGenerator.format_time instead of a duplicate LENIENT_TIME_EXPRESSIONS tuple, removing the cross-module constant that had to be kept in sync by hand. Also document why the base strtotime_sql degrades only the strict tokens rather than routing the whole format through self.format_time(). Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>

Declare LENIENT_INVERSE_TIME_MAPPING/_TRIE on the base Dialect (mirroring STRICT_TIME_MAPPING and INVERSE_TIME_MAPPING) and build the trie in the metaclass. This drops the Spark-only attribute that forced a `dialect: Spark` type-narrowing annotation in SparkGenerator plus its TYPE_CHECKING import, and lets spark.py stop hand-building the trie. No behavior change. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>

…CLAUDE Replace the `inverse_time_mapping is None and ...` guard with the same `inverse_time_mapping or ...` fallback idiom the base Generator.format_time uses. Same semantics (an explicitly-passed mapping still wins), but consistent with the base and without the extra clause. No behavior change. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>

…AUDE Drop the "used by the generator's format_time override / trie built by the metaclass" note on Spark.LENIENT_INVERSE_TIME_MAPPING (no other mapping documents where it's consumed or that the metaclass builds its trie), and reword the base docstring's awkward "Lets e.g. Spark" to the parenthetical "(e.g. Spark)" form already used by STRICT_TIME_MAPPING. Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>

deepyaman added 5 commits June 10, 2026 17:05

fix(spark): parse month, day without leading zeros

4806819

fix(spark): also fix generation of exp.StrToTime

4c2e16b

chore(spark): narrow type to fix mypy attr-defined

af89ba3

deepyaman marked this pull request as ready for review June 20, 2026 17:31

deepyaman mentioned this pull request Jun 20, 2026

test: add single-digit month/day date/timestamp cast tests (#12004) ibis-project/ibis#12021

Draft

deepyaman added 2 commits June 20, 2026 14:56

georgesittas requested a review from geooo109 June 21, 2026 14:18

georgesittas assigned geooo109 Jun 21, 2026

geooo109 reviewed Jun 22, 2026

View reviewed changes

Comment thread sqlglot/generators/spark.py Outdated

Comment thread sqlglot/generators/spark.py Outdated

Comment thread sqlglot/generators/spark.py Outdated

deepyaman added 3 commits June 22, 2026 18:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(spark): parse month, day without leading zeros#7773

fix(spark): parse month, day without leading zeros#7773
deepyaman wants to merge 10 commits into
tobymao:mainfrom
deepyaman:patch-1

deepyaman commented Jun 20, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

deepyaman commented Jun 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

deepyaman commented Jun 20, 2026 •

edited

Loading