Skip to content

fix(spark): parse month, day without leading zeros#7773

Open
deepyaman wants to merge 10 commits into
tobymao:mainfrom
deepyaman:patch-1
Open

fix(spark): parse month, day without leading zeros#7773
deepyaman wants to merge 10 commits into
tobymao:mainfrom
deepyaman:patch-1

Conversation

@deepyaman

@deepyaman deepyaman commented Jun 20, 2026

Copy link
Copy Markdown

Following up on #7739 (comment)

Regarding testing each dialect, I ended up testing Ibis backends, which largely supports the earlier conclusion that Spark 3+ (and Databricks) were the issues:

Verdict Backends
✅ Works DuckDB, Polars, BigQuery, MySQL, PostgreSQL, Trino, SingleStoreDB, Oracle, RisingWave
❌ Returns NULL silently Impala, PySpark, Databricks
❌ Returns wrong date Flink
— Not implemented ClickHouse, SQLite, DataFusion, MS SQL Server, Druid, Exasol
— Not available Materialize

Impala and Flink don't have SQLGlot dialect (although Ibis extends the Hive dialect for Impala), and Flink issues specifically run deeper. I can try to spend some time testing the backends Ibis doesn't implement this functionality for separately, if helpful.

Note: I did use Claude specifically to implement the additions in this PR, as well as to test across Ibis backends, but I have reviewed the code/implementation and think it makes sense.

Spark 3+ parses MM/dd strictly (single-digit months/days don't parse),
unlike the lax %m/%d most dialects produce. Map Spark's MM/dd to a distinct
canonical token (%mstrict/%dstrict) only when parsing, so the strict format
roundtrips while lax %m/%d still becomes the lenient M/d. Formatting keeps the
padded %m/%d -> MM/dd. The internal tokens degrade to %m/%d for every other
dialect via the metaclass inverse fallback and the generic strtotime_sql.

Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
Replace the per-call local import and `assert isinstance(self.dialect, Spark)`
in SparkGenerator.format_time with a TYPE_CHECKING-guarded `dialect: Spark`
class annotation, removing the runtime overhead while keeping mypy happy.

Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
@deepyaman deepyaman marked this pull request as ready for review June 20, 2026 17:31
deepyaman added a commit to deepyaman/ibis that referenced this pull request Jun 20, 2026
Removes the pyspark-specific StringToDate/StringToTimestamp override that
mapped Python strftime tokens to lenient single-letter Spark specifiers.
This is handled upstream by tobymao/sqlglot#7773,
which makes Spark's StrToDate/StrToTime emit lenient `M`/`d` while keeping
strict `MM`/`dd` only when round-tripping Spark's own SQL.

Until that sqlglot release is available and the lower bound is bumped, the
released sqlglot emits strict `MM/dd` and Spark returns NULL for single-digit
month/day, so mark the single-digit `as_date` tests `notyet` for
pyspark/databricks (mirroring impala).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
deepyaman added a commit to deepyaman/ibis that referenced this pull request Jun 20, 2026
… [revert before merge]

Temporary: installs the patched sqlglot (tobymao/sqlglot#7773) in the
test_pyspark job via a submodule-stripped clone + `uv add` path source, and
drops the pyspark/databricks `notyet` markers so the single-digit `as_date`
tests are expected to PASS. This proves the PR's generated SQL actually
executes single-digit dates in strict (non-legacy) Spark, not just an
identical-format inference.

Revert this commit; it must not merge.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
deepyaman added a commit to deepyaman/ibis that referenced this pull request Jun 20, 2026
CI on the override-free code revealed that Spark 3+ in strict (non-legacy)
mode does not silently return NULL for single-digit %m/%d — it raises:
SparkUpgradeException on Spark 3.5 and DateTimeException on Spark 4.0 (both
subclasses of pyspark's base PySparkException, including the Spark Connect
variant). Databricks raises DatabricksServerOperationError.

Point the pyspark/databricks `notyet` markers at those exceptions instead of
AssertionError so the single-digit `as_date` tests xfail cleanly until the
sqlglot fix (tobymao/sqlglot#7773) is released and the lower bound is bumped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The strict->lax canonical fallback was applied only to INVERSE_TIME_MAPPING,
so BigQuery's `FORMAT '...'` clause (which uses INVERSE_FORMAT_MAPPING) leaked
the internal token, e.g. Spark `TO_DATE(x, 'MM/dd/yyyy')` -> BigQuery
`... FORMAT 'MMstrict/DDstrict/YYYY'`. Apply the same fallback to
INVERSE_FORMAT_MAPPING via a shared helper so it degrades to 'MM/DD/YYYY'.

Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
…LAUDE

Reuse dialect.STRICT_PARSE_TIME_EXPRESSIONS in SparkGenerator.format_time
instead of a duplicate LENIENT_TIME_EXPRESSIONS tuple, removing the cross-module
constant that had to be kept in sync by hand. Also document why the base
strtotime_sql degrades only the strict tokens rather than routing the whole
format through self.format_time().

Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread sqlglot/generators/spark.py Outdated
Comment thread sqlglot/generators/spark.py Outdated
Comment thread sqlglot/generators/spark.py Outdated
Declare LENIENT_INVERSE_TIME_MAPPING/_TRIE on the base Dialect (mirroring
STRICT_TIME_MAPPING and INVERSE_TIME_MAPPING) and build the trie in the
metaclass. This drops the Spark-only attribute that forced a `dialect: Spark`
type-narrowing annotation in SparkGenerator plus its TYPE_CHECKING import, and
lets spark.py stop hand-building the trie. No behavior change.

Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
…CLAUDE

Replace the `inverse_time_mapping is None and ...` guard with the same
`inverse_time_mapping or ...` fallback idiom the base Generator.format_time
uses. Same semantics (an explicitly-passed mapping still wins), but consistent
with the base and without the extra clause. No behavior change.

Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
…AUDE

Drop the "used by the generator's format_time override / trie built by the
metaclass" note on Spark.LENIENT_INVERSE_TIME_MAPPING (no other mapping documents
where it's consumed or that the metaclass builds its trie), and reword the base
docstring's awkward "Lets e.g. Spark" to the parenthetical "(e.g. Spark)" form
already used by STRICT_TIME_MAPPING.

Assisted-by: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants