feat: Adds support for DATE, DATETIME and JSON BigQuery types in the Python SDK using Storage API #37253
Summary of Changes
Hello @enzomaruffa, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the Apache Beam Python SDK's BigQuery I/O capabilities by introducing support for
Highlights
Checks are failing. Will not request review until checks are succeeding.
Assigning reviewers: R: @shunping for label python.
    "BYTES": bytes,
    "TIMESTAMP": apache_beam.utils.timestamp.Timestamp,
    "GEOGRAPHY": str,
    "DATE": str,
I remember DATE/DATETIME types were already supported in the Python BigQuery Storage API. cc: @ahmedabu98
If there is indeed still a gap, we should use a Python language type rather than str. Similar treatment was done in Python JdbcIO (Jdbc Date type and time type).
Hi @Abacn! From our experience, these are definitely missing. I am not sure whether this is related to something also missing in the under-the-hood xlang implementation that uses the Java SDK, but Beam would block us before we could even deploy to Dataflow, because these types are unsupported.
I'd love to be wrong and have a simpler solution instead of this, though 🙏
DATE and DATETIME aren't supported yet (only TIMESTAMP)
But I'm not very keen on defaulting them to strings; it's not very robust. We should choose a Python native type that is closer to what a DATE/DATETIME actually is.
Some options are discussed in https://s.apache.org/beam-timestamp-strategy (under "Python Nanosecond Support" --> "DateTime"). It'll take more work, but I think it's a better long-term solution for Beam.
Hi all 👋 I've been working on this stuff with @enzomaruffa for a while. I totally understand that this solution is very hacky, but realistically we don't have the resources to dedicate to building anything more robust.
So my question essentially becomes: is there any way we can get support for all BQ types bumped in priority on your end? We've been following along and waiting for over a year at this point, hoping that it would be solved, and would love to be able to use Python Dataflow in production without maintaining our own hacky fork.
Maybe a middle ground is to allow users to pass type_overrides that override the default type mappings? This way Beam can add official mappings later on, and using strings now is possible without needing future breaking changes.
@claudevdm sounds pretty reasonable. I have pushed changes implementing this and updated the PR description. We'll run some more tests with these tweaks internally to make sure they work, and I can report back here later.
Reminder, please take a look at this PR: @shunping
Assigning new set of reviewers because the PR has gone too long without review. R: @claudevdm for label python.
Summary
This PR adds a type_overrides parameter to BigQuery I/O operations, allowing users to specify custom BigQuery-to-Python type mappings. This addresses #25946 by enabling support for types like DATE, DATETIME, and JSON without hardcoding them in the SDK, allowing flexible usage of str to solve a lot of use cases.
Background
We were migrating a production Dataflow pipeline from Java to Python, using the Storage Write API with use_beam_io_types=True (BEAM_ROW format). Our pipeline writes to BigQuery tables containing DATE, DATETIME, and JSON columns.
The pipeline failed with:
Solution
We first added hardcoded str mappings (those older commits are still in this PR). Now, rather than hardcoding new type mappings (which may conflict with future official implementations), this PR adds a type_overrides parameter that lets users define their own mappings:
Files modified:
bigquery.py: Added type_overrides to WriteToBigQuery and StorageWriteToBigQuery
bigquery_tools.py: Added type_overrides to get_beam_typehints_from_tableschema()
bigquery_schema_tools.py: Added type_overrides to bq_field_to_type(), generate_user_type_from_bq_schema(), convert_to_usertype()
Implementation Note
This PR initially added hardcoded mappings for DATE, DATETIME, and JSON types (similar to how GEOGRAPHY was added in #36121). Based on reviewer feedback, we refactored to the type_overrides approach instead. The commit history reflects this evolution: earlier commits add the hardcoded types, and later commits introduce type_overrides and remove the hardcoded mappings.
How It Works
The implementation merges user-provided overrides with the default type mappings:
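The merge can be illustrated with a minimal sketch, assuming a plain dict lookup. DEFAULT_BQ_TYPE_MAPPING and resolve_python_type are hypothetical names used only for illustration; they are not the functions modified in this PR, and the default table here is a shortened stand-in for the SDK's own mapping.

```python
from typing import Mapping, Optional

# Hypothetical, shortened stand-in for the SDK's default
# BigQuery-to-Python type mapping (not the real Beam table).
DEFAULT_BQ_TYPE_MAPPING = {
    "STRING": str,
    "INTEGER": int,
    "FLOAT": float,
    "BOOLEAN": bool,
    "BYTES": bytes,
    "GEOGRAPHY": str,
}


def resolve_python_type(
    bq_type: str,
    type_overrides: Optional[Mapping[str, type]] = None,
) -> type:
    """Look up the Python type for a BigQuery type name.

    User-provided overrides are layered on top of the defaults, so an
    override wins when both define the same BigQuery type.
    """
    # Build a new dict so the module-level defaults are never mutated.
    merged = {**DEFAULT_BQ_TYPE_MAPPING, **(type_overrides or {})}
    if bq_type not in merged:
        raise ValueError(
            f"Unsupported BigQuery type {bq_type!r}; "
            "pass a mapping for it via type_overrides."
        )
    return merged[bq_type]


# Usage: map the types discussed in this PR to str.
overrides = {"DATE": str, "DATETIME": str, "JSON": str}
date_type = resolve_python_type("DATE", overrides)
```

Because the overrides are applied last, an official mapping added to the defaults later would not change the behavior of pipelines that already pass their own override for that type.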
Testing
Added comprehensive tests for type_overrides, including the ValueError case without overrides.