Initial work for file format writer API#3119
Conversation
|
CC: @kevinjqliu @Fokko @geruh for review |
| OutputFile, | ||
| OutputStream, | ||
| ) | ||
| from pyiceberg.io.fileformat import DataFileStatistics as DataFileStatistics |
There was a problem hiding this comment.
| from pyiceberg.io.fileformat import DataFileStatistics as DataFileStatistics | |
| from pyiceberg.io.fileformat import DataFileStatistics |
There was a problem hiding this comment.
mypy wasn't happy about this previously: https://github.com/apache/iceberg-python/actions/runs/22681243975/job/65752048019
| _result: DataFileStatistics | None = None | ||
|
|
||
| @abstractmethod | ||
| def write(self, table: pa.Table) -> None: |
There was a problem hiding this comment.
A table looks to be the logical starting point, but I think an iterator of RecordBatches would also make sense. WDYT @kevinjqliu
| def partition(self, partition_spec: PartitionSpec, schema: Schema) -> Record: | ||
| return Record(*[self._partition_value(field, schema) for field in partition_spec.fields]) | ||
|
|
||
| def to_serialized_dict(self) -> dict[str, Any]: |
There was a problem hiding this comment.
Might be nice to change this into a TypedDict as a return type
There was a problem hiding this comment.
I moved it over from the original implementation. I can do a TypedDict in a follow up when I wire it through if that works?
| def get(cls, file_format: FileFormat) -> FileFormatModel: | ||
| if file_format not in cls._registry: | ||
| raise ValueError(f"No writer registered for {file_format}. Available: {list(cls._registry.keys())}") | ||
| return cls._registry[file_format] |
There was a problem hiding this comment.
I think PyIceberg diverges a bit from Java on this point. PyIceberg could have multiple implementatons for Parquet for example (Arrow/fsspec). Maybe we want something similar to the FileIO loading:
iceberg-python/pyiceberg/io/__init__.py
Line 303 in 82f6040
There was a problem hiding this comment.
I implemented the FileFormatFactory as the Python equivalent of Java's FormatModelRegistry, keyed by FileFormat alone since Python only has Arrow (vs Java needing (FileFormat, Class<?>) for Spark/Flink/Generic). Let me know if you think it's worth adding a property-based override.
Initial work for #3100. Since this is a large change, doing it in parts similar to the
AuthManagerso it's easier to review and move the existing code around.Rationale for this change
Introduces the pluggable file format writer API:
FileFormatWriter,FileFormatModel, andFileFormatFactoryinpyiceberg/io/fileformat.py. MovesDataFileStatisticsfrompyarrow.pywith are-export for backward compatibility. The move is more forward looking and the idea is to keep the stats generic in the future as we add additional formats too.
This is the first part of work for #3100. No behavioral changes; the write path remains hardcoded to Parquet.
Are these changes tested?
Yes.
tests/io/test_fileformat.pytests backward-compatible import ofDataFileStatisticsAre there any user-facing changes?
No