Initial work for file format writer API by nssalian · Pull Request #3119 · apache/iceberg-python

nssalian · 2026-03-03T18:59:23Z

Initial work for #3100. Since this is a large change, I'm doing it in parts, similar to the AuthManager work, so it's easier to review and to move the existing code around.

Rationale for this change

Introduces the pluggable file format writer API: FileFormatWriter, FileFormatModel, and FileFormatFactory in pyiceberg/io/fileformat.py. Moves DataFileStatistics out of pyarrow.py, with a re-export for backward compatibility. The move is forward-looking: the idea is to keep the statistics generic as we add more formats.

This is the first part of work for #3100. No behavioral changes; the write path remains hardcoded to Parquet.

Are these changes tested?

Yes. tests/io/test_fileformat.py tests the backward-compatible import of DataFileStatistics.

Are there any user-facing changes?

No

nssalian · 2026-03-06T16:46:58Z

CC: @kevinjqliu @Fokko @geruh for review

Fokko · 2026-03-26T18:44:52Z

pyiceberg/io/pyarrow.py

    OutputFile,
    OutputStream,
 )
+from pyiceberg.io.fileformat import DataFileStatistics as DataFileStatistics


Suggested change

-from pyiceberg.io.fileformat import DataFileStatistics as DataFileStatistics
+from pyiceberg.io.fileformat import DataFileStatistics

nssalian · 2026-03-27

mypy wasn't happy with the plain import previously: https://github.com/apache/iceberg-python/actions/runs/22681243975/job/65752048019

Fokko · 2026-03-26T18:47:24Z

pyiceberg/io/fileformat.py

+    _result: DataFileStatistics | None = None
+
+    @abstractmethod
+    def write(self, table: pa.Table) -> None:


A table looks like the logical starting point, but I think an iterator of RecordBatches would also make sense. WDYT @kevinjqliu?
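
To make the trade-off concrete, here is a rough sketch of a batch-oriented writer interface. It is not the API in this PR: BatchWriter, CountingWriter, and generate_batches are hypothetical names, and a list of row dicts stands in for pa.RecordBatch so the example runs without pyarrow.

```python
from abc import ABC, abstractmethod
from collections.abc import Iterable, Iterator

# Hypothetical stand-in for pa.RecordBatch: a list of row dicts.
Batch = list[dict[str, int]]


class BatchWriter(ABC):
    """Hypothetical writer whose primitive operation consumes one batch."""

    @abstractmethod
    def write_batch(self, batch: Batch) -> None:
        """Append one batch to the output."""

    def write_batches(self, batches: Iterable[Batch]) -> None:
        # Batches are consumed one by one, so only one needs to be in memory.
        for batch in batches:
            self.write_batch(batch)


class CountingWriter(BatchWriter):
    """Toy implementation that counts rows instead of writing a file."""

    def __init__(self) -> None:
        self.record_count = 0

    def write_batch(self, batch: Batch) -> None:
        self.record_count += len(batch)


def generate_batches(num_batches: int, rows_per_batch: int) -> Iterator[Batch]:
    # A generator, so each batch is produced only when the writer asks for it.
    for b in range(num_batches):
        yield [{"id": b * rows_per_batch + i} for i in range(rows_per_batch)]


writer = CountingWriter()
writer.write_batches(generate_batches(num_batches=3, rows_per_batch=4))
print(writer.record_count)  # 12 rows across 3 batches
```

A table-based entry point could still be layered on top by splitting the table into batches first (with pyarrow, pa.Table.to_batches()), so accepting an iterator does not rule out the simpler call.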

Fokko · 2026-03-26T18:50:23Z

pyiceberg/io/fileformat.py

+    def partition(self, partition_spec: PartitionSpec, schema: Schema) -> Record:
+        return Record(*[self._partition_value(field, schema) for field in partition_spec.fields])
+
+    def to_serialized_dict(self) -> dict[str, Any]:


It might be nice to make the return type a TypedDict.

nssalian · 2026-03-27

I moved it over from the original implementation unchanged. I can switch to a TypedDict in a follow-up when I wire it through, if that works?
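
As an illustration of what the TypedDict suggestion could look like: the sketch below is hypothetical, and the class name and field names are assumptions for the example rather than the actual keys returned by to_serialized_dict.

```python
from typing import TypedDict


class SerializedStatistics(TypedDict):
    """Hypothetical shape; the field names are illustrative assumptions."""

    record_count: int
    column_sizes: dict[int, int]
    null_value_counts: dict[int, int]
    lower_bounds: dict[int, bytes]
    upper_bounds: dict[int, bytes]


def to_serialized_dict(record_count: int, column_sizes: dict[int, int]) -> SerializedStatistics:
    # A type checker can now verify both the keys and the value types,
    # which dict[str, Any] cannot express.
    return {
        "record_count": record_count,
        "column_sizes": column_sizes,
        "null_value_counts": {},
        "lower_bounds": {},
        "upper_bounds": {},
    }


stats = to_serialized_dict(record_count=3, column_sizes={1: 128})
```

At runtime the result is still a plain dict, so existing callers that index by key would keep working.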

Fokko · 2026-03-26T18:54:59Z

pyiceberg/io/fileformat.py

+    def get(cls, file_format: FileFormat) -> FileFormatModel:
+        if file_format not in cls._registry:
+            raise ValueError(f"No writer registered for {file_format}. Available: {list(cls._registry.keys())}")
+        return cls._registry[file_format]


I think PyIceberg diverges a bit from Java on this point. PyIceberg could have multiple implementations for Parquet, for example (Arrow/fsspec). Maybe we want something similar to the FileIO loading:

iceberg-python/pyiceberg/io/__init__.py

Line 303 in 82f6040

SCHEMA_TO_FILE_IO: dict[str, list[str]] = {

nssalian · 2026-03-27

I implemented the FileFormatFactory as the Python equivalent of Java's FormatModelRegistry, keyed by FileFormat alone since Python only has Arrow (whereas Java needs (FileFormat, Class<?>) for Spark/Flink/Generic). Let me know if you think it's worth adding a property-based override.
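
One way the two ideas in this thread could be combined is a registry that keeps an ordered list of implementations per format and accepts an optional property override. This is only a sketch under stated assumptions: FormatModel, FormatRegistry, and the "write.format.impl" key are hypothetical names for illustration, and the FileFormat enum here is a stand-in, not PyIceberg's.

```python
from enum import Enum
from typing import Optional


class FileFormat(Enum):  # stand-in for PyIceberg's FileFormat
    PARQUET = "PARQUET"
    ORC = "ORC"


class FormatModel:  # hypothetical stand-in for FileFormatModel
    def __init__(self, name: str) -> None:
        self.name = name


# Hypothetical property key; not an existing PyIceberg setting.
IMPL_PROPERTY = "write.format.impl"


class FormatRegistry:
    """Ordered implementations per format; the first registered is the default."""

    def __init__(self) -> None:
        self._registry: dict[FileFormat, list[FormatModel]] = {}

    def register(self, file_format: FileFormat, model: FormatModel) -> None:
        self._registry.setdefault(file_format, []).append(model)

    def get(self, file_format: FileFormat, properties: Optional[dict[str, str]] = None) -> FormatModel:
        candidates = self._registry.get(file_format, [])
        if not candidates:
            raise ValueError(f"No writer registered for {file_format}")
        wanted = (properties or {}).get(IMPL_PROPERTY)
        if wanted is None:
            return candidates[0]  # no override: fall back to the default
        for model in candidates:
            if model.name == wanted:
                return model
        available = [m.name for m in candidates]
        raise ValueError(f"No {file_format} writer named {wanted!r}. Available: {available}")


registry = FormatRegistry()
registry.register(FileFormat.PARQUET, FormatModel("pyarrow"))
registry.register(FileFormat.PARQUET, FormatModel("other"))

default_model = registry.get(FileFormat.PARQUET)
override_model = registry.get(FileFormat.PARQUET, {IMPL_PROPERTY: "other"})
```

With a single implementation per format this behaves like the lookup keyed by FileFormat alone, so the override could be added later without changing existing callers.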

Initial work for file format writer API

ca2a398

nssalian marked this pull request as ready for review March 3, 2026 19:01

nssalian added 2 commits March 4, 2026 09:31

Nit for CI fix

7d608d6

fix for mypy

0505cca

Fokko reviewed Mar 26, 2026

nssalian added 2 commits March 26, 2026 18:22

Merge remote-tracking branch 'apache/main' into file-format-initial-work

50f5270

Add test for result none

dea73b2

nssalian requested a review from Fokko March 27, 2026 15:09

nssalian wants to merge 5 commits into apache:main from nssalian:file-format-initial-work
