Implement write.parquet.row-group-size-bytes in the pyarrow writer by stephrb · Pull Request #3449 · apache/iceberg-python

stephrb · 2026-06-01T22:35:09Z

The pyiceberg writer has historically ignored write.parquet.row-group-size-bytes (logging 'not implemented') and used only write.parquet.row-group-limit (rows). For wide tables that means a single row group ends up at gigabytes, e.g. 337 cols × 1,048,576 default rows ≈ 1.7 GiB uncompressed per row group, which drives the polars / pyarrow reader's decode peak into the tens of GiB on production reads.

Now write_file resolves row_group_size as min(row_group_limit, row_group_size_bytes / bytes_per_row), where bytes_per_row is approximated from the in-memory arrow_table's nbytes. This matches Spark / parquet-mr 'whichever limit fires first' semantics and lets the existing PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT (128 MiB) actually take effect.
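
The resolution rule can be illustrated with a minimal sketch. This is not the code from the PR: the function name resolve_row_group_size and its parameters are hypothetical, and it assumes bytes_per_row is simply the table's nbytes divided by its row count. The defaults mirror the values quoted in the description (1,048,576 rows and 128 MiB).

```python
# Hypothetical helper illustrating the "whichever limit fires first" rule.
# Names and signature are assumptions for illustration, not pyiceberg's API.

DEFAULT_ROW_GROUP_LIMIT = 1_048_576               # rows, per the description
DEFAULT_ROW_GROUP_SIZE_BYTES = 128 * 1024 * 1024  # 128 MiB, per the description


def resolve_row_group_size(
    table_nbytes: int,
    num_rows: int,
    row_group_limit: int = DEFAULT_ROW_GROUP_LIMIT,
    row_group_size_bytes: int = DEFAULT_ROW_GROUP_SIZE_BYTES,
) -> int:
    """Return the number of rows per row group.

    table_nbytes / num_rows approximates bytes_per_row from the in-memory
    table, so the byte budget translates to
    row_group_size_bytes / bytes_per_row rows.
    """
    if num_rows <= 0 or table_nbytes <= 0:
        # Nothing to estimate from; fall back to the row limit alone.
        return row_group_limit
    # Integer form of row_group_size_bytes / (table_nbytes / num_rows).
    rows_by_bytes = (row_group_size_bytes * num_rows) // table_nbytes
    # Never return fewer than one row per group.
    rows_by_bytes = max(1, rows_by_bytes)
    return min(row_group_limit, rows_by_bytes)


# A wide table (337 columns at an assumed 8 bytes each) is capped by bytes;
# a narrow table (one 8-byte column) is still capped by the row limit.
wide = resolve_row_group_size(table_nbytes=337 * 8 * 1_048_576, num_rows=1_048_576)
narrow = resolve_row_group_size(table_nbytes=8 * 1_000, num_rows=1_000)
```

With a real pyarrow table, the two inputs would come from the table's nbytes and num_rows attributes, and the result would be passed as the row-group row count when writing.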


stephrb force-pushed the sbuck/implement-row-group-size-bytes branch from e52ff2d to 421aafa on June 1, 2026 22:41

stephrb force-pushed the sbuck/implement-row-group-size-bytes branch from 421aafa to c21585e on June 1, 2026 22:47

stephrb closed this Jun 1, 2026

stephrb deleted the sbuck/implement-row-group-size-bytes branch June 1, 2026 22:53

stephrb wants to merge 1 commit into apache:main from imc-trading:sbuck/implement-row-group-size-bytes
