Implement write.parquet.row-group-size-bytes in the pyarrow writer #3449

Closed
stephrb wants to merge 1 commit into apache:main from imc-trading:sbuck/implement-row-group-size-bytes

@stephrb stephrb commented Jun 1, 2026

The pyiceberg writer has historically ignored write.parquet.row-group-size-bytes (logging 'not implemented') and used only write.parquet.row-group-limit (rows). For wide tables that means a single row group can reach gigabytes: e.g. 337 cols × 1,048,576 default rows ≈ 1.7 GiB uncompressed per row group. That drives the polars / pyarrow reader's decode peak into the tens of GiB on production reads.

Now write_file resolves row_group_size as min(row_group_limit, row_group_size_bytes / bytes_per_row), where bytes_per_row is approximated as the in-memory arrow_table's nbytes divided by its row count. This matches the Spark / parquet-mr 'whichever limit fires first' semantics and lets the existing PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT (128 MiB) actually take effect.
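For illustration only, the resolution rule described above might be sketched as follows. This is not the code from the PR: the function name `resolve_row_group_size`, its parameters, the fallback for empty tables, and the floor of one row are all assumptions. In the writer, the two table inputs would presumably come from `arrow_table.nbytes` and `arrow_table.num_rows`; plain integers are used here so the sketch runs without pyarrow.

```python
# Hypothetical sketch of "whichever limit fires first"; not the PR's actual code.
ROW_GROUP_LIMIT_DEFAULT = 1_048_576                 # rows, default quoted in the description
ROW_GROUP_SIZE_BYTES_DEFAULT = 128 * 1024 * 1024    # 128 MiB, default quoted in the description


def resolve_row_group_size(
    table_nbytes: int,
    table_num_rows: int,
    row_group_limit: int = ROW_GROUP_LIMIT_DEFAULT,
    row_group_size_bytes: int = ROW_GROUP_SIZE_BYTES_DEFAULT,
) -> int:
    """Return the number of rows per row group under both limits."""
    # Assumption: with no rows or no bytes there is nothing to estimate,
    # so fall back to the row-count limit alone.
    if table_num_rows <= 0 or table_nbytes <= 0:
        return row_group_limit

    # Approximate the per-row footprint from the in-memory table size.
    bytes_per_row = table_nbytes / table_num_rows

    # Rows that fit in the byte budget; keep at least one row per group
    # (an assumption, so very wide rows still produce a valid size).
    rows_for_byte_budget = max(1, int(row_group_size_bytes // bytes_per_row))

    return min(row_group_limit, rows_for_byte_budget)


# Wide table: 337 columns, assumed to be 8-byte values, 10,000 rows in memory.
wide = resolve_row_group_size(table_nbytes=337 * 8 * 10_000, table_num_rows=10_000)

# Narrow table: one assumed 8-byte column, so the row limit fires first.
narrow = resolve_row_group_size(table_nbytes=8 * 1_000, table_num_rows=1_000)
```

Under these assumed column widths the wide table is capped by the byte budget (about 49.8 thousand rows per group), while the narrow table still uses the 1,048,576-row limit. Note that the estimate is based on uncompressed in-memory size, so the on-disk row groups would typically be smaller than the byte target.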

@stephrb stephrb force-pushed the sbuck/implement-row-group-size-bytes branch from e52ff2d to 421aafa Compare June 1, 2026 22:41
@stephrb stephrb force-pushed the sbuck/implement-row-group-size-bytes branch from 421aafa to c21585e Compare June 1, 2026 22:47
@stephrb stephrb closed this Jun 1, 2026
@stephrb stephrb deleted the sbuck/implement-row-group-size-bytes branch June 1, 2026 22:53