feat: support arrow pycapsule streams by abnobdoss · Pull Request #3447 · apache/iceberg-python

abnobdoss · 2026-05-31T20:45:15Z

Closes #2680
Closes #1655

Rationale for this change

PyIceberg is coupled to PyArrow at its read/write boundary: append / overwrite reject anything that isn't a pa.Table / pa.RecordBatchReader, and external Arrow consumers can't read a table/scan without to_arrow(). Users of other Arrow-native libraries (polars, arro3, nanoarrow, …) therefore have to convert to PyArrow explicitly.

This PR adopts the Arrow PyCapsule interface on both sides:

Input (#2680, "Allow Arrow Capsule Interface"): append / overwrite accept any object implementing __arrow_c_stream__, in addition to PyArrow types.
Output (#1655, "[feature] Investigate integrations leveraging the PyCapsule protocol"): Table and DataScan implement __arrow_c_stream__, so they can be handed to any Arrow consumer.

import polars as pl

df = pl.DataFrame(table.scan())     # read: a scan is an Arrow producer
table.append(some_polars_frame)     # write: a polars/arro3/… frame is too

Native PyArrow inputs are unchanged; any other producer is imported as a streaming RecordBatchReader, so streaming is preserved. PyArrow stays an internal write dependency; this only removes the requirement that the caller use PyArrow.

One small writer-side adjustment falls out of this: bin-packing still prefers Arrow's logical nbytes estimate, but falls back to the referenced buffer size for Arrow view types such as string_view, which current Polars exports can produce and for which PyArrow cannot always compute nbytes.

Not in scope: upsert / dynamic_partition_overwrite still require a materialized pa.Table (they do random access / joins and don't accept a RecordBatchReader today). Passing a PyCapsule producer to append / overwrite on a partitioned table raises NotImplementedError. This is the same restriction that already applies to pa.RecordBatchReader: the producer is consumed as a reader, and streaming writes to partitioned tables aren't supported. A materialized pa.Table is unaffected.

Are these changes tested?

Yes. tests/table/test_arrow_capsule.py (runs under make test, no Docker) covers coercion-helper branches; append over all input forms (pa.Table, reader, single- and multi-batch PyCapsule producers); overwrite with a producer; the native-pa.Table-on-partitioned regression; pa.table(table) / pa.table(table.scan()) round-trips plus filter/projection; and the dst.append(src.scan()) round-trip. tests/io/test_pyarrow.py covers the string_view bin-packing fallback.

Are there any user-facing changes?

Yes, additive and backwards compatible. append / overwrite accept Arrow PyCapsule producers (__arrow_c_stream__), and Table / DataScan implement __arrow_c_stream__ so they can be passed to any Arrow consumer (e.g. pa.table(...), polars.DataFrame(...)). No change for existing PyArrow inputs.

Abanoub Doss added 2 commits May 31, 2026 15:42

e1952d8 feat: support arrow pycapsule streams
271f50d test: update arrow input error expectations


abnobdoss wants to merge 2 commits into apache:main from abnobdoss:arrow-pycapsule-stream
