Skip to content

feat: support arrow pycapsule streams#3447

Open
abnobdoss wants to merge 2 commits into
apache:mainfrom
abnobdoss:arrow-pycapsule-stream
Open

feat: support arrow pycapsule streams#3447
abnobdoss wants to merge 2 commits into
apache:mainfrom
abnobdoss:arrow-pycapsule-stream

Conversation

@abnobdoss
Copy link
Copy Markdown

Closes #2680
Closes #1655

Rationale for this change

PyIceberg is coupled to PyArrow at its read/write boundary: append / overwrite reject anything that isn't a pa.Table / pa.RecordBatchReader, and external Arrow consumers can't read a table/scan without to_arrow(). Users of other Arrow-native libraries (polars, arro3, nanoarrow, …) therefore have to convert to PyArrow explicitly.

This PR adopts the Arrow PyCapsule interface on both sides:

import polars as pl

df = pl.DataFrame(table.scan())     # read: a scan is an Arrow producer
table.append(some_polars_frame)     # write: a polars/arro3/… frame is too

Native PyArrow inputs are unchanged; any other producer is imported as a streaming RecordBatchReader, so streaming is preserved. PyArrow stays an internal write dependency; this only removes the requirement that the caller use PyArrow.

One small writer-side adjustment falls out of this: bin-packing still prefers Arrow's logical nbytes estimate, but falls back to referenced buffer size for Arrow view types like string_view, which current Polars exports can produce and PyArrow cannot always size with nbytes.

Not in scope: upsert / dynamic_partition_overwrite still require a materialized pa.Table (they do random access / joins and don't accept a RecordBatchReader today). A PyCapsule producer to append / overwrite on a partitioned table raises NotImplementedError, the same restriction that already applies to pa.RecordBatchReader, since the producer is consumed as a reader and streaming writes to partitioned tables aren't supported. A materialized pa.Table is unaffected.

Are these changes tested?

Yes. tests/table/test_arrow_capsule.py (runs under make test, no Docker) covers coercion-helper branches; append over all input forms (pa.Table, reader, single- and multi-batch PyCapsule producers); overwrite with a producer; the native-pa.Table-on-partitioned regression; pa.table(table) / pa.table(table.scan()) round-trips plus filter/projection; and the dst.append(src.scan()) round-trip. tests/io/test_pyarrow.py covers the string_view bin-packing fallback.

Are there any user-facing changes?

Yes, additive and backwards compatible. append / overwrite accept Arrow PyCapsule producers (__arrow_c_stream__), and Table / DataScan implement __arrow_c_stream__ so they can be passed to any Arrow consumer (e.g. pa.table(...), polars.DataFrame(...)). No change for existing PyArrow inputs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow Arrow Capsule Interface [feature] Investigate integrations leveraging the PyCapsule protocol

1 participant