Describe the bug, including details regarding any error messages, version, and platform.
Bug description
When pa.Table.from_pylist is given a schema containing a pa.ExtensionType whose storage type contains a pa.list_ field, and the total number of values in that list field across all rows exceeds the int32 maximum (2**31 - 1), the call fails with:
TypeError: Argument 'storage' has incorrect type (expected pyarrow.lib.Array, got pyarrow.lib.ChunkedArray)
The message gives no indication of the actual cause of the issue (for instance, that it originates from a pa.list_ or a pa.ExtensionType).
Environment
- PyArrow 24.0.0
- Python 3.12, Linux x86_64
Minimal steps to reproduce
The code requires roughly 3GB RAM.
import numpy as np
import pyarrow as pa


class FooExt(pa.ExtensionType):
    def __init__(self):
        super().__init__(
            pa.struct({"data": pa.list_(pa.uint8())}),
            "foo_img",
        )

    def __arrow_ext_serialize__(self):
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()


pa.register_extension_type(FooExt())
schema = pa.schema({"img": FooExt()})

# 5 rows × 500M values = 2.5B > int32 max
arr = np.zeros(500_000_000, dtype=np.uint8)
rows = [{"img": {"data": arr}} for _ in range(5)]

pa.Table.from_pylist(rows, schema=schema)
# TypeError: Argument 'storage' has incorrect type
# (expected pyarrow.lib.Array, got pyarrow.lib.ChunkedArray)
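For reference, the arithmetic behind the comment in the reproduction can be checked without allocating anything. This is only a sketch of the numbers involved; it relies on the fact that pa.list_ stores 32-bit offsets, and the variable names are mine.

```python
# pa.list_ uses 32-bit offsets, so one list array can address
# at most 2**31 - 1 child values.
INT32_MAX = 2**31 - 1

rows = 5
values_per_row = 500_000_000
total_values = rows * values_per_row

print(total_values)                 # 2500000000
print(total_values > INT32_MAX)     # True
# Number of 500M-value rows that still fit in a single list array:
print(INT32_MAX // values_per_row)  # 4
```

So the first four rows fit in one array and the fifth pushes the cumulative offset past the limit.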
Expected behavior
Either:
- An actionable error that names the column, identifies the int32-offset cause, and maybe even points at the escape routes (pa.large_list, smaller batches, or manual chunked construction), or
- A successful build that returns a ChunkedArray<ExtensionArray> whose chunks each fit in int32 offsets.
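To illustrate the "smaller batches" escape route, here is a rough sketch of how rows could be grouped so that each group's total list length stays within int32 offsets. The helper plan_batches is hypothetical (it is not part of PyArrow), and it only plans the split from the per-row list lengths; it does not build any arrays.

```python
from typing import List, Sequence

INT32_MAX = 2**31 - 1  # offset limit of pa.list_


def plan_batches(lengths: Sequence[int], limit: int = INT32_MAX) -> List[range]:
    """Group consecutive row indices so each group's summed length <= limit.

    Hypothetical helper for illustration; not a PyArrow API.
    """
    batches: List[range] = []
    start = 0    # first row index of the batch being filled
    running = 0  # total list length accumulated in that batch
    for i, n in enumerate(lengths):
        if n > limit:
            # A single row this large cannot fit in pa.list_ at all.
            raise ValueError(f"row {i} has {n} values, above the limit {limit}")
        if running + n > limit:
            # Close the current batch before row i and start a new one.
            batches.append(range(start, i))
            start, running = i, 0
        running += n
    if start < len(lengths):
        batches.append(range(start, len(lengths)))
    return batches


# The reproduction above: 5 rows of 500M values each.
print(plan_batches([500_000_000] * 5))  # [range(0, 4), range(4, 5)]
```

Each planned batch could then be converted on its own and the results combined, for example with pa.concat_tables, though I have not verified that this avoids the error in every case.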
Component(s)
Python