[Python] Table.from_pylist on ExtensionType column with list_ storage crashes when values exceed int32 offsets #50012

@adrien-grl

Describe the bug, including details regarding any error messages, version, and platform.

Bug description

When pa.Table.from_pylist is given a schema containing a pa.ExtensionType whose storage type contains a pa.list_ field, and the cumulative number of values in that list field across rows exceeds the int32 maximum (2**31 - 1), the call fails with:

TypeError: Argument 'storage' has incorrect type (expected pyarrow.lib.Array, got pyarrow.lib.ChunkedArray)

The message gives no indication of the actual cause of the issue (for instance, that it originates from a pa.list_ or a pa.ExtensionType).
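To make the overflow concrete, here is a small arithmetic sketch using the numbers from the reproduction below (5 rows of 500M values). It relies only on the fact that pa.list_ stores 32-bit offsets, so the last offset (the cumulative value count) must fit in int32. The helper name fits_int32_offsets is hypothetical and not part of pyarrow.

```python
# Illustrative arithmetic only; pyarrow is not needed to see the overflow.
INT32_MAX = 2**31 - 1  # largest offset a pa.list_ array can store


def fits_int32_offsets(row_lengths):
    """Return True if the cumulative number of list values fits in int32 offsets.

    Hypothetical helper for illustration; not a pyarrow API.
    """
    return sum(row_lengths) <= INT32_MAX


lengths = [500_000_000] * 5  # same shape as the reproduction
print(sum(lengths), INT32_MAX, fits_int32_offsets(lengths))
# prints: 2500000000 2147483647 False
```

Four such rows (2.0B values) would still fit; the fifth row pushes the total past the limit.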

Environment

  • PyArrow 24.0.0
  • Python 3.12, Linux x86_64

Minimal steps to reproduce

The code requires roughly 3 GB of RAM.

import numpy as np
import pyarrow as pa

class FooExt(pa.ExtensionType):
    def __init__(self):
        super().__init__(
            pa.struct({"data": pa.list_(pa.uint8())}),
            "foo_img",
        )

    def __arrow_ext_serialize__(self):
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

pa.register_extension_type(FooExt())

schema = pa.schema({"img": FooExt()})

# 5 rows × 500M values = 2.5B > int32 max
arr = np.zeros(500_000_000, dtype=np.uint8)
rows = [{"img": {"data": arr}} for _ in range(5)]

pa.Table.from_pylist(rows, schema=schema)
# TypeError: Argument 'storage' has incorrect type
#            (expected pyarrow.lib.Array, got pyarrow.lib.ChunkedArray)

Expected behavior

Either:

  1. An actionable error that names the column, identifies the int32-offset cause, and ideally points to the escape routes (pa.large_list, smaller batches, or manual chunked construction), or
  2. A successful build that returns a ChunkedArray<ExtensionArray> whose chunks each fit in int32 offsets.
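As a rough illustration of the "smaller batches" escape route, the sketch below groups consecutive rows so that each group's cumulative list length stays within int32 offsets. plan_batches is a hypothetical helper written for this issue, not a pyarrow function, and it only plans the split; it does not build any arrays.

```python
INT32_MAX = 2**31 - 1  # offset limit for pa.list_


def plan_batches(row_lengths, limit=INT32_MAX):
    """Group consecutive row indices so each group's total list length <= limit.

    Hypothetical helper for illustration; not part of pyarrow.
    """
    batches, current, running = [], [], 0
    for i, n in enumerate(row_lengths):
        if n > limit:
            # A single row this large cannot fit in int32 offsets at all.
            raise ValueError(f"row {i} alone has {n} values, more than {limit}")
        if current and running + n > limit:
            # Close the current group before it would overflow.
            batches.append(current)
            current, running = [], 0
        current.append(i)
        running += n
    if current:
        batches.append(current)
    return batches


print(plan_batches([500_000_000] * 5))
# prints: [[0, 1, 2, 3], [4]]
```

Presumably each group of rows could then be passed to pa.Table.from_pylist separately and the resulting tables combined with pa.concat_tables, which would give chunked columns along the lines of option 2; that end-to-end path is an assumption and has not been verified against this bug.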

Component(s)

Python
