[Python] RecordBatch.from_pylist fails for large rows #48781

@benedikt-grl

Describe the bug, including details regarding any error messages, version, and platform.

When I try to create a RecordBatch from a list with large objects, RecordBatch.from_pylist raises a TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array.

MWE:

import pyarrow as pa
import numpy as np


# Create a random array of shape [3, 720, 1280] (default dtype int64, 8 bytes per element)
rng = np.random.default_rng(42)
image = rng.integers(low=0, high=255, size=(3, 720, 1280))

# Wrap into dict
row = {
    "image": {
        "data": image.tobytes(),
        "shape": image.shape,
    }
}

# Define schema
schema = pa.schema({
    "image": pa.struct({"data": pa.binary(), "shape": pa.list_(pa.uint16(), 3)})
})

# Convert to record batch
num_rows = 98
pylist = [row] * num_rows
batch = pa.RecordBatch.from_pylist(pylist, schema=schema)

Traceback:

Traceback (most recent call last):
  File "mwe.py", line 22, in <module>
    batch = pa.RecordBatch.from_pylist(pylist, schema=schema)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 2049, in pyarrow.lib._Tabular.from_pylist
  File "pyarrow/table.pxi", line 6460, in pyarrow.lib._from_pylist
  File "pyarrow/table.pxi", line 3550, in pyarrow.lib.RecordBatch.from_arrays
TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array

When num_rows is reduced to 97, the example above runs without any error.

I suspect the issue is related to the total size in bytes of the binary data in the pylist. Each image takes 3 * 720 * 1280 * 8 = 22,118,400 bytes.
98 images take 2,167,603,200 bytes.
97 images take 2,145,484,800 bytes.
2^31 is 2,147,483,648, which lies right between these two numbers.
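
The arithmetic above can be double-checked with a short sketch. It assumes 8 bytes per element (the default int64 dtype of rng.integers) and takes 2^31 as the relevant threshold, which matches the 32-bit offsets used by pa.binary(). Whether that limit is really what triggers the error is only my suspicion. The names BYTES_PER_IMAGE, INT32_LIMIT and total_bytes are made up for this illustration.

```python
# Size of one image: 3 x 720 x 1280 elements, 8 bytes each (assumed int64).
BYTES_PER_IMAGE = 3 * 720 * 1280 * 8

# Suspected threshold: offsets of pa.binary() are 32-bit signed integers.
INT32_LIMIT = 2**31


def total_bytes(num_rows: int) -> int:
    """Total size of the binary payload for num_rows identical images."""
    return num_rows * BYTES_PER_IMAGE


for num_rows in (97, 98):
    size = total_bytes(num_rows)
    side = "below" if size < INT32_LIMIT else "at or above"
    print(f"{num_rows} rows: {size:,} bytes ({side} 2^31 = {INT32_LIMIT:,})")
```

With these numbers, 97 rows stay below the threshold and 98 rows exceed it, which matches the observed behaviour.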

While in this MWE the images consume more bytes than needed, in my use case I cannot use fewer bytes.
Is there a simple way to solve this issue?

Component(s)

Python
