Skip to content

Spanner loader is flaky - it can abort reading all partitions before all the data is read. #102

@nielm

Description

@nielm

The code in src/langchain_google_spanner/loader.py#L230 is incorrect

Specifically:

    for partition in partitions:
      r = snapshot.process_query_batch(partition)
      results = r.to_dict_list()
      if len(results) == 0:
        break
      # ... process rows in results

That break should be a continue so that the remaining partitions are read.

Reason being that this is a partitioned query (see https://cloud.google.com/spanner/docs/reads#read_data_in_parallel) which requires you to read all the returned partitions.

Some of these partitions may be empty, some may have data, it depends on how much data is in your tables, the query, and some Spanner internals.

This code aborts iterating through the partitions as soon as it finds the first empty partition

An additional (minor) error in this code is that the database snapshot is never closed, which is a resource leak.

Another is that this code will fail if the caller takes too long (more than 1hr) iterating over the returned values, due to some of the partitions not having been read and timing out...

Note that this partitioned query API is not that useful in this use case. It is meant to be used in a multi-threaded or multi-machine/multi-process environment (eg Dataflow/Flume) where partitions of data are processed in parallel. The code here simply processes the partitions in series (aborting if one is empty).

There is a much simpler way of querying the database, which simply returns a single ResultSet: https://cloud.google.com/python/docs/reference/spanner/latest/snapshot-usage#execute-a-sql-select-statement (see the top of the page for using a snapshot with a fixed staleness timestamp)

I recommend removing all partitioned queries from the spanner loader code, and use simple queries.

Metadata

Metadata

Assignees

Labels

api: spannerIssues related to the googleapis/langchain-google-spanner-python API.priority: p1Important issue which blocks shipping the next release. Will be fixed prior to next release.type: bugError or flaw in code with unintended results or allowing sub-optimal usage patterns.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions