
The top_k implementation does not scale well with large volumes of data #172

@jordi-mas-dj

Description


Environment details

  • OS type and version: Linux
  • Python version: 3.10
  • langchain-google-spanner version: 0.82

Problem

top_k is defined as one of the init params of the SpannerGraphQAChain class:

top_k: int = 10
"""Restricts the number of results returned in the graph query."""

It is implemented as an array slice in the execute_query method:

 return self.graph.query(gql_query)[: self.top_k]

In scenarios where the query matches millions of results (e.g. "show me all the companies in Spain with more than 10 employees"), the full result set is fetched into memory before slicing, which is not good for performance.

We use ORDER BY and LIMIT in our sample queries, so this issue is not currently impacting us.
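For illustration, this is roughly the shape of the queries we run today (the graph name, labels and properties below are made up, and graph is assumed to be the SpannerGraphStore instance). Because ORDER BY and LIMIT are part of the GQL text, Spanner applies them server-side and at most top_k rows are ever fetched by the client:

gql_query = """
GRAPH CompanyGraph
MATCH (c:Company)
WHERE c.country = "Spain" AND c.employee_count > 10
RETURN c.name, c.employee_count
ORDER BY c.employee_count DESC
LIMIT 10
"""
results = graph.query(gql_query)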

Suggestions

Some ideas on how to solve it:

  • If you are keeping this as it is, it would be worth improving the docstring ("""Restricts the number of re...) to explain that it is an array slice applied after the query has returned all results, or similar.
  • In my view, a more scalable way to implement this is using GQL LIMIT (see the sketch after this list).
  • It is important to note that any solution (LIMIT, array slice, etc.) should be based on the premise that the results are ordered. It may be a good idea to add this to the top_k documentation.
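
As a rough sketch of the LIMIT-based suggestion (not a tested patch; the signature is simplified, and it assumes the generated GQL query is ordered and does not already end in a LIMIT clause), execute_query could push top_k into the query instead of slicing:

def execute_query(self, gql_query: str) -> list:
    # Sketch only: append LIMIT so Spanner truncates the result set
    # server-side, instead of the client fetching every row and slicing.
    # A real patch would have to detect or merge an existing LIMIT clause.
    limited_query = f"{gql_query.rstrip().rstrip(';')}\nLIMIT {self.top_k}"
    return self.graph.query(limited_query)

With this approach the memory and network cost is bounded by top_k, no matter how many rows the query matches.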

Labels

api: spanner (Issues related to the googleapis/langchain-google-spanner-python API)
