
The top_k implementation does not scale well with large volumes of data #172

@jordi-mas-dj

Description


Environment details

  • OS type and version: Linux
  • Python version: 3.10
  • langchain-google-spanner version: 0.82

Problem

top_k is defined as one of the init params of the SpannerGraphQAChain class:

top_k: int = 10
"""Restricts the number of results returned in the graph query."""

It is implemented as an array slice in the execute_query method:

 return self.graph.query(gql_query)[: self.top_k]

In scenarios where the query matches millions of results (e.g. "show me all the companies in Spain with more than 10 employees"), the full result set is fetched into memory before slicing, which is not good for performance.

We use ORDER BY and LIMIT in our sample queries, so this issue is not currently impacting us.
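For illustration, this is roughly the shape of the queries we run today (the graph name, labels and properties below are made up, and graph is assumed to be the SpannerGraphStore instance). Because ORDER BY and LIMIT are part of the GQL text, Spanner applies them server-side and at most top_k rows are ever fetched by the client:

gql_query = """
GRAPH CompanyGraph
MATCH (c:Company)
WHERE c.country = "Spain" AND c.employee_count > 10
RETURN c.name, c.employee_count
ORDER BY c.employee_count DESC
LIMIT 10
"""
results = graph.query(gql_query)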

Suggestions

Some ideas on how to solve it:

  • If you are keeping this as it is, it would be worth improving the docstring ("""Restricts the number of re...) to explain that it is an array slice applied after the query has returned all results, or similar.
  • In my view, a more scalable way to implement this is using GQL LIMIT (see the sketch after this list).
  • It is important to note that any solution (LIMIT, array slice, etc.) should be based on the premise that the results are ordered. It may be a good idea to add this to the top_k documentation.
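
As a rough sketch of the LIMIT-based suggestion (not a tested patch; the signature is simplified, and it assumes the generated GQL query is ordered and does not already end in a LIMIT clause), execute_query could push top_k into the query instead of slicing:

def execute_query(self, gql_query: str) -> list:
    # Sketch only: append LIMIT so Spanner truncates the result set
    # server-side, instead of the client fetching every row and slicing.
    # A real patch would have to detect or merge an existing LIMIT clause.
    limited_query = f"{gql_query.rstrip().rstrip(';')}\nLIMIT {self.top_k}"
    return self.graph.query(limited_query)

With this approach the memory and network cost is bounded by top_k, no matter how many rows the query matches.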

Labels

api: spanner (Issues related to the googleapis/langchain-google-spanner-python API)
