[FEATURE] Sparse ANN Concurrent Query Throughput Improvement #1684

@yuye-aws

Description

Is your feature request related to a problem?

In OpenSearch 3.3, the sparse ANN feature with Seismic algorithm was released, dramatically boosting sparse vector retrieval efficiency (see RFC and blog for details). However, under high concurrency workloads (8+ concurrent query threads), we've identified a significant performance bottleneck in the LRU cache implementation used by the sparse ANN algorithm.

The current AbstractLruCache implementation uses Collections.synchronizedMap(LinkedHashMap) to track access recency. This design creates severe lock contention because every cache access requires acquiring a global lock to update the access order in the LinkedHashMap. Under concurrent load, query threads spend significant time waiting for this lock, which limits throughput scaling and increases query latency.
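
The contended pattern can be sketched as follows. This is a minimal illustration of the design described above, not the actual `AbstractLruCache` code; `SynchronizedLruCache` is a hypothetical name:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the contended design: an access-ordered LinkedHashMap wrapped in a
// synchronized view. Because accessOrder = true makes every get() mutate the
// internal linked list to record recency, even read-only cache hits must take
// the single global lock, which is the contention point under 8+ query threads.
public class SynchronizedLruCache<K, V> {
    private final Map<K, V> map;

    public SynchronizedLruCache(int capacity) {
        this.map = Collections.synchronizedMap(new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity; // evict LRU entry on overflow
            }
        });
    }

    public V get(K key) { return map.get(key); }           // takes the global lock
    public void put(K key, V value) { map.put(key, value); } // takes the global lock
}
```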

What solution would you like?

Replace the current Collections.synchronizedMap(LinkedHashMap) implementation with a more concurrency-friendly data structure such as ConcurrentHashMap. At a minimum, benchmark results should be provided with the PR.
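
One possible direction, sketched under assumptions (the class and field names below are illustrative, not existing OpenSearch code): back the cache with a `ConcurrentHashMap` and record recency as a per-entry counter instead of reordering a shared linked list, so `get()` needs no global lock and eviction scans for the smallest counter:

```java
import java.util.Comparator;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical lock-free-read LRU sketch: recency is an atomic counter per
// entry, so concurrent get() calls never contend on a single lock. Eviction
// is O(n) here for brevity; a production version would amortize it.
public class ConcurrentLruCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final AtomicLong lastAccess = new AtomicLong();
        Entry(V value, long tick) { this.value = value; lastAccess.set(tick); }
    }

    private final ConcurrentHashMap<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final AtomicLong clock = new AtomicLong();
    private final int capacity;

    public ConcurrentLruCache(int capacity) { this.capacity = capacity; }

    public V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null) return null;
        e.lastAccess.set(clock.incrementAndGet()); // record recency, no global lock
        return e.value;
    }

    public void put(K key, V value) {
        map.put(key, new Entry<>(value, clock.incrementAndGet()));
        while (map.size() > capacity) evictOldest();
    }

    private void evictOldest() {
        map.entrySet().stream()
            .min(Comparator.comparingLong(en -> en.getValue().lastAccess.get()))
            .ifPresent(en -> map.remove(en.getKey()));
    }
}
```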

What alternatives have you considered?

  1. Allow users to disable access-order tracking in the cache.
  2. Implement a cache watermark to support batch eviction.
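
Alternative 2 could look roughly like the sketch below (names and watermark semantics are assumptions, not an existing design): inserts proceed freely until a high watermark is crossed, then a single thread batch-evicts the least recently used entries down to a low watermark, amortizing eviction cost instead of paying it on every insert:

```java
import java.util.Comparator;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical watermark-based batch eviction: other threads keep serving
// queries while one thread (guarded by the 'evicting' flag) trims the cache
// from highWatermark down to lowWatermark in a single pass.
public class WatermarkCache<K, V> {
    private final ConcurrentHashMap<K, V> data = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<K, Long> lastAccess = new ConcurrentHashMap<>();
    private final AtomicLong clock = new AtomicLong();
    private final AtomicBoolean evicting = new AtomicBoolean();
    private final int highWatermark;
    private final int lowWatermark;

    public WatermarkCache(int highWatermark, int lowWatermark) {
        this.highWatermark = highWatermark;
        this.lowWatermark = lowWatermark;
    }

    public V get(K key) {
        V v = data.get(key);
        if (v != null) lastAccess.put(key, clock.incrementAndGet());
        return v;
    }

    public void put(K key, V value) {
        data.put(key, value);
        lastAccess.put(key, clock.incrementAndGet());
        // Only evict once the high watermark is crossed, and let exactly one
        // thread perform the batch eviction.
        if (data.size() > highWatermark && evicting.compareAndSet(false, true)) {
            try { evictDownToLowWatermark(); } finally { evicting.set(false); }
        }
    }

    private void evictDownToLowWatermark() {
        data.keySet().stream()
            .sorted(Comparator.comparingLong(k -> lastAccess.getOrDefault(k, 0L)))
            .limit(Math.max(0, data.size() - lowWatermark))
            .forEach(k -> { data.remove(k); lastAccess.remove(k); });
    }
}
```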

Do you have any additional context?

// single-thread
OpenSearch 'took' time statistics:
  Average: 2.36ms
  Min: 1ms
  Max: 9ms
  Median: 2ms
  P95: 4ms
  P99: 5ms

// 50 threads (slowed down by access-order updates)
OpenSearch 'took' time statistics:
  Average: 36.81ms
  Min: 1ms
  Max: 205ms
  Median: 33ms
  P95: 79ms
  P99: 107ms
