Binary file modified assets/full-text-search.png
61 changes: 35 additions & 26 deletions site/en/userGuide/search-query-get/full-text-search.md
@@ -16,43 +16,45 @@ By integrating full text search with semantic-based dense vector search, you can

</div>

## Overview
## BM25 implementation

Full text search simplifies the process of text-based searching by eliminating the need for manual embedding. This feature operates through the following workflow:
Milvus provides full text search powered by BM25, a relevance scoring function widely adopted in information retrieval systems. Milvus integrates BM25 into the search workflow to return accurate, relevance-ranked text results.

1. **Text input**: You insert raw text documents or provide query text without any need for manual embedding.
Full text search in Milvus follows the workflow below:

1. **Text analysis**: Milvus uses an [analyzer](analyzer-overview.md) to tokenize input text into individual, searchable terms.
1. **Raw text input**: You insert text documents or provide a query as plain text. No embedding model is required.

1. **Function processing**: The built-in function receives tokenized terms and converts them into sparse vector representations.
1. **Text analysis**: Milvus uses an [analyzer](analyzer-overview.md) to process your text into meaningful terms that can be indexed and searched.

1. **Collection store**: Milvus stores these sparse embeddings in a collection for efficient retrieval.
1. **BM25 function processing**: A built-in function transforms these terms into sparse vector representations optimized for BM25 scoring.

1. **BM25 scoring**: During a search, Milvus applies the BM25 algorithm to calculate scores for the stored documents and ranks matched results based on relevance to the query text.
1. **Collection store**: Milvus stores the resulting sparse embeddings in a collection for fast retrieval and ranking.

1. **BM25 relevance scoring**: At search time, Milvus applies the BM25 scoring function to compute document relevance and return ranked results that best match the query terms.

![Full Text Search](../../../../assets/full-text-search.png)
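
The scoring step in this workflow can be illustrated with a minimal, pure-Python sketch of the textbook BM25 formula. This is not Milvus's implementation: the whitespace `tokenize` helper stands in for an analyzer, the `k1` and `b` values are commonly used defaults assumed here for illustration, and the sample documents are made up.

```python
import math
from collections import Counter


def tokenize(text):
    # Stand-in for an analyzer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with the textbook BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n  # average document length
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical sample documents.
docs = [
    "information retrieval is a field of study",
    "information retrieval focuses on finding relevant information",
    "data mining and information retrieval overlap in research",
]
scores = bm25_scores("relevant information", docs)
# Document indices ordered from most to least relevant.
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

In this sketch, a rare query term such as `relevant` contributes more to the score than a term that appears in every document, which is the behavior BM25 ranking relies on.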

To use full text search, follow these main steps:

1. [Create a collection](full-text-search.md#Create-a-collection-for-full-text-search): Set up a collection with necessary fields and define a function to convert raw text into sparse embeddings.
1. [Create a collection](full-text-search.md#Create-a-collection-for-BM25-full-text-search): Set up the required fields and define a BM25 function that converts raw text into sparse embeddings.

1. [Insert data](full-text-search.md#Insert-text-data): Ingest your raw text documents to the collection.

1. [Perform searches](full-text-search.md#Perform-full-text-search): Use query texts to search through your collection and retrieve relevant results.
1. [Perform searches](full-text-search.md#Perform-full-text-search): Use natural-language query text to retrieve ranked results based on BM25 relevance.

## Create a collection for full text search
## Create a collection for BM25 full text search

To enable full text search, create a collection with a specific schema. This schema must include three necessary fields:
To enable BM25-powered full text search, you must prepare a collection with the required fields, define a BM25 function to generate sparse vectors, configure an index, and then create the collection.

- The primary field that uniquely identifies each entity in a collection.
### Define schema fields

- A `VARCHAR` field that stores raw text documents, with the `enable_analyzer` attribute set to `True`. This allows Milvus to tokenize text into specific terms for function processing.
Your collection schema must include at least three required fields:

- A `SPARSE_FLOAT_VECTOR` field reserved to store sparse embeddings that Milvus will automatically generate for the `VARCHAR` field.
- **Primary field**: Uniquely identifies each entity in the collection.

### Define the collection schema
- **Text field** (`VARCHAR`): Stores raw text documents. Set `enable_analyzer=True` so that Milvus can process the text for BM25 relevance ranking. By default, Milvus uses the [`standard` analyzer](standard-analyzer.md) for text analysis. To configure a different analyzer, refer to [Analyzer Overview](analyzer-overview.md).

First, create the schema and add the necessary fields:
- **Sparse vector field** (`SPARSE_FLOAT_VECTOR`): Stores sparse embeddings automatically generated by the BM25 function.

<div class="multipleCode">
<a href="#python">Python</a>
@@ -72,9 +74,11 @@ client = MilvusClient(

schema = client.create_schema()

schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=1000, enable_analyzer=True)
schema.add_field(field_name="sparse", datatype=DataType.SPARSE_FLOAT_VECTOR)
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True) # Primary field
# highlight-start
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=1000, enable_analyzer=True) # Text field
schema.add_field(field_name="sparse", datatype=DataType.SPARSE_FLOAT_VECTOR) # Sparse vector field; no dim required for sparse vectors
# highlight-end
```

```java
@@ -197,15 +201,19 @@ export schema='{
}'
```

In this configuration,
In the preceding config,

- `id`: serves as the primary key and is automatically generated with `auto_id=True`.

- `text`: stores your raw text data for full text search operations. The data type must be `VARCHAR`, as `VARCHAR` is Milvus string data type for text storage. Set `enable_analyzer=True` to allow Milvus to tokenize the text. By default, Milvus uses the `standard`[ analyzer](standard-analyzer.md) for text analysis. To configure a different analyzer, refer to [Analyzer Overview](analyzer-overview.md).
- `text`: stores your raw text data for full text search operations. The data type must be `VARCHAR`, the Milvus string data type for text storage.

- `sparse`: a vector field reserved to store internally generated sparse embeddings for full text search operations. The data type must be `SPARSE_FLOAT_VECTOR`.

Now, define a function that will convert your text into sparse vector representations and then add it to the schema:
### Define the BM25 function

The BM25 function converts tokenized text into sparse vectors that support BM25 scoring.

Define the function and add it to your schema:

<div class="multipleCode">
<a href="#python">Python</a>
@@ -220,6 +228,7 @@ bm25_function = Function(
name="text_bm25_emb", # Function name
input_field_names=["text"], # Name of the VARCHAR field containing raw text data
output_field_names=["sparse"], # Name of the SPARSE_FLOAT_VECTOR field reserved to store generated embeddings
# highlight-next-line
function_type=FunctionType.BM25, # Set to `BM25`
)

@@ -304,7 +313,7 @@ export schema='{
</tr>
<tr>
<td><p><code>name</code></p></td>
<td><p>The name of the function. This function converts your raw text from the <code>text</code> field into searchable vectors that will be stored in the <code>sparse</code> field.</p></td>
<td><p>The name of the function. This function converts your raw text from the <code>text</code> field into BM25-compatible sparse vectors that will be stored in the <code>sparse</code> field.</p></td>
</tr>
<tr>
<td><p><code>input_field_names</code></p></td>
@@ -316,19 +325,19 @@ export schema='{
</tr>
<tr>
<td><p><code>function_type</code></p></td>
<td><p>The type of the function to use. Set the value to <code>FunctionType.BM25</code>.</p></td>
<td><p>The type of the function to use. Must be <code>FunctionType.BM25</code>.</p></td>
</tr>
</table>

<div class="alert note">

For collections with multiple `VARCHAR` fields requiring text-to-sparse-vector conversion, add separate functions to the collection schema, ensuring each function has a unique name and `output_field_names` value.
If multiple `VARCHAR` fields require BM25 processing, define **one BM25 function per field**, each with a unique name and output field.

</div>
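
The one-function-per-field rule can be checked before you build the schema. The following is a minimal sketch that uses plain dictionaries rather than the pymilvus API; the `validate_bm25_functions` helper and the `title`/`body` field names are hypothetical and exist only for illustration.

```python
def validate_bm25_functions(specs):
    """Raise ValueError if two function specs share a name or an output field."""
    seen_names = set()
    seen_outputs = set()
    for spec in specs:
        if spec["name"] in seen_names:
            raise ValueError(f"duplicate function name: {spec['name']}")
        if spec["output_field"] in seen_outputs:
            raise ValueError(f"duplicate output field: {spec['output_field']}")
        seen_names.add(spec["name"])
        seen_outputs.add(spec["output_field"])
    return True


# Hypothetical layout: two VARCHAR fields, each with its own BM25 function
# and its own SPARSE_FLOAT_VECTOR output field.
specs = [
    {"name": "title_bm25_emb", "input_field": "title", "output_field": "title_sparse"},
    {"name": "body_bm25_emb", "input_field": "body", "output_field": "body_sparse"},
]
is_valid = validate_bm25_functions(specs)
```

Each entry would then map to one BM25 function definition in the schema, with its own sparse vector field.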

### Configure the index

After defining the schema with necessary fields and the built-in function, set up the index for your collection. To simplify this process, use `AUTOINDEX` as the `index_type`, an option that allows Milvus to choose and configure the most suitable index type based on the structure of your data.
After defining the schema with necessary fields and the built-in function, set up the index for your collection.

<div class="multipleCode">
<a href="#python">Python</a>