FAISS and sentence-transformers in 5 Minutes

FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. It is designed to scale to very large collections of vectors, making it ideal for tasks like semantic search and recommendation systems.

We can use it in conjunction with sentence-transformers, a Python library that provides pre-trained models to generate embeddings for sentences. These embeddings capture the semantic meaning of sentences and enable various applications like semantic search, clustering, and classification.

The essence of the Hierarchical Navigable Small World (HNSW) algorithm, one of the approximate index types FAISS provides, is to build a layered graph over the data points and then navigate that graph to find nearest neighbors. A search starts from an entry point in the sparse top layer, greedily moves to whichever neighbor is closest to the query, and drops down a layer whenever no closer neighbor exists, repeating until it reaches the bottom layer. Only a small fraction of the graph is ever visited, which is what makes the search fast.
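
This tutorial sticks to an exact (flat) index below, but as a rough sketch, an HNSW index can be built in FAISS like this. The dimensionality of 384 and the 32 neighbors per node are example values, and the random vectors are stand-ins for real embeddings:

import faiss
import numpy as np

d, M = 384, 32                       # vector dimensionality, graph neighbors per node
hnsw_index = faiss.IndexHNSWFlat(d, M)

# Add some placeholder vectors and run an approximate nearest-neighbor search
vectors = np.random.random((1000, d)).astype("float32")
hnsw_index.add(vectors)
distances, neighbors = hnsw_index.search(vectors[:1], 5)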

To get started, let's install the required packages. You can do this using pip:

pip install faiss-cpu sentence-transformers

First, we need to generate sentence embeddings using a pre-trained model from sentence-transformers. For this tutorial, we’ll use the paraphrase-MiniLM-L6-v2 model, which offers a good trade-off between embedding quality and speed.

from sentence_transformers import SentenceTransformer

# Load the pre-trained model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Sample sentences
sentences = [
    "The rain in Spain falls mainly on the plain",
    "The tesselated polygon is a special type of polygon",
    "To be or not to be, that is the question"
    "It is a truth universally acknowledged...",
    "The goat ran down the hill"
]

# Generate embeddings for the sentences
embeddings = model.encode(sentences)
print(f"Embeddings shape: {embeddings.shape}")

Next, we will set up FAISS to perform the similarity search. FAISS requires the input data to be in the form of dense vectors (which we have just generated).

import faiss

# Dimensions of our embeddings
d = embeddings.shape[1]

# Creating an index for our dense vectors
index = faiss.IndexFlatL2(d)  # Using L2 (Euclidean) distance

# Adding the embeddings to the index
index.add(embeddings)

print(f"Total sentences indexed: {index.ntotal}")

With the embeddings indexed, we can now perform a semantic search. Given a query sentence, we encode it, search for the closest vectors in the FAISS index, and retrieve the most similar sentences.

# Define a query sentence
query_sentence = "A fox swiftly jumps over a sleepy dog"
query_embedding = model.encode([query_sentence])

# Perform the search
k = 2  # Number of nearest neighbors to retrieve
distances, indices = index.search(query_embedding, k)

# Display the results
print(f"Query: {query_sentence}")

print("Most similar sentences:")
for i, idx in enumerate(indices[0]):
    print(f"{i + 1}: {sentences[idx]} (Distance: {distances[0][i]})")

Now, if we want to use this in a real-world application, we can persist the index to a file as shown above and load it back into memory whenever we need to perform a search. We can also make semantic lookups feel like ordinary dictionary access by creating a dictionary-like class that uses embeddings as the indexing method. First we set up the model.

from functools import cache
from typing import Iterator, Mapping, TypeVar, cast

import torch
from sentence_transformers import SentenceTransformer

T = TypeVar("T")

EMBEDDING_MODEL = "all-MiniLM-L6-v2"


@cache
def _model():
    """Load the sentence-transformers model once and cache it for reuse."""
    return SentenceTransformer(EMBEDDING_MODEL)

We can then create a class that uses this model to perform the embedding and similarity search.

class VectorMap(Mapping[str, T]):
    """
    A dictionary that uses sentence-transformers to do vector comparison on
    keys
    """

    def __init__(self, data: Mapping[str, T], threshold: float = 0.7):
        self._data: list[T] = list(data.values())
        self._keys: list[str] = list(data.keys())
        self._model = _model()
        self._embeddings = self._model.encode(self._keys, convert_to_tensor=True)
        self._length = len(data)
        self._threshold = threshold

    def __getitem__(self, key: str) -> T:
        top_k = min(5, self._length)
        query_embedding = self._model.encode(key, convert_to_tensor=True)
        similarity_scores = self._model.similarity(query_embedding, self._embeddings)[0]
        # topk returns scores in descending order, so the first score above the
        # threshold corresponds to the best match
        scores, indices = torch.topk(similarity_scores, k=top_k)
        for score, index in zip(scores, indices):
            if score > self._threshold:
                return cast(T, self._data[index])
        raise KeyError(key)

    def __contains__(self, key) -> bool:
        query_embedding = self._model.encode(key, convert_to_tensor=True)
        scores = self._model.similarity(query_embedding, self._embeddings)[0]
        return any(score > self._threshold for score in scores)

    def __iter__(self) -> Iterator[str]:
        return iter(self._keys)

    def __len__(self):
        return self._length

    def __repr__(self):
        return "VectorMap()"

To use this class, we can create an instance and pass in a dictionary of data to index.

data = {
    "Hello": "Greeting",
    "I am a software engineer": "Occupation",
    "I like to play football": "Hobby",
    "The weather is nice today": "Weather"
}

vector_map = VectorMap(data)

print(vector_map["I am a data scientist"])
print(vector_map["I play soccer"])