`components.embeddings`

source

This page documents the embeddings.py module, which provides functionality for embedding text and performing vector searches in a PostgreSQL database with pgvector.

Classes

`EmbeddingModelName`

class EmbeddingModelName()

This class enumerates the embedding models we have the download details for.

Model	Version	Dimensions	Summary
Bidirectional Gated Encoder	Small	384	Efficient sentence embeddings for semantic similarity tasks.
Sentence-BERT	MiniLM	384	Compact, optimized for sentence embeddings and semantic tasks.
Generalizable T5 Retrieval	Base	768	Dual encoder for scalable, general-purpose semantic search.
Generalizable T5 Retrieval	Large	1024	Enhanced version of GTR-T5 Base, ideal for large-scale tasks.
Embedding Models for Search	Base	768	Dense multilingual embeddings for semantic search and retrieval.
Embedding Models for Search	Large	1024	Larger model offering improved cross-domain retrieval performance.
DistilBERT	Base Uncased	768	Smaller, faster BERT variant retaining high performance.
DistilUSE	Base Multilingual	512	Efficient multilingual embeddings for cross-lingual tasks.
Contriever	Contriever	768	Unsupervised dense retrieval model for zero-shot semantic search.

`EmbeddingModelInfo`

class EmbeddingModelInfo()

A simple class to hold the information for embeddings models

`EmbeddingModel`

class EmbeddingModel()

A class to match the name of an embeddings model with the details required to download and use it.

`PGVectorQuery`

@component
class PGVectorQuery:
    def __init__(
		self,
		embed_vocab: List[str] | None = None,
		standard_concept: bool = False,
		top_k: int = 5,
	) -> None:

A haystack component for retrieving concept information using embeddings in a postgres database with pgvector

Parameters

embed_vocab: Optional list of vocabulary IDs to filter results
standard_concept: Whether to only return standard concepts
top_k: Maximum number of results to return

`run`

@component.output_types(documents=List[Document])
def run(self, query_embedding: List[float])

Performs a vector similarity search in the database.

Parameters:

query_embedding: List of floats representing the query embedding

Returns

Dictionary with key documents containing a list of Haystack Document objects with:

id: Concept ID
content: Concept text
score: Similarity score

`Embeddings`

class Embeddings:
    def __init__(
        self,
        model_name: EmbeddingModelName,
        embed_vocab: List[str] | None = None,
        standard_concept: bool = False,
        top_k: int = 5,
    ) -> None:

The main class for interacting with embeddings and vector search functionality. This class serves as an interface between the embeddings table of the OMOP-CDM database and the Haystack components pipeline.

Parameters

model_name: The embedding model to use
embed_vocab: Optional list of vocabulary IDs to filter results
standard_concept: Whether to only return standard concepts
top_k: Maximum number of results to return

`get_embedder`

def get_embedder(self) -> FastembedTextEmbedder:

Creates and returns a FastembedTextEmbedder instance configured with the selected model. Get an embedder for queries in LLM pipelines

Returns

A configured FastembedTextEmbedder instance ready to generate embeddings

`get_retriever`

def get_retriever(self) -> PGVectorQuery:

Creates and returns a PGVectorQuery instance for performing database searches.

Returns

A configured PGVectorQuery instance

`search`

def search(
	self, 
	query: List[str]
) -> List[List[Dict[str, Any]]]:

Search the attached vector database with a list of informal medications

Parameters

query: List[str] A list of informal medication names

Returns

List[List[Dict[str, Any]]] For each medication in the query, the result of searching the vector database. This is a nested list where each list contains dictionaries with:

concept_id: ID of the matching concept
concept: Text of the matching concept
score: Similarity score

Functions

`get_embedding_model`

def get_embedding_model(
	name: EmbeddingModelName
) -> EmbeddingModel:

Collects the details of an embedding model when given its name

Parameters

name: EmbeddingModelName The name of an embedding model we have the details for

Returns

EmbeddingModel An EmbeddingModel object containing the name and the details used

Components Models