components.embeddings

This page documents the embeddings.py module, which provides functionality for embedding text and performing vector searches in a PostgreSQL database with pgvector.

Classes

EmbeddingModelName

class EmbeddingModelName()

This class enumerates the embedding models for which download details are available.

| Model | Version | Dimensions | Summary |
|---|---|---|---|
| Bidirectional Gated Encoder | Small | 384 | Efficient sentence embeddings for semantic similarity tasks. |
| Sentence-BERT | MiniLM | 384 | Compact, optimized for sentence embeddings and semantic tasks. |
| Generalizable T5 Retrieval | Base | 768 | Dual encoder for scalable, general-purpose semantic search. |
| Generalizable T5 Retrieval | Large | 1024 | Enhanced version of GTR-T5 Base, ideal for large-scale tasks. |
| Embedding Models for Search | Base | 768 | Dense multilingual embeddings for semantic search and retrieval. |
| Embedding Models for Search | Large | 1024 | Larger model offering improved cross-domain retrieval performance. |
| DistilBERT | Base Uncased | 768 | Smaller, faster BERT variant retaining high performance. |
| DistilUSE | Base Multilingual | 512 | Efficient multilingual embeddings for cross-lingual tasks. |
| Contriever | Contriever | 768 | Unsupervised dense retrieval model for zero-shot semantic search. |

EmbeddingModelInfo

class EmbeddingModelInfo()

A simple class that holds the information for an embedding model.

EmbeddingModel

class EmbeddingModel()

A class that pairs the name of an embedding model with the details required to download and use it.

PGVectorQuery

@component
class PGVectorQuery:
    def __init__(
        self,
        embed_vocab: List[str] | None = None,
        standard_concept: bool = False,
        top_k: int = 5,
    ) -> None:

A Haystack component for retrieving concept information using embeddings stored in a PostgreSQL database with pgvector.

Parameters

  • embed_vocab: Optional list of vocabulary IDs to filter results
  • standard_concept: Whether to only return standard concepts
  • top_k: Maximum number of results to return

run

@component.output_types(documents=List[Document])
def run(self, query_embedding: List[float])

Performs a vector similarity search in the database.

Parameters

  • query_embedding: List of floats representing the query embedding

Returns

A dictionary with the key documents, whose value is a list of Haystack Document objects with the following fields:

  • id: Concept ID
  • content: Concept text
  • score: Similarity score
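
As a rough illustration of the kind of query such a component might issue, the sketch below builds a parameterized pgvector similarity search. It is not the module's actual implementation: the function name build_similarity_query, the embeddings table and its embedding column, the choice of cosine distance, and the 'S' flag filter are all assumptions made for this example. The concept table columns follow OMOP-CDM naming.

```python
from typing import Any, Dict, List, Optional, Tuple


def build_similarity_query(
    query_embedding: List[float],
    embed_vocab: Optional[List[str]] = None,
    standard_concept: bool = False,
    top_k: int = 5,
) -> Tuple[str, Dict[str, Any]]:
    """Build SQL and parameters for a pgvector search (hypothetical schema)."""
    # pgvector accepts a vector written as a bracketed, comma-separated literal.
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    params: Dict[str, Any] = {"vec": vector_literal, "top_k": top_k}

    filters = []
    if embed_vocab is not None:
        # Restrict results to the requested vocabularies.
        filters.append("c.vocabulary_id = ANY(%(vocab)s)")
        params["vocab"] = embed_vocab
    if standard_concept:
        # Assumption: standard concepts are flagged with 'S', as in OMOP-CDM.
        filters.append("c.standard_concept = 'S'")
    where = ("WHERE " + " AND ".join(filters)) if filters else ""

    # `<=>` is pgvector's cosine-distance operator; 1 - distance is used
    # here as the similarity score (an assumption about the scoring).
    sql = f"""
        SELECT c.concept_id AS id,
               c.concept_name AS content,
               1 - (e.embedding <=> %(vec)s::vector) AS score
        FROM embeddings e
        JOIN concept c ON c.concept_id = e.concept_id
        {where}
        ORDER BY e.embedding <=> %(vec)s::vector
        LIMIT %(top_k)s
    """
    return sql, params
```

The returned pair could be passed to a PostgreSQL driver that supports named placeholders, for example cursor.execute(sql, params) in psycopg. Keeping the values in params rather than formatting them into the SQL string avoids injection through vocabulary IDs.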

Embeddings

class Embeddings:
    def __init__(
        self,
        model_name: EmbeddingModelName,
        embed_vocab: List[str] | None = None,
        standard_concept: bool = False,
        top_k: int = 5,
    ) -> None:

The main class for interacting with embeddings and vector search functionality. This class serves as an interface between the embeddings table of the OMOP-CDM database and the Haystack components pipeline.

Parameters

  • model_name: The embedding model to use
  • embed_vocab: Optional list of vocabulary IDs to filter results
  • standard_concept: Whether to only return standard concepts
  • top_k: Maximum number of results to return

get_embedder

def get_embedder(self) -> FastembedTextEmbedder:

Creates and returns a FastembedTextEmbedder instance configured with the selected model, for embedding queries in LLM pipelines.

Returns

A configured FastembedTextEmbedder instance ready to generate embeddings

get_retriever

def get_retriever(self) -> PGVectorQuery:

Creates and returns a PGVectorQuery instance for performing database searches.

Returns

A configured PGVectorQuery instance

search

def search(
    self,
    query: List[str]
) -> List[List[Dict[str, Any]]]:

Searches the attached vector database with a list of informal medication names.

Parameters

  • query (List[str]): A list of informal medication names

Returns

List[List[Dict[str, Any]]]: For each medication name in the query, the results of searching the vector database. This is a nested list in which each inner list contains dictionaries with:

  • concept_id: ID of the matching concept
  • concept: Text of the matching concept
  • score: Similarity score
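
To show how the nested return value might be consumed, here is a minimal sketch that picks the top-scoring match for each query. The helper best_matches and the sample data are hypothetical. It assumes that the outer list is in the same order as the input queries and that a higher score means a closer match.

```python
from typing import Any, Dict, List, Optional


def best_matches(
    queries: List[str],
    results: List[List[Dict[str, Any]]],
) -> Dict[str, Optional[Dict[str, Any]]]:
    """Map each query to its highest-scoring result, or None if it has none."""
    best: Dict[str, Optional[Dict[str, Any]]] = {}
    for name, hits in zip(queries, results):
        # Assumption: a larger score means a better match.
        best[name] = max(hits, key=lambda h: h["score"]) if hits else None
    return best


# Illustrative data in the documented shape; the IDs and scores are made up.
queries = ["paracetamol", "unknown drug"]
results = [
    [
        {"concept_id": 1, "concept": "Acetaminophen", "score": 0.91},
        {"concept_id": 2, "concept": "Acetaminophen 500 MG Oral Tablet", "score": 0.84},
    ],
    [],
]
print(best_matches(queries, results))
```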

Functions

get_embedding_model

def get_embedding_model(
    name: EmbeddingModelName
) -> EmbeddingModel:

Returns the details of an embedding model, given its name.

Parameters

  • name (EmbeddingModelName): The name of an embedding model we have the details for

Returns

EmbeddingModel: An EmbeddingModel object containing the name and the details required to download and use the model
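
The name-to-details lookup described here can be pictured with a small stand-alone sketch. None of it is taken from the module: ModelName, ModelInfo, Model and lookup_model are hypothetical stand-ins for EmbeddingModelName, EmbeddingModelInfo, EmbeddingModel and get_embedding_model, and the enum members, fields and model paths are assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ModelName(Enum):
    """Stand-in for EmbeddingModelName; the members are assumptions."""
    BGE_SMALL = "bge-small"
    MINILM = "minilm"


@dataclass(frozen=True)
class ModelInfo:
    """Stand-in for EmbeddingModelInfo; the fields are assumptions."""
    path: str        # identifier used to download the model
    dimensions: int  # length of the vectors the model produces


@dataclass(frozen=True)
class Model:
    """Stand-in for EmbeddingModel: a name paired with its details."""
    name: ModelName
    info: ModelInfo


# Hypothetical registry mapping each name to its details.
_REGISTRY = {
    ModelName.BGE_SMALL: ModelInfo(path="BAAI/bge-small-en-v1.5", dimensions=384),
    ModelName.MINILM: ModelInfo(
        path="sentence-transformers/all-MiniLM-L6-v2", dimensions=384
    ),
}


def lookup_model(name: ModelName) -> Model:
    """Return the model details registered under the given name."""
    return Model(name=name, info=_REGISTRY[name])
```

Keying the registry on an enum rather than on free-form strings means an unsupported model name fails at the point where the enum member is referenced, not later at download time.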