components.embeddings
This page documents the embeddings.py module, which provides functionality for embedding text and performing vector searches in a PostgreSQL database with pgvector.
Classes
EmbeddingModelName
class EmbeddingModelName()
This class enumerates the embedding models for which download details are available.
| Model | Version | Dimensions | Summary |
|---|---|---|---|
| Bidirectional Gated Encoder | Small | 384 | Efficient sentence embeddings for semantic similarity tasks. |
| Sentence-BERT | MiniLM | 384 | Compact, optimized for sentence embeddings and semantic tasks. |
| Generalizable T5 Retrieval | Base | 768 | Dual encoder for scalable, general-purpose semantic search. |
| Generalizable T5 Retrieval | Large | 1024 | Enhanced version of GTR-T5 Base, ideal for large-scale tasks. |
| Embedding Models for Search | Base | 768 | Dense multilingual embeddings for semantic search and retrieval. |
| Embedding Models for Search | Large | 1024 | Larger model offering improved cross-domain retrieval performance. |
| DistilBERT | Base Uncased | 768 | Smaller, faster BERT variant retaining high performance. |
| DistilUSE | Base Multilingual | 512 | Efficient multilingual embeddings for cross-lingual tasks. |
| Contriever | Contriever | 768 | Unsupervised dense retrieval model for zero-shot semantic search. |
EmbeddingModelInfo
class EmbeddingModelInfo()
A simple class that holds the information for an embedding model.
EmbeddingModel
class EmbeddingModel()
A class that pairs the name of an embedding model with the details required to download and use it.
PGVectorQuery
@component
class PGVectorQuery:
    def __init__(
        self,
        embed_vocab: List[str] | None = None,
        standard_concept: bool = False,
        top_k: int = 5,
    ) -> None:
A Haystack component for retrieving concept information using embeddings in a PostgreSQL database with pgvector.
Parameters
- embed_vocab: Optional list of vocabulary IDs to filter results
- standard_concept: Whether to only return standard concepts
- top_k: Maximum number of results to return
run
@component.output_types(documents=List[Document])
def run(self, query_embedding: List[float])
Performs a vector similarity search in the database.
Parameters
query_embedding: List of floats representing the query embedding
Returns
Dictionary with key documents containing a list of Haystack Document objects with:
- id: Concept ID
- content: Concept text
- score: Similarity score
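To make the retrieval step concrete, here is a minimal in-memory sketch of a top-k similarity search. It is not the component's implementation: the real search runs inside PostgreSQL via pgvector, and the metric used there is not stated on this page, so cosine similarity is an assumption. The helper names (`cosine_similarity`, `top_k_matches`) and the toy concepts are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_matches(
    query_embedding: List[float],
    stored: Dict[int, Tuple[str, List[float]]],
    top_k: int = 5,
) -> List[dict]:
    """Score every stored concept against the query and keep the best top_k."""
    scored = [
        {
            "id": concept_id,
            "content": text,
            "score": cosine_similarity(query_embedding, vector),
        }
        for concept_id, (text, vector) in stored.items()
    ]
    # Assumption: a higher score means a closer match.
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional embeddings with made-up IDs; real models in the table
# above produce vectors of 384 to 1024 dimensions.
toy_concepts = {
    1: ("paracetamol", [1.0, 0.0]),
    2: ("ibuprofen", [0.0, 1.0]),
    3: ("acetaminophen", [0.9, 0.1]),
}
results = top_k_matches([1.0, 0.0], toy_concepts, top_k=2)
```

Each returned dictionary carries the same three fields as the Document objects described above (id, content, score), so the sketch shows what `top_k` limits and how results are ordered.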
Embeddings
class Embeddings:
    def __init__(
        self,
        model_name: EmbeddingModelName,
        embed_vocab: List[str] | None = None,
        standard_concept: bool = False,
        top_k: int = 5,
    ) -> None:
The main class for interacting with embeddings and vector search functionality. This class serves as an interface between the embeddings table of the OMOP-CDM database and the Haystack components pipeline.
Parameters
- model_name: The embedding model to use
- embed_vocab: Optional list of vocabulary IDs to filter results
- standard_concept: Whether to only return standard concepts
- top_k: Maximum number of results to return
get_embedder
def get_embedder(self) -> FastembedTextEmbedder:
Creates and returns a FastembedTextEmbedder instance configured with the selected model, for embedding queries in LLM pipelines.
Returns
- A configured FastembedTextEmbedder instance ready to generate embeddings
get_retriever
def get_retriever(self) -> PGVectorQuery:
Creates and returns a PGVectorQuery instance for performing database searches.
Returns
A configured PGVectorQuery instance
search
def search(
    self,
    query: List[str]
) -> List[List[Dict[str, Any]]]:
Searches the attached vector database with a list of informal medication names.
Parameters
query: List[str]
A list of informal medication names
Returns
List[List[Dict[str, Any]]]
For each medication in the query, the result of searching the vector database. This is a nested list where each list contains dictionaries with:
- concept_id: ID of the matching concept
- concept: Text of the matching concept
- score: Similarity score
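The nested return value can be awkward to read at first, so the following sketch shows one way to pair each query with its best hit. The sample data is invented purely to mirror the documented shape, `best_matches` is a hypothetical helper rather than part of the module, and it assumes that a higher score means a closer match.

```python
from typing import Any, Dict, List, Optional


def best_matches(
    queries: List[str],
    results: List[List[Dict[str, Any]]],
) -> Dict[str, Optional[Dict[str, Any]]]:
    """Map each query string to its highest-scoring hit, or None if it has no hits."""
    best: Dict[str, Optional[Dict[str, Any]]] = {}
    # The outer list is aligned with the query list: results[i] belongs to queries[i].
    for query, hits in zip(queries, results):
        best[query] = max(hits, key=lambda hit: hit["score"]) if hits else None
    return best


# Invented values that only imitate the structure returned by search().
queries = ["tylenol", "unknown drug"]
sample_results = [
    [
        {"concept_id": 101, "concept": "Acetaminophen", "score": 0.92},
        {"concept_id": 102, "concept": "Acetaminophen 500 MG", "score": 0.88},
    ],
    [],
]
top = best_matches(queries, sample_results)
```

Keeping the empty-list case explicit avoids a `ValueError` from `max` when a query returns no hits.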
Functions
get_embedding_model
def get_embedding_model(
    name: EmbeddingModelName
) -> EmbeddingModel:
Collects the details of an embedding model when given its name.
Parameters
name: EmbeddingModelName
The name of an embedding model we have the details for
Returns
EmbeddingModel
An EmbeddingModel object containing the name and the details used
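As an illustration of how a name-to-details lookup of this kind can be structured, the sketch below pairs an enum with a small registry. Every name in it (`ModelName`, `ModelInfo`, `Model`, `lookup_model`, the member names and the paths) is a hypothetical stand-in, not the module's actual definitions, and the fields of the real EmbeddingModelInfo may differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class ModelName(Enum):
    """Hypothetical stand-in for EmbeddingModelName."""
    SMALL_EXAMPLE = "small-example"
    LARGE_EXAMPLE = "large-example"


@dataclass(frozen=True)
class ModelInfo:
    """Hypothetical stand-in for EmbeddingModelInfo; the fields are assumed."""
    path: str        # where the model would be downloaded from (placeholder)
    dimensions: int  # length of the vectors the model produces


@dataclass(frozen=True)
class Model:
    """Hypothetical stand-in for EmbeddingModel: a name plus its details."""
    name: ModelName
    info: ModelInfo


# Registry keyed by enum member, so only known models can be looked up.
_REGISTRY: Dict[ModelName, ModelInfo] = {
    ModelName.SMALL_EXAMPLE: ModelInfo(path="example-org/small-example", dimensions=384),
    ModelName.LARGE_EXAMPLE: ModelInfo(path="example-org/large-example", dimensions=1024),
}


def lookup_model(name: ModelName) -> Model:
    """Return the name together with its registered details."""
    return Model(name=name, info=_REGISTRY[name])


model = lookup_model(ModelName.SMALL_EXAMPLE)
```

Using an enum as the key means a typo in a model name fails at attribute access rather than at download time.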