components.embeddings
This page documents the embeddings.py module, which provides functionality for embedding text and performing vector searches in a PostgreSQL database with pgvector.
Classes
EmbeddingModelName
class EmbeddingModelName()
An enumeration of the embedding models for which download details are available.
Model | Version | Dimensions | Summary |
---|---|---|---|
Bidirectional Gated Encoder | Small | 384 | Efficient sentence embeddings for semantic similarity tasks. |
Sentence-BERT | MiniLM | 384 | Compact, optimized for sentence embeddings and semantic tasks. |
Generalizable T5 Retrieval | Base | 768 | Dual encoder for scalable, general-purpose semantic search. |
Generalizable T5 Retrieval | Large | 1024 | Enhanced version of GTR-T5 Base, ideal for large-scale tasks. |
Embedding Models for Search | Base | 768 | Dense multilingual embeddings for semantic search and retrieval. |
Embedding Models for Search | Large | 1024 | Larger model offering improved cross-domain retrieval performance. |
DistilBERT | Base Uncased | 768 | Smaller, faster BERT variant retaining high performance. |
DistilUSE | Base Multilingual | 512 | Efficient multilingual embeddings for cross-lingual tasks. |
Contriever | Contriever | 768 | Unsupervised dense retrieval model for zero-shot semantic search. |
EmbeddingModelInfo
class EmbeddingModelInfo()
A simple class that holds the information for an embedding model.
EmbeddingModel
class EmbeddingModel()
A class that pairs the name of an embedding model with the details required to download and use it.
PGVectorQuery
@component
class PGVectorQuery:
def __init__(
self,
embed_vocab: List[str] | None = None,
standard_concept: bool = False,
top_k: int = 5,
) -> None:
A Haystack component for retrieving concept information using embeddings stored in a PostgreSQL database with pgvector.
Parameters
embed_vocab
: Optional list of vocabulary IDs to filter results
standard_concept
: Whether to only return standard concepts
top_k
: Maximum number of results to return
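To illustrate how these three parameters could shape a pgvector query, here is a minimal sketch, not the module's actual implementation. The function name `build_pgvector_query`, the `embeddings` table and its columns are assumptions; the `concept` columns follow the OMOP-CDM vocabulary tables, and `<=>` is pgvector's cosine distance operator. The caller would still need to supply the `query` parameter (the query embedding) when executing the statement.

```python
from typing import List, Optional, Tuple


def build_pgvector_query(
    embed_vocab: Optional[List[str]] = None,
    standard_concept: bool = False,
    top_k: int = 5,
) -> Tuple[str, dict]:
    """Build a parameterised similarity query for a hypothetical schema."""
    # Assumed tables: embeddings(concept_id, embedding) joined to the OMOP
    # concept table. The real module may use different names.
    conditions = []
    params = {"top_k": top_k}
    if embed_vocab:
        # Restrict matches to the requested vocabularies.
        conditions.append("c.vocabulary_id = ANY(%(vocab)s)")
        params["vocab"] = list(embed_vocab)
    if standard_concept:
        # OMOP marks standard concepts with 'S' in concept.standard_concept.
        conditions.append("c.standard_concept = 'S'")
    where = f"WHERE {' AND '.join(conditions)} " if conditions else ""
    sql = (
        "SELECT c.concept_id, c.concept_name, "
        "e.embedding <=> %(query)s::vector AS distance "
        "FROM embeddings e JOIN concept c ON c.concept_id = e.concept_id "
        f"{where}"
        "ORDER BY distance LIMIT %(top_k)s"
    )
    return sql, params
```

Keeping the filters as bound parameters rather than interpolating them into the SQL string avoids injection problems if vocabulary IDs come from user input.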
run
@component.output_types(documents=List[Document])
def run(self, query_embedding: List[float])
Performs a vector similarity search in the database.
Parameters
query_embedding
: List of floats representing the query embedding
Returns
Dictionary with key documents containing a list of Haystack Document objects with:
id
: Concept ID
content
: Concept text
score
: Similarity score
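For readers unfamiliar with vector similarity search, the following toy sketch ranks stored vectors against a query embedding and keeps the best top_k. It runs in memory with the standard library only and does not use Haystack or PostgreSQL. The function names and the two-dimensional vectors are illustrative assumptions, and cosine similarity is assumed as the metric; the component itself may use a different one.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_matches(
    query_embedding: List[float],
    stored: Dict[str, List[float]],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the top_k stored items, most similar first."""
    scored = [
        (name, cosine_similarity(query_embedding, vector))
        for name, vector in stored.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real concept embeddings.
toy_store = {
    "concept a": [1.0, 0.0],
    "concept b": [0.0, 1.0],
    "concept c": [1.0, 1.0],
}
ranked = top_k_matches([1.0, 0.1], toy_store, top_k=2)
```

A database-backed search delegates the same ranking to pgvector, which can use an index rather than scoring every row.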
Embeddings
class Embeddings:
def __init__(
self,
model_name: EmbeddingModelName,
embed_vocab: List[str] | None = None,
standard_concept: bool = False,
top_k: int = 5,
) -> None:
The main class for interacting with embeddings and vector search functionality. This class serves as an interface between the embeddings table of the OMOP-CDM database and the Haystack components pipeline.
Parameters
model_name
: The embedding model to use
embed_vocab
: Optional list of vocabulary IDs to filter results
standard_concept
: Whether to only return standard concepts
top_k
: Maximum number of results to return
get_embedder
def get_embedder(self) -> FastembedTextEmbedder:
Creates and returns a FastembedTextEmbedder instance configured with the selected model, for embedding queries in LLM pipelines.
Returns
A configured FastembedTextEmbedder instance ready to generate embeddings
get_retriever
def get_retriever(self) -> PGVectorQuery:
Creates and returns a PGVectorQuery instance for performing database searches.
Returns
A configured PGVectorQuery instance
search
def search(
self,
query: List[str]
) -> List[List[Dict[str, Any]]]:
Searches the attached vector database for each informal medication name in a list.
Parameters
query: List[str]
A list of informal medication names
Returns
List[List[Dict[str, Any]]]
For each medication in the query, the result of searching the vector database. This is a nested list where each inner list contains dictionaries with:
concept_id
: ID of the matching concept
concept
: Text of the matching concept
score
: Similarity score
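Because the result is a nested list aligned with the input queries, a common follow-up is to pair each query with its best hit. The sketch below shows one way to do that. The helper `best_matches` is hypothetical, the sample data is invented purely to show the documented shape, and it assumes a higher score means a closer match.

```python
from typing import Any, Dict, List, Optional


def best_matches(
    queries: List[str],
    results: List[List[Dict[str, Any]]],
) -> Dict[str, Optional[Dict[str, Any]]]:
    """Pair each query with its highest-scoring hit, or None if it has none."""
    best: Dict[str, Optional[Dict[str, Any]]] = {}
    for query, hits in zip(queries, results):
        best[query] = max(hits, key=lambda hit: hit["score"]) if hits else None
    return best


# Placeholder data in the documented shape; IDs, names and scores are invented.
example_queries = ["drug a", "drug b"]
example_results = [
    [
        {"concept_id": 1, "concept": "Drug A 10 MG", "score": 0.91},
        {"concept_id": 2, "concept": "Drug A 20 MG", "score": 0.87},
    ],
    [],  # a query with no matches
]
top = best_matches(example_queries, example_results)
```

Returning None for an empty inner list keeps the output aligned with the input, so callers can tell "no match" apart from a missing query.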
Functions
get_embedding_model
def get_embedding_model(
name: EmbeddingModelName
) -> EmbeddingModel:
Collects the details of an embedding model when given its name
Parameters
name: EmbeddingModelName
The name of an embedding model we have the details for
Returns
EmbeddingModel
An EmbeddingModel object containing the name and the details used
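The relationship between the enumeration, the info class and this lookup function can be sketched with an Enum, two dataclasses and a dictionary. Everything here is a hypothetical stand-in: the class names, member names, field names and model paths are assumptions rather than the module's real definitions. Only the dimensions are taken from the table at the top of this page.

```python
from dataclasses import dataclass
from enum import Enum


class ModelName(Enum):
    """Hypothetical stand-in for EmbeddingModelName."""
    BGE_SMALL = "BGESMALL"
    MINILM = "MINILM"


@dataclass(frozen=True)
class ModelInfo:
    """Hypothetical stand-in for EmbeddingModelInfo."""
    path: str        # where the model would be downloaded from (placeholder)
    dimensions: int  # length of the vectors the model produces


@dataclass(frozen=True)
class Model:
    """Hypothetical stand-in for EmbeddingModel: a name plus its details."""
    name: ModelName
    info: ModelInfo


# Placeholder paths; dimensions match the table of supported models.
_REGISTRY = {
    ModelName.BGE_SMALL: ModelInfo(path="example/placeholder-bge-small", dimensions=384),
    ModelName.MINILM: ModelInfo(path="example/placeholder-minilm", dimensions=384),
}


def lookup_model(name: ModelName) -> Model:
    """Return the details registered for a model name."""
    return Model(name=name, info=_REGISTRY[name])


model = lookup_model(ModelName.BGE_SMALL)
```

Keying the registry on an Enum means an unsupported model name fails when the Enum member is constructed, before any download is attempted.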