components.embeddings

EmbeddingModelName

class EmbeddingModelName()

This class enumerates the embedding models for which download details are available.

| Model | Version | Dimensions | Summary |
| --- | --- | --- | --- |
| Bidirectional Gated Encoder | Small | 384 | Efficient sentence embeddings for semantic similarity tasks. |
| Sentence-BERT | MiniLM | 384 | Compact, optimized for sentence embeddings and semantic tasks. |
| Generalizable T5 Retrieval | Base | 768 | Dual encoder for scalable, general-purpose semantic search. |
| Generalizable T5 Retrieval | Large | 1024 | Enhanced version of GTR-T5 Base, ideal for large-scale tasks. |
| Embedding Models for Search | Base | 768 | Dense multilingual embeddings for semantic search and retrieval. |
| Embedding Models for Search | Large | 1024 | Larger model offering improved cross-domain retrieval performance. |
| DistilBERT | Base Uncased | 768 | Smaller, faster BERT variant retaining high performance. |
| DistilUSE | Base Multilingual | 512 | Efficient multilingual embeddings for cross-lingual tasks. |
| Contriever | Contriever | 768 | Unsupervised dense retrieval model for zero-shot semantic search. |

EmbeddingModelInfo

class EmbeddingModelInfo()

A simple class that holds the information for an embedding model.

EmbeddingModel

class EmbeddingModel()

A class that pairs the name of an embedding model with the details required to download and use it.

get_embedding_model

def get_embedding_model(
	name: EmbeddingModelName
)

Collects the details of an embedding model, given its name.

Parameters

name: EmbeddingModelName The name of an embedding model we have the details for

Returns

EmbeddingModel An EmbeddingModel object containing the name and the details required to download and use the model
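
The name-to-details lookup described above can be sketched as follows. This is a minimal illustration and not the module's actual code: SketchModelName, SketchModelInfo, lookup_model and the enum member names are all hypothetical, and only the versions and dimensions are taken from the table in this page.

```python
from dataclasses import dataclass
from enum import Enum


class SketchModelName(Enum):
    # Hypothetical member names; the values reuse model labels from the table.
    BGE_SMALL = "Bidirectional Gated Encoder"
    SBERT_MINI = "Sentence-BERT"
    GTR_T5_BASE = "Generalizable T5 Retrieval"


@dataclass(frozen=True)
class SketchModelInfo:
    # Only fields that appear in the table; the real class may hold more.
    version: str
    dimensions: int


# Hypothetical registry mapping each name to its stored details.
_DETAILS = {
    SketchModelName.BGE_SMALL: SketchModelInfo("Small", 384),
    SketchModelName.SBERT_MINI: SketchModelInfo("MiniLM", 384),
    SketchModelName.GTR_T5_BASE: SketchModelInfo("Base", 768),
}


def lookup_model(name: SketchModelName) -> SketchModelInfo:
    """Return the stored details for a model name, or raise if none exist."""
    try:
        return _DETAILS[name]
    except KeyError:
        raise ValueError(f"No details stored for {name!r}") from None


info = lookup_model(SketchModelName.GTR_T5_BASE)
print(info.version, info.dimensions)
```

Using an enum for the name means a caller cannot request a model that has no stored details without the mistake surfacing at the lookup.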

Embeddings

class Embeddings(
	embeddings_path: str
	force_rebuild: bool
	embed_vocab: List[str]
	model_name: EmbeddingModelName
	search_kwargs: dict
)

This class builds or loads a vector database of concept names. The database can then be used for vector search.

Methods

__init__

method __init__(
	embeddings_path: str
	force_rebuild: bool
	embed_vocab: List[str]
	model_name: EmbeddingModelName
	search_kwargs: dict
)

Initialises the connection to an embeddings database.

Parameters

embeddings_path: str The path to the embeddings database. If no database is found at this path, one is built from concepts fetched from the OMOP database, which takes a long time.

force_rebuild: bool If true, the embeddings database will be rebuilt.

embed_vocab: List[str] A list of OMOP vocabulary_ids. If the embeddings database is built, these will be the vocabularies used in the OMOP query.

model_name: EmbeddingModelName The name of the model used to create embeddings.

search_kwargs: dict kwargs for vector search.
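
The interaction between embeddings_path and force_rebuild described in the parameters above can be sketched as a single decision. This is only an illustration of the documented behaviour: should_build is a hypothetical helper, and the real class may detect an existing database differently from a plain path check.

```python
from pathlib import Path


def should_build(embeddings_path: str, force_rebuild: bool) -> bool:
    """Decide whether the embeddings database needs building.

    Hypothetical helper: build when a rebuild is forced, or when
    nothing exists yet at the given path.
    """
    return force_rebuild or not Path(embeddings_path).exists()


# The current directory exists, so nothing is built unless forced.
print(should_build(".", force_rebuild=False))
print(should_build(".", force_rebuild=True))
```

Because building takes a long time, force_rebuild is best left false unless the set of vocabularies or the embedding model has changed.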

_build_embeddings

def _build_embeddings()

Build a vector database of embeddings

_load_embeddings

def _load_embeddings()

If available, load a vector database of concept embeddings

get_embedder

def get_embedder()

Get an embedder for queries in LLM pipelines

Returns

FastembedTextEmbedder

get_retriever

def get_retriever()

Get a retriever for LLM pipelines

Returns

QdrantEmbeddingRetriever

search

def search(
	query: List[str]
)

Search the attached vector database with a list of informal medication names.

Parameters

query: List[str] A list of informal medication names

Returns

List[List[Dict[str, Any]]] For each medication name in the query, the results of searching the vector database
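
Since the return value holds one list of results per query item, a caller will typically pair the two up. The sketch below shows one way to do that with stand-in data rather than a real database. It assumes the outer list is in the same order as the query and that each inner list is ordered best match first; first_hits and the "content" and "score" keys are hypothetical and are not confirmed by this page.

```python
from typing import Any, Dict, List, Optional


def first_hits(
    queries: List[str],
    results: List[List[Dict[str, Any]]],
) -> Dict[str, Optional[Dict[str, Any]]]:
    """Map each query string to its first result, or None if it has none."""
    if len(queries) != len(results):
        raise ValueError("Expected one result list per query item")
    return {q: (hits[0] if hits else None) for q, hits in zip(queries, results)}


# Stand-in data shaped like List[List[Dict[str, Any]]]; the keys are hypothetical.
queries = ["paracetamol", "unknownium"]
results = [
    [{"content": "Acetaminophen", "score": 0.9}],
    [],
]

for name, hit in first_hits(queries, results).items():
    print(name, "->", hit["content"] if hit else "no match")
```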