Build embeddings

⚠️

If you run the tests with uv run pytest for the embeddings building code, there is a test that will overwrite the embeddings table in your attached OMOP-CDM database. Use the environment variable SKIP_DATABASE_TESTS=true to skip this test.

Lettuce requires a table of embeddings to read from for semantic search. If you already have a parquet file of embeddings for your vocabularies, you can load them into a new postgres OMOP-CDM database as configured in omop-lite. This module generates embeddings from a table of vocabularies and either loads them into a postgres database or writes them to a parquet file. To load the embeddings into a postgres database, that database must have the pgvector extension installed. The vocabularies can be read either from the postgres database you want to load embeddings into or from a tab-delimited file, as downloaded from Athena.

The text embedded for each concept is rendered from a Jinja2 template, so it can draw on the concept's attributes. The default template uses only the concept name. A simple example of what’s possible is:

| Template | Example result |
| --- | --- |
| {{concept_name}} | Conjunctival concretion |
| {{concept_name}}, a {{concept_class}} {{domain}} | Conjunctival concretion, a Disorder Condition |
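
To make the template behaviour concrete, here is a minimal sketch that renders the two templates above with the Jinja2 library directly. The `concept` dictionary and the `render_concept` helper are hypothetical names chosen for illustration; how build-embeddings passes concept attributes to the template internally is not shown here and may differ.

```python
from jinja2 import Template

# Hypothetical concept record; the keys mirror the attributes used in the
# example templates (concept_name, concept_class, domain).
concept = {
    "concept_name": "Conjunctival concretion",
    "concept_class": "Disorder",
    "domain": "Condition",
}


def render_concept(template_string: str, concept: dict) -> str:
    """Render one concept into the text that would be embedded."""
    return Template(template_string).render(**concept)


# The default template uses only the concept name.
default_text = render_concept("{{concept_name}}", concept)

# A richer template adds the concept class and domain.
rich_text = render_concept(
    "{{concept_name}}, a {{concept_class}} {{domain}}", concept
)

print(default_text)
print(rich_text)
```

The same template string could then be passed to the command through the --template argument described under Usage.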

Usage

If build-embeddings is installed, you can run it directly under that name. Otherwise, run it with uv run build-embeddings [ARGS]

| Argument | Type | Description |
| --- | --- | --- |
| --concept-source | postgres/csv | The source to use for concepts [required] |
| --embedding-model | TEXT | String to fetch a SentenceTransformer [default: BAAI/bge-small-en-v1.5] |
| --template | TEXT | String specification for a Jinja2 template for rendering a concept [default: {{concept_name}}] |
| --fetch-batch-size | INTEGER | Number of concepts to extract at once if using the database [default: 16384] |
| --embed-batch-size | INTEGER | Number of embeddings to generate at once if using the database [default: 512] |
| --db-load-method | replace/extend | How to load embeddings in the database. If ‘replace’, drops any existing embeddings table. Otherwise extends the table [default: extend] |
| --source-path | TEXT | Path for source csv if reading from file |
| --save-method | load_to_database/save_to_parquet | Whether to save the embeddings to a file or load them into your database (only if loading from a database) [default: save_to_parquet] |
| --output-path | TEXT | If saving to a parquet file, the path for output |
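
The two batch-size arguments can be thought of as nested loops: a large batch of concepts is fetched, then embedded in smaller chunks. The sketch below illustrates that idea only. It is an assumption about how the two options relate, not the module's actual code; `batched`, `fake_embed` and `embed_all` are hypothetical names, and `fake_embed` is a stand-in for a real embedding model.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_embed(texts: List[str]) -> List[List[float]]:
    """Placeholder for a model call: returns one tiny vector per text."""
    return [[float(len(text))] for text in texts]


def embed_all(
    concept_names: Iterable[str],
    fetch_batch_size: int = 16384,  # default of --fetch-batch-size
    embed_batch_size: int = 512,    # default of --embed-batch-size
) -> List[List[float]]:
    """Fetch concepts in large batches, then embed each batch in small chunks."""
    embeddings: List[List[float]] = []
    for fetched in batched(concept_names, fetch_batch_size):
        for chunk in batched(fetched, embed_batch_size):
            embeddings.extend(fake_embed(chunk))
    return embeddings


names = [f"concept {i}" for i in range(40000)]
vectors = embed_all(names)
print(len(vectors))
```

Under this reading, a larger fetch batch means fewer round trips to the database, while the embed batch size bounds how much text the model handles at once.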

You can show these arguments and their descriptions with uv run build-embeddings --help

To write a parquet file from a vocabulary csv:

uv run build-embeddings --concept-source csv --source-path VOCABULARY.csv --output-path embeddings.parquet

The protocols description gives the best overview of how this module works.
