Protocols
An embedding pipeline extracts concepts from some source, encodes them as embeddings, and loads the embeddings into some output.
To manage this, there are protocols that should make it easier to integrate new sources, embedders, and sinks.
Classes
EmbeddingPipeline
class EmbeddingPipeline(Protocol):
    reader: ConceptReader
    embedder: ConceptEmbedder
    store: EmbeddingStore
Protocol for a pipeline that reads, encodes, and loads concepts.
ConceptReaders have to expose two methods: load_concepts, which returns a list of Concepts, and load_concept_batch, which returns a Generator that yields lists of Concepts. ConceptEmbedders have to expose a method, embed_concepts, which takes a list of Concepts and returns a list of EmbeddedConcepts. EmbeddingStores have to expose a method, save, which takes a list of EmbeddedConcepts and doesn't return anything.
This means that an EmbeddingPipeline can use reader to fetch (a batch of) concepts, feed these through the embedder and use the store to save them.
Batched and complete pipelines are implemented in fetch_concept_batches.
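The flow described above can be sketched end to end. This is a minimal illustration, not the package's own implementation: ListReader, LengthEmbedder, MemoryStore, and BatchedPipeline are hypothetical names, and the fields of Concept are assumed because its definition is not shown here. The classes satisfy the protocols structurally, so none of them needs to inherit from a protocol class.

```python
from dataclasses import dataclass
from typing import Generator


@dataclass
class Concept:
    # Assumed shape: the real Concept class is not documented in this section.
    concept_id: int
    concept_name: str


@dataclass
class EmbeddedConcept:
    concept_id: int
    concept_name: str
    embedding: list[float]


class ListReader:
    """Hypothetical reader that serves concepts from an in-memory list."""

    def __init__(self, concepts: list[Concept], batch_size: int = 2) -> None:
        self._concepts = concepts
        self._batch_size = batch_size

    def load_concepts(self) -> list[Concept]:
        return list(self._concepts)

    def load_concept_batch(self) -> Generator[list[Concept], None, None]:
        # Yield consecutive slices of at most _batch_size concepts.
        for start in range(0, len(self._concepts), self._batch_size):
            yield self._concepts[start:start + self._batch_size]


class LengthEmbedder:
    """Toy embedder: a one-dimensional vector holding the name length."""

    def embed_concepts(self, concepts: list[Concept]) -> list[EmbeddedConcept]:
        return [
            EmbeddedConcept(c.concept_id, c.concept_name, [float(len(c.concept_name))])
            for c in concepts
        ]


class MemoryStore:
    """Hypothetical store that keeps embeddings in a list."""

    def __init__(self) -> None:
        self.saved: list[EmbeddedConcept] = []

    def save(self, embeddings: list[EmbeddedConcept]) -> None:
        self.saved.extend(embeddings)


class BatchedPipeline:
    """Hypothetical pipeline that embeds and saves one batch at a time."""

    def __init__(self, reader: ListReader, embedder: LengthEmbedder, store: MemoryStore) -> None:
        self.reader = reader
        self.embedder = embedder
        self.store = store

    def run_pipeline(self) -> None:
        for batch in self.reader.load_concept_batch():
            self.store.save(self.embedder.embed_concepts(batch))


concepts = [Concept(1, "aspirin"), Concept(2, "ibuprofen"), Concept(3, "paracetamol")]
store = MemoryStore()
BatchedPipeline(ListReader(concepts), LengthEmbedder(), store).run_pipeline()
```

A complete (non-batched) variant would call load_concepts once and pass the whole list through embed_concepts and save.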
Methods
run_pipeline
run_pipeline() -> None
Run the embedding pipeline.
ConceptReader
class ConceptReader(Protocol):
    _batch_size: int
Protocol for a concept reader that can either take a full set of concepts or read in batches.
Methods
load_concept_batch
load_concept_batch() -> Generator[list[Concept]]
Return a Generator to iterate through loaded concepts in batches.
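One reason to return a Generator is that the source never has to be held in memory all at once. The sketch below shows one way a reader might batch rows lazily with itertools.islice. CsvConceptReader, the CSV column names, and the fields of Concept are all assumptions made for illustration.

```python
import csv
import io
from dataclasses import dataclass
from itertools import islice
from typing import Generator, Iterator


@dataclass
class Concept:
    # Assumed shape: the real Concept class is not documented in this section.
    concept_id: int
    concept_name: str


class CsvConceptReader:
    """Hypothetical reader that parses concepts from CSV text."""

    def __init__(self, text: str, batch_size: int) -> None:
        self._text = text
        self._batch_size = batch_size

    def _rows(self) -> Iterator[Concept]:
        # Parse one row at a time so callers can stop early.
        for row in csv.DictReader(io.StringIO(self._text)):
            yield Concept(int(row["concept_id"]), row["concept_name"])

    def load_concept_batch(self) -> Generator[list[Concept], None, None]:
        rows = self._rows()
        # islice pulls at most _batch_size rows; an empty list ends the loop.
        while batch := list(islice(rows, self._batch_size)):
            yield batch

    def load_concepts(self) -> list[Concept]:
        return list(self._rows())


reader = CsvConceptReader(
    "concept_id,concept_name\n1,aspirin\n2,ibuprofen\n3,paracetamol\n",
    batch_size=2,
)
batch_sizes = [len(batch) for batch in reader.load_concept_batch()]
```

With three rows and a batch size of two, the final batch is shorter than the others, so consumers should not assume every batch is full.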
load_concepts
load_concepts() -> list[Concept]
ConceptEmbedder
class ConceptEmbedder(Protocol)
Protocol for a thing that can take concepts and produce embeddings.
Methods
embed_concepts
embed_concepts(concepts: list[Concept]) -> list[EmbeddedConcept]
Take a list of concepts and encode them into embeddings.
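As an illustration of the embed_concepts contract, here is a toy embedder that derives a small deterministic vector from a hash of the concept name. It stands in for a real embedding model and carries no semantic meaning. HashEmbedder and the fields of Concept are assumptions, not part of the documented package.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Concept:
    # Assumed shape: the real Concept class is not documented in this section.
    concept_id: int
    concept_name: str


@dataclass
class EmbeddedConcept:
    concept_id: int
    concept_name: str
    embedding: list[float]


class HashEmbedder:
    """Hypothetical stand-in for a model-backed embedder."""

    def __init__(self, dim: int = 4) -> None:
        # A SHA-256 digest has 32 bytes, so dim must not exceed 32.
        if not 1 <= dim <= 32:
            raise ValueError("dim must be between 1 and 32")
        self._dim = dim

    def embed_concepts(self, concepts: list[Concept]) -> list[EmbeddedConcept]:
        embedded = []
        for concept in concepts:
            digest = hashlib.sha256(concept.concept_name.encode("utf-8")).digest()
            # Scale the first dim bytes into the range [0, 1].
            vector = [byte / 255.0 for byte in digest[: self._dim]]
            embedded.append(
                EmbeddedConcept(concept.concept_id, concept.concept_name, vector)
            )
        return embedded


embedder = HashEmbedder(dim=4)
embedded = embedder.embed_concepts([Concept(1, "aspirin"), Concept(2, "ibuprofen")])
```

The output list keeps the input order and carries the identifiers through, which lets a store save each embedding without consulting the reader again.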
EmbeddingStore
class EmbeddingStore(Protocol)
Protocol for taking embeddings and storing them somewhere.
Methods
save
save(embeddings: list[EmbeddedConcept]) -> None
Take a list of embeddings and save them somewhere.
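A store only needs a save method, so the destination can be anything from a vector database to a flat file. The sketch below appends each embedding as one JSON line. JsonLinesStore and the file name are hypothetical and chosen only to keep the example self-contained.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class EmbeddedConcept:
    concept_id: int
    concept_name: str
    embedding: list[float]


class JsonLinesStore:
    """Hypothetical store that appends embeddings to a JSON Lines file."""

    def __init__(self, path: Path) -> None:
        self._path = path

    def save(self, embeddings: list[EmbeddedConcept]) -> None:
        # Append mode lets repeated calls (one per batch) accumulate.
        with self._path.open("a", encoding="utf-8") as handle:
            for item in embeddings:
                handle.write(json.dumps(asdict(item)) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "embeddings.jsonl"
    store = JsonLinesStore(target)
    store.save([EmbeddedConcept(1, "aspirin", [0.1, 0.2])])
    store.save([EmbeddedConcept(2, "ibuprofen", [0.3, 0.4])])
    lines = target.read_text(encoding="utf-8").splitlines()
```

Because save may be called once per batch, an implementation should append rather than overwrite, as this one does.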
EmbeddedConcept
class EmbeddedConcept:
    concept_id: int
    concept_name: str
    embedding: list[float]
Dataclass to hold identifiers for a concept and its embedding.