This document dives a little deeper into how Simon ingestion works, and gives you some opportunities to customize it yourself. Please ensure that you are familiar with the contents of the Datastore API before reading this guide. In general, we recommend using the Datastore API instead of the custom API described here unless you have a specific reason to use it (parallelization, custom cleanup, etc.).
If you do, let's get started.
Low-level ingestion in Simon works in three steps:
- Parsing: turning the PDF/text/webpage you wish to ingest into a simon.ParsedDocument, which contains "chunks" of the text to ingest
- (optional) Cleaning: chunks in the simon.ParsedDocument get squished around as needed to ensure optimal search results
- Indexing: the cleaned documents are handed off to simon.bulk_index([list, of, parseddocuments], context)
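The shape of this three-step flow can be sketched as below. Note that `parse_document`, `clean_chunks`, and `bulk_index` here are simplified stand-ins, not Simon's real API: in real code you would use Simon's parsing tools and call `simon.bulk_index` with your context.

```python
# Sketch of the three-step ingestion flow with stand-in functions.
# The real calls would be Simon's parsing helpers, your own cleanup
# logic, and simon.bulk_index(documents, context).

def parse_document(text):
    """Stand-in parser: split raw text into rough chunks (step 1)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def clean_chunks(chunks):
    """Stand-in cleaner: drop empty or tiny fragments (step 2)."""
    return [c for c in chunks if len(c) > 3]

def bulk_index(docs, context=None):
    """Stand-in for simon.bulk_index: pretend to index, report a count."""
    return len(docs)

raw = "First idea.\n\nSecond idea.\n\n--"
chunks = clean_chunks(parse_document(raw))   # steps 1 and 2
indexed = bulk_index([chunks], context=None) # step 3
print(chunks)   # ['First idea.', 'Second idea.']
print(indexed)  # 1
```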
Especially for large ingestion jobs, we recommend having one worker perform tasks 1 and 2, then using a queue to hand the cleaned simon.ParsedDocuments off to a group of parallel workers performing task 3.
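The producer/worker handoff above can be sketched with the standard library's `queue` and `threading` modules. Here `index_batch` is a hypothetical stand-in; in real code each worker would call `simon.bulk_index` instead.

```python
# One producer hands cleaned documents to a queue; several worker
# threads pull them off and index them. index_batch stands in for
# a real call to simon.bulk_index.
import queue
import threading

work = queue.Queue()
indexed, lock = [], threading.Lock()

def index_batch(docs):
    """Stand-in indexer: record which documents were 'indexed'."""
    with lock:
        indexed.extend(docs)

def worker():
    while True:
        doc = work.get()
        if doc is None:           # sentinel: no more work for this worker
            work.task_done()
            break
        index_batch([doc])
        work.task_done()

workers = [threading.Thread(target=worker) for _ in range(4)]
for t in workers:
    t.start()

# The producer (tasks 1 and 2) hands cleaned documents to the workers.
for doc in ["doc-a", "doc-b", "doc-c"]:
    work.put(doc)
for _ in workers:                 # one sentinel per worker
    work.put(None)

work.join()
for t in workers:
    t.join()
print(sorted(indexed))  # ['doc-a', 'doc-b', 'doc-c']
```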
Simon has three base parsing tools to help you easily create simon.ParsedDocuments.
Though you can manually create a simon.ParsedDocument from its constructor, doing so is not recommended. Instead, we recommend creating one with the parsing tools and then reformatting its attributes afterwards.
This step is frequently unnecessary if you are ingesting normal prose. However, if you are ingesting bulleted text, code, or other structured information, it may be helpful to clean up the chunking of your ParsedDocument.
The property of interest is doc.paragraphs. It is a list of strings, and should contain the semantic "chunks" (i.e. "paragraphs", or their equivalents) of your document: each element of that list should contain a subset of the document expressing one distinct idea, ordered by appearance in the original document.
So, to clean up the chunking, simply edit the doc.paragraphs list directly.
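For instance, a naive parse of bulleted text often leaves each bullet as its own chunk even though the whole list expresses one idea. A minimal sketch of merging such fragments (plain Python over a list of strings standing in for doc.paragraphs):

```python
def merge_bullets(paragraphs):
    """Merge runs of consecutive bullet lines into one chunk, since a
    bulleted list usually expresses a single distinct idea."""
    merged, bullets = [], []
    for p in paragraphs:
        if p.lstrip().startswith("-"):
            bullets.append(p)
        else:
            if bullets:
                merged.append("\n".join(bullets))
                bullets = []
            merged.append(p)
    if bullets:
        merged.append("\n".join(bullets))
    return merged

# Standing in for doc.paragraphs after a naive parse:
paragraphs = ["Intro prose.", "- step one", "- step two", "Closing prose."]
print(merge_bullets(paragraphs))
# ['Intro prose.', '- step one\n- step two', 'Closing prose.']
```

In real code you would assign the result back, e.g. `doc.paragraphs = merge_bullets(doc.paragraphs)`.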
Once your parsed documents are cleaned, you are one step away from having them ingested into Simon!
We recommend using the bulk_index API for all tasks involving manually created ParsedDocuments. The function should be thread safe (though note that you may need a custom, thread-safe database connection), and offers throughput as high as your database and embedding model can handle.
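Because the function is thread safe, fanning batches out across a thread pool is straightforward. A sketch with a stand-in indexer (in real code, each submitted batch would go to simon.bulk_index, with a thread-safe database connection per worker):

```python
# Fan batches of documents out across a thread pool.
from concurrent.futures import ThreadPoolExecutor

def bulk_index_stub(batch):
    """Stand-in for simon.bulk_index(batch, context): returns batch size."""
    return len(batch)

docs = [f"doc-{i}" for i in range(10)]
batches = [docs[i:i + 3] for i in range(0, len(docs), 3)]  # batches of 3

with ThreadPoolExecutor(max_workers=4) as pool:
    counts = list(pool.map(bulk_index_stub, batches))

print(sum(counts))  # 10 documents indexed in total
```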