Custom/Bulk Ingestion
This document dives a little deeper into how Simon ingestion works, and gives you some opportunities to customize it yourself. Please make sure you are familiar with the contents of the Datastore API before reading this guide. In general, we recommend the Datastore API over the custom API described here unless you have a specific reason to use it (parallelization, custom cleanup, etc.).
If you do, let's get started.
Concept
Low-level ingestion in Simon works in three steps:
- Parsing: turning the PDF/text/webpage you wish to ingest into a simon.ParsedDocument, which contains "chunks" of the text to ingest
- (optional) Cleaning: chunks in the simon.ParsedDocument get squished around as needed to ensure optimal search results
- Indexing: simon.bulk_index([list, of, parseddocuments], context)
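Concretely, a minimal sequential pass over all three steps might look like the sketch below; it assumes you already have a context built as described in the Datastore API guide.
import simon
# context = ...  (built as described in the Datastore API guide)
# 1. parse raw text into a ParsedDocument
doc = simon.parse_text("some long document text...", "A Title", "a-source-id")
# 2. (optional) clean the chunks
doc.paragraphs = [p.strip() for p in doc.paragraphs if p.strip()]
# 3. index
simon.bulk_index([doc], context)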
Especially for large ingestion jobs, we recommend having one worker perform tasks 1 and 2, then using a queue to hand cleaned simon.ParsedDocuments off to a bunch of parallel workers performing simon.bulk_index.
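Here is a sketch of that layout using Python's standard multiprocessing tools. It assumes your ParsedDocuments can be pickled across processes, and uses a hypothetical make_context() helper that builds a fresh context per worker, since database connections often cannot be shared between processes.
import multiprocessing as mp
import simon
def parse_and_clean(paths, queue):
    # worker performing steps 1 and 2: parse, clean, then hand off
    for path in paths:
        doc = simon.parse_tika(path, path, path)
        doc.paragraphs = [p.strip() for p in doc.paragraphs if p.strip()]
        queue.put(doc)
    queue.put(None)  # sentinel: no more documents
def index_worker(queue):
    context = make_context()  # hypothetical: one fresh context/connection per worker
    while True:
        doc = queue.get()
        if doc is None:
            queue.put(None)  # re-queue the sentinel for sibling workers
            break
        simon.bulk_index([doc], context)
if __name__ == "__main__":
    queue = mp.Queue()
    workers = [mp.Process(target=index_worker, args=(queue,)) for _ in range(4)]
    for w in workers:
        w.start()
    parse_and_clean(["one.pdf", "two.pdf"], queue)
    for w in workers:
        w.join()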
Parsing
Simon has three base parsing tools to help you easily create simon.ParsedDocument objects.
# our tools
from simon import parse_text, parse_tika, parse_web
# parsing raw text
doc = parse_text(text_str, title_str, source_str)
# OCR parse a file (requires Java)
doc = parse_tika(local_file_path, title_str, source_str)
# parse a web page
doc = parse_web(raw_html_text, title_str, source_str)
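Note that parse_web takes the raw HTML of the page, not a URL; fetching is up to you. For instance (using requests purely as an illustration here; it is not a Simon dependency):
import requests
from simon import parse_web
url = "https://example.com/article"
html = requests.get(url).text
doc = parse_web(html, "Example Article", url)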
Though you can manually create a simon.ParsedDocument from its constructor, this is not recommended. Instead, we recommend creating one with the parsing tools above, then reformatting its contents after the fact.
(optional) Cleaning
This step is frequently unnecessary if you are ingesting normal prose. However, if you are ingesting bulleted text, code, or other structured information, it may be helpful to clean up the chunking of your ParsedDocument.
The property of interest is doc.paragraphs. It is a list of strings, and should contain semantic "chunks" (i.e. "paragraphs", or equivalents) of your document: that is, each element of that list should contain a subset of the document containing one distinct idea, ordered by appearance in the original document.
So, to clean up the ParsedDocument:
# doc = parse_text(...)
# get the default chunks, which is a list of text
chunks = doc.paragraphs
# clean them as needed: such as splitting a chunk into two, combining chunks, removing symbols etc.
chunks_cleaned = my_cleanup_function(chunks)
# set it back
doc.paragraphs = chunks_cleaned
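What my_cleanup_function does is entirely up to you and your data. As one illustration (a sketch, not part of Simon), here is a cleanup that drops empty chunks and merges very short fragments, like stray bullet points, into the chunk before them:
def my_cleanup_function(chunks, min_length=80):
    # keep each chunk one coherent idea: drop empties, and fold
    # fragments shorter than min_length into the previous chunk
    cleaned = []
    for chunk in chunks:
        chunk = chunk.strip()
        if not chunk:
            continue
        if cleaned and len(chunk) < min_length:
            cleaned[-1] = cleaned[-1] + " " + chunk
        else:
            cleaned.append(chunk)
    return cleaned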
(Parallelizable) Indexing
Once your parsed documents are all cleaned, you are one step away from having them ingested into Simon!
We recommend using the bulk_index API for all tasks involving manually created ParsedDocuments. The function should be thread safe (though, note, you may need a custom database connection that is itself thread safe), and its throughput is as high as your database and embedding model can handle.