Custom/Bulk Ingestion
This document dives a little deeper into how Simon ingestion works, and gives you some opportunities to customize it yourself. Please make sure you are familiar with the contents of the Datastore API before reading this guide. In general, we recommend the Datastore API over the custom API described here unless you have a specific reason to use it (parallelization, custom cleanup, etc.).
If you do, let's get started.
Concept
Low-level ingestion in Simon works in three steps:
- Parsing: turning the PDF/text/webpage you wish to ingest into a simon.ParsedDocument, which contains "chunks" of the text to ingest
- (optional) Cleaning: chunks in the simon.ParsedDocument get squished around as needed to ensure optimal search results
- Indexing: simon.bulk_index([list, of, parseddocuments], context)
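Concretely, a minimal sequential pass over all three steps might look like the sketch below; it assumes you already have a context built as described in the Datastore API guide.
import simon
# context = ...  (built as described in the Datastore API guide)
# 1. parse raw text into a ParsedDocument
doc = simon.parse_text("some long document text...", "A Title", "a-source-id")
# 2. (optional) clean the chunks
doc.paragraphs = [p.strip() for p in doc.paragraphs if p.strip()]
# 3. index
simon.bulk_index([doc], context)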
Especially for large ingestion jobs, we recommend having one worker perform tasks 1 and 2, then using a queue to hand cleaned simon.ParsedDocuments off to a bunch of parallel workers performing simon.bulk_index.
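Here is a sketch of that layout using Python's standard multiprocessing tools. It assumes your ParsedDocuments can be pickled across processes, and uses a hypothetical make_context() helper that builds a fresh context per worker, since database connections often cannot be shared between processes.
import multiprocessing as mp
import simon
def parse_and_clean(paths, queue):
    # worker performing steps 1 and 2: parse, clean, then hand off
    for path in paths:
        doc = simon.parse_tika(path, path, path)
        doc.paragraphs = [p.strip() for p in doc.paragraphs if p.strip()]
        queue.put(doc)
    queue.put(None)  # sentinel: no more documents
def index_worker(queue):
    context = make_context()  # hypothetical: one fresh context/connection per worker
    while True:
        doc = queue.get()
        if doc is None:
            queue.put(None)  # re-queue the sentinel for sibling workers
            break
        simon.bulk_index([doc], context)
if __name__ == "__main__":
    queue = mp.Queue()
    workers = [mp.Process(target=index_worker, args=(queue,)) for _ in range(4)]
    for w in workers:
        w.start()
    parse_and_clean(["one.pdf", "two.pdf"], queue)
    for w in workers:
        w.join()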
Parsing
Simon has three base parsing tools to help you easily create simon.ParsedDocument objects.
# our tools
from simon import parse_text, parse_tika, parse_web
# parsing raw text
doc = parse_text(text_str, title_str, source_str)
# OCR parse a file (requires Java)
doc = parse_tika(local_file_path, title_str, source_str)
# parse a web page
doc = parse_web(raw_html_text, title_str, source_str)
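Note that parse_web takes the raw HTML of the page, not a URL; fetching is up to you. For instance (using requests purely as an illustration here; it is not a Simon dependency):
import requests
from simon import parse_web
url = "https://example.com/article"
html = requests.get(url).text
doc = parse_web(html, "Example Article", url)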
Though you can manually create a simon.ParsedDocument from its constructor, this is not recommended. Instead, we recommend creating one with the parsing tools above, then reformatting its contents after the fact.
(optional) Cleaning
This step is frequently unnecessary if you are ingesting normal prose. However, if you are ingesting bulleted text, code, or other structured information, it may be helpful to clean up the chunking of your ParsedDocument.
The property of interest is doc.paragraphs. It is a list of strings, and should contain semantic "chunks" (i.e. "paragraphs", or equivalents) of your document: that is, each element of that list should contain a subset of the document containing one distinct idea, ordered by appearance in the original document.
So, to clean up the ParsedDocument:
# doc = parse_text(...)
# get the default chunks, which is a list of text
chunks = doc.paragraphs
# clean them as needed: such as splitting a chunk into two, combining chunks, removing symbols etc.
chunks_cleaned = my_cleanup_function(chunks)
# set it back
doc.paragraphs = chunks_cleaned
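What my_cleanup_function does is entirely up to you and your data. As one illustration (a sketch, not part of Simon), here is a cleanup that drops empty chunks and merges very short fragments, like stray bullet points, into the chunk before them:
def my_cleanup_function(chunks, min_length=80):
    # keep each chunk one coherent idea: drop empties, and fold
    # fragments shorter than min_length into the previous chunk
    cleaned = []
    for chunk in chunks:
        chunk = chunk.strip()
        if not chunk:
            continue
        if cleaned and len(chunk) < min_length:
            cleaned[-1] = cleaned[-1] + " " + chunk
        else:
            cleaned.append(chunk)
    return cleaned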
(Parallelizable) Indexing
Once your parsed documents are all cleaned, you are one step away from having them ingested into Simon!
We recommend using the bulk_index API for all tasks involving manually created ParsedDocuments. The function should be thread safe (though, note, you may need a custom database connection that is itself thread safe), and its throughput is as high as your database and embedding model can handle.