Increase the percentage of each document included in semantic search
Right now Semantic Search depends on the documents we index being small enough to fit into a single fixed-size embedding, because we only generate one embedding per document. The exact cut-off is unknown, but it may be somewhere around 250 tokens. Unfortunately some documents are larger than that, so relevant information is likely being excluded from search. A search for "India" demonstrates this: around 700 documents could be returned, but I'm seeing far fewer than that, probably because the field that contains the project location falls past the cutoff.

The technical task here would be to chunk each document before feeding it into the embedding model, much as @gridinoc has been experimenting with (possibly in the embedding API, or in the Semantic Search Django app), and then adapt the search to take that into account: ideally by computing some sort of score from the multiple chunk matches and re-ranking, or failing that, at least making sure each document is only returned once. The way Semantic Search is built anticipates this: multiple embeddings can already be returned and stored per document.

One possible challenge is computing scores in the Django ORM in a way that fits into the search architecture. If we manage to solve that, though, hybrid search becomes possible.
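As a rough sketch of the chunking plus re-ranking idea, assuming a pgvector-backed store: the model names (`Document`, `DocumentEmbedding`), the `embeddings` related name, and the `embed()` callable below are hypothetical stand-ins, not the actual schema.

```python
from django.db.models import Min
from pgvector.django import CosineDistance

# Hypothetical app models: Document has text, DocumentEmbedding has a
# ForeignKey(Document, related_name="embeddings") and a VectorField "vector".
from .models import Document, DocumentEmbedding


def chunk_text(text: str, max_tokens: int = 250, overlap: int = 50) -> list[str]:
    """Split a document into overlapping windows so no field falls past the cutoff."""
    words = text.split()  # crude whitespace "tokens"; a real tokenizer would go here
    step = max_tokens - overlap
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), step)] or [""]


def index_document(document, embed):
    """Store one embedding per chunk; the schema already allows several per document."""
    for position, chunk in enumerate(chunk_text(document.text)):
        DocumentEmbedding.objects.create(
            document=document,
            position=position,
            vector=embed(chunk),
        )


def search(query: str, embed, limit: int = 20):
    """Score each document by its best-matching chunk, then re-rank by that score."""
    query_vector = embed(query)
    return (
        Document.objects
        # Min over cosine distance = the single closest chunk for each document.
        .annotate(score=Min(CosineDistance("embeddings__vector", query_vector)))
        .order_by("score")[:limit]
    )
```

Aggregating over the chunk distances in the ORM keeps the whole thing in one SQL query and guarantees each document appears only once; the same annotation could later be combined with a keyword score if we get to hybrid search.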