Implement a semantic search backend for Smart Start
Smart Start, as it exists now, is an advanced search feature that uses an LLM to parse the intent of a natural-language query and then filters the results by existing facets.
There are, however, some challenges with this:
- We're sending a large amount of JSON (a significant fraction of the metadata in the database) to the largest OpenAI model in about 14 parallel API calls on each query. This costs about 40 cents and takes about 4 seconds in the usual case, but latency varies significantly, and a single call can take as long as 50 seconds.
- There's also significant variation in the quality and stability of the results, since the LLM's process is probabilistic.
- For topic-based filtering, we rely on a key-words-and-phrases field with more than 7,000 unique entries. The actual facets are too numerous to send to the LLM, so we instead ask it to invent keywords that we then fuzzy match against the facets. This has prompted some suggestions to clean up the field, but I think it is potentially unreliable as-is, since users clearly treat it as open-ended.
- There is, at the moment, no way to search by title or description. These fields may carry the most semantic meaning, since they potentially contain more context than keywords.
- As Justin details in #47 (closed), the LLM has no knowledge of how sets of facets relate, so it sometimes filters too aggressively and returns zero results even when there are results that should be returned. A user could never select such a combination themselves (zero-result facets are removed as the user selects them), and correcting it is annoying (the LLM can check a lot of boxes, and there's no good feedback about which ones to uncheck).
- We've tried to partly address this by adding "conceptual explosion," in which related facets are selected to pull in more results. But unlike a traditional search, results aren't ranked, and there's currently no way to put more relevant matches first, so the cutoff for what counts as conceptually similar will be tricky to determine.
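For concreteness, the keyword fuzzy matching described above works roughly like this. This is a minimal sketch, not our actual implementation: the facet values and LLM-invented keywords are made-up examples, and the real matcher may differ.

```python
from difflib import get_close_matches

# Hypothetical subset of the 7,000+ "key words and phrases" facet values.
facets = ["clean water access", "water purification", "maternal health",
          "renewable energy", "early childhood education"]

# Keywords the LLM might invent for a query like "safe drinking water".
llm_keywords = ["drinking water", "water access", "sanitation"]

# Fuzzy match each invented keyword against the real facet values;
# cutoff controls how loose a match we accept.
matched = set()
for kw in llm_keywords:
    matched.update(get_close_matches(kw, facets, n=2, cutoff=0.5))
```

The fragility is visible even here: whether a result appears depends on the LLM inventing a keyword that happens to land close enough to a facet string, with the cutoff as a blunt instrument.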
Overall, I think we tackled the most advanced and fiddly version of a semantic search first, in part because when Smart Start was conceived it was based around using existing APIs and wasn't supposed to be search. It clearly is search now, though.
Tackling the hardest version of the feature first seems a little backwards, so I propose we reverse the process: retrieve results based on semantic fuzzy matching first.
Here's what I've seen work elsewhere to create semantic search and what I suggest we try:
- Create a semantic embedding (that is, use a model to turn the contents of the titles, descriptions, and metadata into numeric vectors) when the proposals are loaded into the database or when an indexing process runs.
- Store those in an index capable of retrieving approximate results quickly, perhaps alongside the rest of the data in Django's Postgres database using the pgvector extension (though other data stores may work, and we don't have to fully integrate this into Torque's database in the first iteration).
- When a user searches, create a semantic embedding for their query in the same way and run it against this index first, getting fuzzy-matched results ranked by similarity, typically in milliseconds and at most a second or two.
- We'll still receive facets (filtered to match the result set), and we can optionally use LLMs to evaluate and improve the results, for example by continuing to select some relevant facets, ranking, or summarizing.
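The indexing and query steps above can be sketched end to end. This is a minimal stand-in, not a design: `embed` here is just a bag-of-words vector, where in practice we'd call a real embedding model (e.g. an embedding API or a local model) and store fixed-length float vectors in pgvector rather than in memory; the proposal titles are invented examples.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words token count.
    # In production this would be a call to an embedding model, and the
    # result would be a fixed-length float vector stored via pgvector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity -- the same ranking measure pgvector's
    # cosine-distance operator provides, inverted (higher = closer).
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Index time: embed title + description for each proposal (invented examples).
proposals = {
    1: "Clean water access for rural communities",
    2: "Renewable energy microgrids",
    3: "Safe drinking water purification systems",
}
index = {pid: embed(text) for pid, text in proposals.items()}

# Query time: embed the user's query and rank all proposals by similarity.
query_vec = embed("drinking water safety")
ranked = sorted(index, key=lambda pid: cosine(index[pid], query_vec),
                reverse=True)
# Proposal 3 ranks first, proposal 1 is a related match lower down,
# and nothing is filtered out entirely -- just ranked lower.
```

Note what this buys us compared to facet filtering: there is no zero-result failure mode, only a ranking, and the "related results lower down" behavior falls out for free.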
Why this is good:
- Initial results will be available faster and more predictably.
- Results will be more consistent and stable, allowing us to optimize more confidently.
- We can fuzzy match to key words and phrases while not totally relying on them or their cleanliness.
- We can search titles and descriptions.
- This doesn't rely on facet filtering and returns a smaller set of facets, so should be less likely to pick zero-result combinations.
- We can rank more relevant results first, and include related results lower down.
It will act more like search, but an LLM-powered search that takes semantic meaning into account.
This does involve some significant changes to our current approach, and there are some open questions about implementation, but I think it is the right direction.