
Draft: Add semantic search

Chris Zubak-Skees requested to merge add-semantic-search into main

This MR addresses ots/llm/meta#54 by adding a proof-of-concept semantic similarity search to Torque. It's the first step to using this functionality in Smart Start.

Here's an example:

Screenshot_2024-10-17_at_9.26.57_PM.png

It does this by adding a field to the search cache that holds semantic vector embeddings of each proposal's metadata, title and description, plus a method in the Torque Explore feature that uses those embeddings for semantic similarity searches.

This is in some ways the simplest possible implementation, but it may not be the final design we settle on. It's meant to demonstrate one approach and advance the conversation.

Somewhat controversially, this adds the feature to Torque, not the LLM API, because Torque already has the logic for faceting and user permissions. We'd have to duplicate that logic if we added the feature elsewhere, and respecting user permissions may be best done here. This essentially keeps Torque responsible for data storage. However, some parts of this, like generating the embeddings, may be best done elsewhere, and perhaps that should be a call to the LLM API instead.

This does not attempt chunking yet. It relies on the indexed portion of each document mostly fitting in about 250 words (which roughly corresponds to the size of embeddings we're storing). A future version might combine this with traditional text search or further refine the scoring.
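To illustrate the word limit, here is a minimal sketch of how the text to embed could be assembled and capped without chunking. The helper names `truncate_words` and `build_embedding_text` are hypothetical and not taken from this MR, and counting whitespace-separated words is an assumption made for simplicity; an embedding model actually limits tokens, not words.

```python
def truncate_words(text: str, max_words: int = 250) -> str:
    """Keep at most max_words whitespace-separated words of text."""
    words = text.split()
    return " ".join(words[:max_words])


def build_embedding_text(fields: dict, field_names: list, max_words: int = 250) -> str:
    """Join the chosen fields in order, then cap the total word count.

    Hypothetical helper: anything past the cap is simply dropped,
    which is what "no chunking" amounts to in this sketch.
    """
    parts = [fields[name] for name in field_names if fields.get(name)]
    return truncate_words("\n".join(parts), max_words)


doc = {
    "Project Title": "Clean water",
    "Project Description": "word " * 300,
}
text = build_embedding_text(doc, ["Project Title", "Project Description"])
print(len(text.split()))
```

Under this approach, long descriptions lose their tail, which is the trade-off chunking would address.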

Setup

First, you need a working Torque install. A future MR will add Ansible steps to automate setup; in the meantime, this gives you some idea:

pipenv install

This requires pgvector, which can be installed this way:

sudo apt install -y postgresql-common
sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh
sudo apt install postgresql-13-pgvector # where 13 is the major version of Postgres

It requires a migration, e.g.:

sudo -u postgres psql torque -c "create extension vector;"
pipenv run python manage.py migrate

(The migration does try to create the extension, but it fails if not run as a database superuser, so we do that separately.)

It requires the following in setup.py:

TORQUE_SEARCH_EMBEDDING_MODEL = "Snowflake/snowflake-arctic-embed-m-v1.5"
TORQUE_SEARCH_EMBEDDING_NUM_DIMENSIONS = 768
TORQUE_SEARCH_EMBEDDING_ADDITIONAL_FIELDS = [
    "Project Title",
    "Project Description"
]

And it requires a reindex of the search cache, e.g.:

sudo -u postgres psql torque -c "update torque_searchcachedocument set dirty = true;"
pipenv run python manage.py run_cache_rebuilder

This cache-rebuilding step is slow, which is perhaps another argument for offloading the embedding task.

Once all that's been done, you should be able to query the data in Explore using semantic search at a URL like: http://localhost/DemoView/index.php/Special:TorqueExplore?f=%7B%22admin_review%22%3A%5B%22Valid%22%5D%2C%22competition_status%22%3A%5B%22Active%22%5D%7D&q%5B%5D=water+in+india&similarity=0.6
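Since there is no UI yet, it may help to build the URL programmatically. The sketch below reconstructs the example URL with the standard library. The parameter names `f`, `q[]` and `similarity` are read off the URL above; the function name `explore_url` is hypothetical.

```python
import json
from urllib.parse import urlencode

BASE = "http://localhost/DemoView/index.php/Special:TorqueExplore"


def explore_url(terms, filters, similarity):
    """Build an Explore URL with facet filters, search terms and a similarity cutoff."""
    # The facet filters travel as compact JSON in the "f" parameter.
    params = [("f", json.dumps(filters, separators=(",", ":")))]
    # Each search term is a separate "q[]" entry.
    params += [("q[]", term) for term in terms]
    params.append(("similarity", str(similarity)))
    return BASE + "?" + urlencode(params)


url = explore_url(
    ["water in india"],
    {"admin_review": ["Valid"], "competition_status": ["Active"]},
    0.6,
)
print(url)
```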

The search should appear in the terms field below the Smart Start box. The &similarity=0.6 parameter sets the cutoff: the search returns only results whose approximate cosine similarity score meets it. There is currently no UI to turn this on, so you have to add &similarity=0.6 yourself. The UI will be handled in a separate MR.
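As a rough illustration of what the similarity cutoff means, here is a pure-Python sketch that scores in-memory vectors by cosine similarity and keeps those at or above the threshold. This is not the MR's implementation, which stores embeddings in Postgres via pgvector. The function names, the toy two-dimensional vectors and the inclusive `>=` comparison are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, docs, similarity=0.6):
    """Return (score, key) pairs at or above the cutoff, best match first."""
    scored = [(cosine_similarity(query_vec, vec), key) for key, vec in docs.items()]
    return sorted([(s, k) for s, k in scored if s >= similarity], reverse=True)


# Toy 2-dimensional "embeddings"; the configured model produces 768 dimensions.
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
results = search([1.0, 0.0], docs, similarity=0.6)
print([key for _, key in results])
```

Note that pgvector's cosine distance operator returns one minus the cosine similarity, so a similarity cutoff of 0.6 would correspond to a distance of at most 0.4 if the query is expressed that way.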

Edited by Chris Zubak-Skees
