Add semantic search (!107)

Chris Zubak-Skees requested to merge add-semantic-search into main Oct 17, 2024

This MR addresses ots/llm/meta#54 (closed) by adding a proof-of-concept semantic similarity search to Torque. It's the first step to using this functionality in Smart Start.

It does this by adding an app that stores semantic vector embeddings of each proposal's metadata, title, and description in the search cache, plus a method that uses those embeddings to run semantic similarity searches from the Torque Explore feature.
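
As a rough illustration of what "semantic similarity search" means here, the sketch below ranks cached embeddings against a query embedding by cosine similarity. It is not the code in this MR: the function names, the toy 3-dimensional vectors, and the choice of cosine similarity are all assumptions made for the example (pgvector supports cosine distance alongside other metrics, and in practice the ranking would run in Postgres rather than in Python).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_embedding, cached_embeddings, limit=10):
    """Return up to `limit` (key, score) pairs, most similar first.

    `cached_embeddings` is a hypothetical dict of proposal key -> embedding,
    standing in for the search cache.
    """
    scored = [
        (key, cosine_similarity(query_embedding, embedding))
        for key, embedding in cached_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Toy data: real embeddings would come from the embedding API and be much longer.
cached = {
    "proposal-a": [1.0, 0.0, 0.0],
    "proposal-b": [0.7, 0.7, 0.0],
    "proposal-c": [0.0, 0.0, 1.0],
}
results = rank_by_similarity([1.0, 0.1, 0.0], cached, limit=2)
```

With these toy vectors, `results` lists `proposal-a` first and `proposal-b` second, because their vectors point in nearly the same direction as the query.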

This adds the feature to Torque as an app because Torque already has the logic for faceting and user permissions. Otherwise we'd have to duplicate that logic elsewhere, and user permissions are probably best enforced here.

Setup

First, you need a working Torque install. A future MR will add Ansible steps to automate the setup; in the meantime, the following gives you some idea of what's involved:

pipenv install django-torque[semantic_search]

This requires pgvector, which can be installed this way:

sudo apt install -y postgresql-common
sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh
sudo apt install postgresql-13-pgvector # where 13 is the major version of Postgres

It requires the LLM API to be running:

git clone https://code.librehq.com/ots/llm/llm-api.git
cd llm-api
cp .env.example .env.local # then configure this
make run

It requires the following in your Django settings file (settings.py):

INSTALLED_APPS = [
    # ...
    "torque",
    "torque.cache_rebuilder",
    "torque.semantic_search",
    # ...
]

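# Assumption: `utils` is Torque's utils module, imported earlier in this file
# (for example, `from torque import utils`); adjust to match your install.

# Exposes each proposal's "Project Title" field under the name "project_title".
class ProjectTitle(utils.Filter):
    def name(self):
        return "project_title"

    def document_value(self, document):
        # Fall back to an empty string when the field is missing.
        return document.get("Project Title", "")


# Exposes each proposal's "Project Description" field as "project_description".
class ProjectDescription(utils.Filter):
    def name(self):
        return "project_description"

    def document_value(self, document):
        # Fall back to an empty string when the field is missing.
        return document.get("Project Description", "")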

SEMANTIC_SEARCH_EMBEDDING_API_KEY = "your-api-key" # replace with your llm-api API key
SEMANTIC_SEARCH_EMBEDDING_API_BASE_URL = "http://localhost:8889" # replace with the base URL of your llm-api
SEMANTIC_SEARCH_ADDITIONAL_FILTERS = [
    ProjectTitle(),
    ProjectDescription(),
]
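
To show what the two filter classes return for a given proposal, here is a self-contained sketch. The `Filter` base class below is a hypothetical stand-in for Torque's `utils.Filter`, included only so the example runs on its own, and `extract_values` is an illustrative helper, not something this MR adds. How the semantic search app actually consumes these values is not shown here.

```python
class Filter:
    """Hypothetical stand-in for Torque's utils.Filter, for illustration only."""

    def name(self):
        raise NotImplementedError

    def document_value(self, document):
        raise NotImplementedError


class ProjectTitle(Filter):
    def name(self):
        return "project_title"

    def document_value(self, document):
        return document.get("Project Title", "")


class ProjectDescription(Filter):
    def name(self):
        return "project_description"

    def document_value(self, document):
        return document.get("Project Description", "")


def extract_values(document, filters):
    """Illustrative helper: map each filter's name to its value for one document."""
    return {f.name(): f.document_value(document) for f in filters}


# A sample proposal with no description, to show the empty-string fallback.
sample_document = {"Project Title": "Clean water for all"}
values = extract_values(sample_document, [ProjectTitle(), ProjectDescription()])
```

Here `values["project_title"]` is the title string and `values["project_description"]` is an empty string, since the sample document has no "Project Description" field.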

It requires a migration, e.g.:

sudo -u postgres psql torque -c "create extension vector;"
pipenv run python manage.py migrate semantic_search

(The migration does try to create the extension, but it fails if not run as a database superuser, so we do that separately.)

And it requires a reindex of the search cache, e.g.:

sudo -u postgres psql torque -c "update torque_searchcachedocument set dirty = true;"
pipenv run python manage.py run_cache_rebuilder
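
The two commands above follow a dirty-flag pattern: mark every cached document as needing work, then let the rebuilder pick up whatever is marked. The sketch below models that pattern with in-memory stand-ins. The `CachedDocument` class, both function names, and the assumption that the rebuilder clears the flag after reprocessing are all hypothetical; this is not Torque's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CachedDocument:
    """Hypothetical in-memory stand-in for a row in the search cache."""

    key: str
    text: str
    dirty: bool = False
    indexed: Optional[str] = None


def mark_all_dirty(documents: List[CachedDocument]) -> None:
    """Plays the role of the SQL update: flag every document for reindexing."""
    for document in documents:
        document.dirty = True


def rebuild_dirty(documents: List[CachedDocument], index_fn: Callable[[str], str]) -> int:
    """Reprocess only flagged documents, clear their flags, and return the count."""
    rebuilt = 0
    for document in documents:
        if document.dirty:
            document.indexed = index_fn(document.text)
            document.dirty = False
            rebuilt += 1
    return rebuilt


documents = [
    CachedDocument(key="proposal-a", text="clean water"),
    CachedDocument(key="proposal-b", text="rural broadband"),
]
mark_all_dirty(documents)
# str.upper stands in for the real (much more expensive) indexing step.
first_pass = rebuild_dirty(documents, str.upper)
second_pass = rebuild_dirty(documents, str.upper)
```

The first pass reprocesses both documents and the second finds nothing to do, which is why the flags have to be set again before a full reindex.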

Once all that's been done, you should be able to query the data in Explore by filtering on Terms in the All Filters pane. (I don't think this is how it should ultimately be used; the input should come from the Smart Start search box instead. This is just how I'm testing.)

Edited Oct 29, 2024 by Chris Zubak-Skees
