Merge pull request #10 from openradx/pgsearch

Use Postgres for full text search
openradx · Jul 20, 2024 · 8389f10
2 parents dc296f2 + f567465

Showing 49 changed files with 788 additions and 4,271 deletions.
diff --git a/Dockerfile b/Dockerfile
@@ -36,18 +36,6 @@ RUN apt-get update \
     # deps for db management commands
     postgresql-client
 
-# Install Vespa CLI
-ADD https://github.com/vespa-engine/vespa/releases/download/v8.302.40/vespa-cli_8.302.40_linux_amd64.tar.gz /tmp/vespa-cli.tar.gz
-RUN  mkdir /tmp/vespa-cli \
-    && tar -xzf /tmp/vespa-cli.tar.gz -C /tmp/vespa-cli --strip-components 1 \
-    && cp -r /tmp/vespa-cli/bin/* /usr/local/bin/ \
-    && cp -r /tmp/vespa-cli/share/* /usr/local/share/ \
-    && rm -rf /tmp/vespa-cli.tar.gz /tmp/vespa-cli
-
-# Download embedding model, which gets deployed by Django vespa management command (see vespa.py)
-ADD "https://github.com/vespa-engine/sample-apps/raw/master/simple-semantic-search/model/e5-small-v2-int8.onnx" /opt/models/model.onnx
-ADD "https://raw.githubusercontent.com/vespa-engine/sample-apps/master/simple-semantic-search/model/tokenizer.json" /opt/models/tokenizer.json
-
 
 # `builder-base` stage is used to build deps + create our virtual environment
 FROM python-base as builder-base

diff --git a/README.md b/README.md
@@ -29,7 +29,7 @@ Upcoming ...
 
 ## Architectural overview
 
-RADIS is built using the Django web framework, and data is stored in a [PostgreSQL](https://www.postgresql.org/) database. Uploaded radiology reports are additionally stored and indexed for full text search in a [Vespa](https://vespa.ai/) vector database. The design of RADIS is very modular so that other text search databases can easily be integrated.
+RADIS is built using the Django web framework, and data is stored in a [PostgreSQL](https://www.postgresql.org/) database, which is also used for full-text search. The design of RADIS is very modular so that other text search databases can easily be integrated.
 
 ## Contributors
 

diff --git a/TODO.md b/TODO.md
@@ -7,10 +7,6 @@
 - Somehow present provider.max_results to the user, especially important if the number of query results (step 1) is larger
 - task control panel
 - Upgrade Github actions workflows (also ADIT)
-- <https://docs.vespa.ai/en/operations/docker-containers.html#mounting-persistent-volumes>
-- Change maxHits and maxOffset for farer pagination
-  - <https://docs.vespa.ai/en/reference/query-api-reference.html#native-execution-parameters>
-  - <https://pyvespa.readthedocs.io/en/latest/reference-api.html#queryprofile>
 - Check if we can get rid of wsgi.py (also in RADIS)
   - <https://forum.djangoproject.com/t/adding-asgi-support-to-runserver/2446/26>
   - <https://github.com/django/django/pull/16634/files>
@@ -22,16 +18,21 @@
 
 ## Fix
 
+- Update list of stop words
+  - The PostgreSQL stop word list contains words like "kein" and "keine"
+  - Customize the list of stop words
+  - The original stop word list can be found in the container at /usr/share/postgresql/16/tsearch_data
+  - There are multiple ways to solve that:
+    - Completely disable stop words: <https://stackoverflow.com/a/2227235/166229>
+    - Create completely new dictionaries
+    - Mount manipulated stop word files (<https://github.com/postgres/postgres/tree/master/src/backend/snowball/stopwords>)
+
 ## Features
 
-- Allow to re-feed Vespa documents
-  - Already WIP in branch vespa-re-feed
-  - Optionally allow to re-feed without full reset (only update documents, feed_iterable has an option for that)
 - RAG app
   - Let the user provide study date range (from, until), age, modality, keywords, question
   - Add the search job to the queue (similar to transfer jobs in ADIT)
   - Let a worker process the queue in its own service
-  - Pre-filter reports with a Vespa database search (semantic search?)
   - Give each filtered report to an LLM and let it answer the question
   - Constrain the output of the LLM, multiple possibilities for that:
     - <https://github.com/outlines-dev/outlines>
@@ -69,24 +70,12 @@
   -- That is why we currently only use semantic stuff as ranking algorithm
   -- We could add a distanceThreshold to only allow really near neighbors
   -- But the threshold is unclear (not sure if we should allow the user to specify it)
-  -- <https://docs.vespa.ai/en/nearest-neighbor-search-guide.html#strict-filters-and-distant-neighbors>
 - Rename reports model fields to something in the HL7 FHIR standard
   - Interesting resources in this regard are:
     - <https://hl7.org/fhir/patient.html>
     - <https://hl7.org/fhir/observation.html>
     - <https://hl7.org/fhir/diagnosticreport.html>
     - <https://hl7.org/fhir/imagingstudy.html>
-- Adjust the summary dynamic snippets of the search results
-  - <https://docs.vespa.ai/en/document-summaries.html>
-  - Unfortunately, ApplicationConfiguration does not allow to put the configuration inside the content cluster (see link above)
-    - <https://github.com/vespa-engine/pyvespa/blob/75c64ab144f98155387ff1f461632b889c19bd6e/vespa/package.py#L1490>
-    - <https://github.com/vespa-engine/pyvespa/blob/master/vespa/templates/services.xml>
-  - That's why we would need to manipulate the XML files ourselves (maybe with <https://docs.python.org/3/library/xml.etree.elementtree.html>)
-    - or simply wait for <https://github.com/vespa-engine/pyvespa/issues/520>
-- Put an extra "indication" field into the schema
-  - Also must be included in the ranking expression, see <https://pyvespa.readthedocs.io/en/latest/getting-started-pyvespa.html#Define-ranking>
-- Multi node Vespa example setup
-  - <https://github.com/vespa-engine/sample-apps/blob/master/examples/operations/multinode-HA/>
 - Standalone logging server
   - SigNoz <https://github.com/signoz/signoz>
   - Loki <https://github.com/grafana/loki>
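
Reviewer note on the stop word item added to TODO.md above: the toy sketch below shows why stop words such as "kein" matter for radiology reports. It is not PostgreSQL's parser or stemmer. The function `to_terms` and the small stop word set are hypothetical and exist only for illustration; only "kein" and "keine" are taken from the TODO entry.

```python
# Toy model of stop word removal. This is NOT how PostgreSQL tokenizes or stems;
# it only illustrates the effect described in the TODO entry.
# Assumption: an illustrative subset of a German stop word list.
GERMAN_STOP_WORDS = {"kein", "keine", "der", "die", "das", "und"}


def to_terms(text, stop_words=GERMAN_STOP_WORDS):
    """Lowercase, split on whitespace, strip punctuation and drop stop words."""
    tokens = (t.strip(".,;:!?").lower() for t in text.split())
    return [t for t in tokens if t and t not in stop_words]


# With stop words active, a negated and a non-negated finding
# reduce to the same search terms, so the negation is lost.
print(to_terms("Kein Pneumothorax.") == to_terms("Pneumothorax."))

# With stop words disabled, the negation survives as its own term.
print(to_terms("Kein Pneumothorax.", stop_words=set()))
```

In this simplified model, disabling stop words is the only way for a query to distinguish "Kein Pneumothorax" from "Pneumothorax", which is presumably the motivation for customizing the list.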

diff --git a/compose/docker-compose.base.yml b/compose/docker-compose.base.yml
@@ -7,13 +7,6 @@ x-app: &default-app
     USE_DOCKER: 1
     DJANGO_STATIC_ROOT: "/var/www/radis/static/"
     DATABASE_URL: "psql://postgres:[email protected]:5432/postgres"
-    OPENSEARCH_ENABLED: "${OPENSEARCH_ENABLED-true}"
-    OPENSEARCH_HOST: "opensearch-node1.local"
-    OPENSEARCH_PORT: "9200"
-    VESPA_ENABLED: "${VESPA_ENABLED-false}"
-    VESPA_HOST: "vespa.local"
-    VESPA_CONFIG_PORT: "19071"
-    VESPA_DATA_PORT: "8080"
     RABBITMQ_URL: "amqp://rabbit"
     RABBIT_MANAGEMENT_HOST: "rabbit"
     RABBIT_MANAGEMENT_PORT: "15672"
@@ -43,10 +36,6 @@ services:
     <<: *default-app
     hostname: worker_default.local
 
-  worker_vespa:
-    <<: *default-app
-    hostname: worker_vespa.local
-
   worker_llm:
     <<: *default-app
     hostname: worker_llm.local
@@ -65,7 +54,7 @@ services:
       "
 
   postgres:
-    image: postgres:16.2
+    image: postgres:16
     hostname: postgres.local
     volumes:
       - postgres_data:/var/lib/postgresql/data

diff --git a/compose/docker-compose.dev.yml b/compose/docker-compose.dev.yml
@@ -43,14 +43,6 @@ services:
         wait-for-it -s postgres.local:5432 -t 60 && 
         ./manage.py migrate &&
         ./manage.py collectstatic --no-input &&
-        if [[ $$OPENSEARCH_ENABLED == true ]]; then
-          wait-for-it -s opensearch-node1.local:9200 -t 60 &&
-          ./manage.py opensearch --mappings dev
-        fi
-        if [[ $$VESPA_ENABLED == true ]]; then
-          wait-for-it -s vespa.local:19071 -t 60 &&
-          ./manage.py vespa --generate --deploy
-        fi
         ./manage.py populate_users_and_groups --users 20 --groups 3 &&
         ./manage.py populate_reports --report-language de &&
         ./manage.py runserver 0.0.0.0:8000
@@ -61,12 +53,6 @@ services:
     command: |
       ./manage.py celery_worker -c 1 -Q default_queue --autoreload
 
-  worker_vespa:
-    <<: *default-app
-    command: |
-      ./manage.py celery_worker -c 1 -Q vespa_queue --autoreload
-    profiles: ["vespa"]
-
   worker_llm:
     <<: *default-app
     command: |
@@ -103,55 +89,7 @@ services:
     environment:
       POSTGRES_PASSWORD: "postgres"
 
-  # https://opensearch.org/docs/latest/install-and-configure/install-opensearch/docker/#sample-docker-composeyml
-  opensearch_node1:
-    image: opensearchproject/opensearch:latest
-    hostname: opensearch-node1.local
-    volumes:
-      - opensearch_data1:/usr/share/opensearch/data
-    environment:
-      discovery.type: single-node # https://github.com/gitpod-io/gitpod/issues/8399
-      OPENSEARCH_JAVA_OPTS: "-Xms512m -Xmx512m"
-      DISABLE_SECURITY_PLUGIN: "true"
-    ports:
-      - 9200:9200 # REST API
-      - 9600:9600 # Performance Analyzer
-    profiles: ["opensearch"]
-
-  opensearch_dashboards:
-    image: opensearchproject/opensearch-dashboards:latest
-    hostname: opensearch-dashboards.local
-    ports:
-      - 5601:5601
-    environment:
-      DISABLE_SECURITY_DASHBOARDS_PLUGIN: "true"
-      OPENSEARCH_HOSTS: '["http://opensearch-node1.local:9200"]'
-    profiles: ["opensearch"]
-
-  vespa:
-    image: vespaengine/vespa:8
-    hostname: vespa.local
-    healthcheck:
-      test: curl http://localhost:19071/state/v1/health
-      timeout: 10s
-      retries: 3
-      start_period: 40s
-    volumes:
-      - vespa_data:/opt/vespa/var
-      - vespa_logs:/opt/vespa/logs
-    environment:
-      # Reduce memory usage of Vespa during development (espcially on Cloud IDEs), see
-      # https://docs.vespa.ai/en/operations/node-setup.html#memory-settings
-      VESPA_CONFIGSERVER_JVMARGS: "-Xms32M -Xmx128M"
-      VESPA_CONFIGPROXY_JVMARGS: "-Xms32M -Xmx32M"
-    ports:
-      - 9620:8080
-    profiles: ["vespa"]
-
 volumes:
   vscode-server:
   vscode-server-insiders:
   models_data:
-  opensearch_data1:
-  vespa_data:
-  vespa_logs:
diff --git a/compose/docker-compose.prod.yml b/compose/docker-compose.prod.yml
@@ -28,10 +28,6 @@ services:
         ./manage.py collectstatic --no-input &&
         ./manage.py create_admin &&
         ./manage.py generate_cert &&
-        wait-for-it -s opensearch-node1.local:9200 -t 60 &&
-        ./manage.py opensearch --mappings prod &&
-        # wait-for-it -s vespa.local:19071 -t 60 &&
-        # ./manage.py vespa --generate --deploy &&
         ./manage.py ok_server --host 0.0.0.0 --port 8000
       "
     deploy:
@@ -60,12 +56,6 @@ services:
     deploy:
       <<: *deploy
 
-  worker_vespa:
-    <<: *default-app
-    command: ./manage.py celery_worker -c 1 -Q vespa_queue
-    deploy:
-      <<: *deploy
-
   worker_llm:
     <<: *default-app
     command: ./manage.py celery_worker -c 1 -Q llm_queue
@@ -111,56 +101,6 @@ services:
     deploy:
       <<: *deploy
 
-  opensearch_node1:
-    image: opensearchproject/opensearch:2
-    hostname: opensearch-node1.local
-    environment:
-      discovery.type: single-node # https://github.com/gitpod-io/gitpod/issues/8399
-      bootstrap.memory_lock: "true"
-      OPENSEARCH_JAVA_OPTS: "-Xms4g -Xmx4g"
-      DISABLE_SECURITY_PLUGIN: "true"
-    env_file:
-      - ../.env.prod
-    ulimits:
-      memlock:
-        soft: -1
-        hard: -1
-      nofile:
-        soft: 65536
-        hard: 65536
-    volumes:
-      - opensearch_data1:/usr/share/opensearch/data
-    ports:
-      - 9200:9200 # REST API
-      - 9600:9600 # Performance Analyzer
-    deploy:
-      <<: *deploy
-
-  opensearch-dashboards:
-    image: opensearchproject/opensearch-dashboards:2
-    ports:
-      - 5601:5601
-    environment:
-      OPENSEARCH_HOSTS: '["https://opensearch-node1.local:9200"]'
-    deploy:
-      <<: *deploy
-
-  vespa:
-    image: vespaengine/vespa:8
-    hostname: vespa.local
-    healthcheck:
-      test: curl http://localhost:19071/state/v1/health
-      timeout: 10s
-      retries: 3
-      start_period: 40s
-    volumes:
-      - vespa_data:/opt/vespa/var
-      - vespa_logs:/opt/vespa/logs
-    ports:
-      - 9620:8080
-    deploy:
-      replicas: 0
-
   rabbit:
     deploy:
       <<: *deploy
@@ -171,6 +111,3 @@ services:
 
 volumes:
   models_data:
-  opensearch_data1:
-  vespa_data:
-  vespa_logs:
diff --git a/doc/Maintenance.md b/doc/Maintenance.md
@@ -14,7 +14,7 @@ There are different things that can be upgraded:
  - Other upgrades (e.g. major versions) require modifying the version range in `package.json` before calling `npm update`
  - After an upgrade make sure the files in `static/vendor` still link to the correct files in `node_modules`
 - Python and Poetry in the `Dockerfile` that builds the container in which RADIS runs
-- Dependent services in `docker-compose.base.yml`, like PostgreSQL or Vespa database
+- Dependent services in `docker-compose.base.yml`, like PostgreSQL
 - Gitpod development container dependencies in `.gitpod.Dockerfile`
 - Github Codespaces development container dependencies in `.devcontainer/devcontainer.json` and `.devcontainer/Dockerfile`
 - Github actions `.github/workflows/ci.yml` dependencies
diff --git a/example.env b/example.env
@@ -21,7 +21,6 @@ FORCE_DEBUG_TOOLBAR=false
 LLM_MODEL_URL=https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q2_K.gguf
 OPENAI_API_KEY=
 POSTGRES_PASSWORD=mysecret
-OPENSEARCH_INITIAL_ADMIN_PASSWORD=mysecret
 SITE_BASE_URL=http://localhost:8000
 SITE_DOMAIN=localhost
 SITE_NAME=RADIS