Commit

Merge pull request #10 from openradx/pgsearch
Use Postgres for full text search
medihack authored Jul 20, 2024
2 parents dc296f2 + f567465 commit 8389f10
Showing 49 changed files with 788 additions and 4,271 deletions.
12 changes: 0 additions & 12 deletions Dockerfile
@@ -36,18 +36,6 @@ RUN apt-get update \
# deps for db management commands
postgresql-client

# Install Vespa CLI
ADD https://github.com/vespa-engine/vespa/releases/download/v8.302.40/vespa-cli_8.302.40_linux_amd64.tar.gz /tmp/vespa-cli.tar.gz
RUN mkdir /tmp/vespa-cli \
&& tar -xzf /tmp/vespa-cli.tar.gz -C /tmp/vespa-cli --strip-components 1 \
&& cp -r /tmp/vespa-cli/bin/* /usr/local/bin/ \
&& cp -r /tmp/vespa-cli/share/* /usr/local/share/ \
&& rm -rf /tmp/vespa-cli.tar.gz /tmp/vespa-cli

# Download embedding model, which gets deployed by Django vespa management command (see vespa.py)
ADD "https://github.com/vespa-engine/sample-apps/raw/master/simple-semantic-search/model/e5-small-v2-int8.onnx" /opt/models/model.onnx
ADD "https://raw.githubusercontent.com/vespa-engine/sample-apps/master/simple-semantic-search/model/tokenizer.json" /opt/models/tokenizer.json


# `builder-base` stage is used to build deps + create our virtual environment
FROM python-base as builder-base
2 changes: 1 addition & 1 deletion README.md
@@ -29,7 +29,7 @@ Upcoming ...

## Architectural overview

RADIS is built using the Django web framework, and data is stored in a [PostgreSQL](https://www.postgresql.org/) database. Uploaded radiology reports are additionally stored and indexed for full text search in a [Vespa](https://vespa.ai/) vector database. The design of RADIS is very modular so that other text search databases can easily be integrated.
RADIS is built using the Django web framework, and data is stored in a [PostgreSQL](https://www.postgresql.org/) database, which is also used for full text search. The design of RADIS is very modular so that other text search databases can easily be integrated.
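
For orientation, a Postgres full text query of the kind this change relies on might look roughly like the sketch below. It is not taken from the RADIS code base: the table name `reports`, the column `body`, and the helper `build_search_sql` are hypothetical, and the `german` text search configuration is only assumed because the development setup populates German reports.

```python
def build_search_sql(
    table: str = "reports",   # hypothetical table name
    column: str = "body",     # hypothetical text column
    config: str = "german",   # assumed text search configuration
) -> str:
    """Return a parameterized Postgres full text search statement."""
    # Identifiers are interpolated only because they are trusted constants here.
    # The user's search text and the limit stay bound parameters (%s).
    return (
        f"SELECT id, ts_rank(to_tsvector('{config}', {column}), query) AS rank "
        f"FROM {table}, websearch_to_tsquery('{config}', %s) AS query "
        f"WHERE to_tsvector('{config}', {column}) @@ query "
        f"ORDER BY rank DESC LIMIT %s"
    )


print(build_search_sql())
```

With a DB-API cursor (for example from psycopg), the statement could be run as `cursor.execute(build_search_sql(), ["pneumothorax", 10])`. In practice a stored `tsvector` column with a GIN index would usually replace the repeated `to_tsvector` call.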

## Contributors

29 changes: 9 additions & 20 deletions TODO.md
@@ -7,10 +7,6 @@
- Some present provider.max_results to the user somehow, especially important if the query results (step 1) is larger
- task control panel
- Upgrade Github actions workflows (also ADIT)
- <https://docs.vespa.ai/en/operations/docker-containers.html#mounting-persistent-volumes>
- Change maxHits and maxOffset for farer pagination
- <https://docs.vespa.ai/en/reference/query-api-reference.html#native-execution-parameters>
- <https://pyvespa.readthedocs.io/en/latest/reference-api.html#queryprofile>
- Check if we can get rid of wsgi.py (also in RADIS)
- <https://forum.djangoproject.com/t/adding-asgi-support-to-runserver/2446/26>
- <https://github.com/django/django/pull/16634/files>
@@ -22,16 +18,21 @@

## Fix

- Update list of stop words
- Postgresql has stuff like "kein", "keine" in its stop words list
- Customize the list of stop words
- The original stop word list can be found in the container at /usr/share/postgresql/16/tsearch_data
- There are multiple ways to solve that:
- Completely disable stop words: <https://stackoverflow.com/a/2227235/166229>
- Create complete new dictionaries
- Mount manipulated stop word files (<https://github.com/postgres/postgres/tree/master/src/backend/snowball/stopwords>)
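
The last option (mounting manipulated stop word files) could be prepared with a small script along these lines. This is a sketch under assumptions: the helper name `filter_stop_words` is made up for illustration, the stop word file is assumed to contain one word per line, and the set of words to keep searchable is only an example.

```python
def filter_stop_words(stop_file_text: str, keep_searchable: set[str]) -> str:
    """Drop the given (lowercase) words from a one-word-per-line stop word list."""
    kept = []
    for line in stop_file_text.splitlines():
        word = line.strip()
        # Skip blank lines and any word that should remain searchable.
        if not word or word.lower() in keep_searchable:
            continue
        kept.append(word)
    return "\n".join(kept) + "\n"


sample = "der\nkein\nkeine\nund\n"
print(filter_stop_words(sample, {"kein", "keine"}), end="")
# prints:
# der
# und
```

The filtered output would then be written to a file and mounted over the original stop word file in the container directory mentioned above.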

## Features

- Allow to re-feed Vespa documents
- Already WIP in branch vespa-re-feed
- Optionally allow to re-feed without full reset (only update documents, feed_iterable has an option for that)
- RAG app
- Let the user provide study date range (from, until), age, modality, keywords, question
- Add the search job to the queue (similar to transfer jobs in ADIT)
- Let a worker process the queue in its own service
- Pre-filter reports with a Vespa database search (semantic search?)
- Give each filtered report to an LLM and let it answer the question
- Constraint the output of the LLM, multiple possibilities for that:
- <https://github.com/outlines-dev/outlines>
@@ -69,24 +70,12 @@
-- That is why we currently only use semantic stuff as ranking algorithm
-- We could add a distanceThreshold to only allow really near neighbors
-- But the threshold is unclear (not sure if we should allow to the user to specify)
-- <https://docs.vespa.ai/en/nearest-neighbor-search-guide.html#strict-filters-and-distant-neighbors>
- Rename reports model fields to something in the HL7 FHIR standard
- Interesting resources in this regard are:
- <https://hl7.org/fhir/patient.html>
- <https://hl7.org/fhir/observation.html>
- <https://hl7.org/fhir/diagnosticreport.html>
- <https://hl7.org/fhir/imagingstudy.html>
- Adjust the summary dynamic snippets of the search results
- <https://docs.vespa.ai/en/document-summaries.html>
- Unfortunately, ApplicationConfiguration does not allow to put the configuration inside the content cluster (see link above)
- <https://github.com/vespa-engine/pyvespa/blob/75c64ab144f98155387ff1f461632b889c19bd6e/vespa/package.py#L1490>
- <https://github.com/vespa-engine/pyvespa/blob/master/vespa/templates/services.xml>
- That's why we would need to manipulate the XML files ourselves (maybe with <https://docs.python.org/3/library/xml.etree.elementtree.html>)
- or simply wait for <https://github.com/vespa-engine/pyvespa/issues/520>
- Put an extra "indication" field into the schema
- Also must be included in the ranking expression, see <https://pyvespa.readthedocs.io/en/latest/getting-started-pyvespa.html#Define-ranking>
- Multi node Vespa example setup
- <https://github.com/vespa-engine/sample-apps/blob/master/examples/operations/multinode-HA/>
- Standalone logging server
- SigNoz <https://github.com/signoz/signoz>
- Loki <https://github.com/grafana/loki>
13 changes: 1 addition & 12 deletions compose/docker-compose.base.yml
@@ -7,13 +7,6 @@ x-app: &default-app
USE_DOCKER: 1
DJANGO_STATIC_ROOT: "/var/www/radis/static/"
DATABASE_URL: "psql://postgres:postgres@postgres.local:5432/postgres"
OPENSEARCH_ENABLED: "${OPENSEARCH_ENABLED-true}"
OPENSEARCH_HOST: "opensearch-node1.local"
OPENSEARCH_PORT: "9200"
VESPA_ENABLED: "${VESPA_ENABLED-false}"
VESPA_HOST: "vespa.local"
VESPA_CONFIG_PORT: "19071"
VESPA_DATA_PORT: "8080"
RABBITMQ_URL: "amqp://rabbit"
RABBIT_MANAGEMENT_HOST: "rabbit"
RABBIT_MANAGEMENT_PORT: "15672"
@@ -43,10 +36,6 @@ services:
<<: *default-app
hostname: worker_default.local

worker_vespa:
<<: *default-app
hostname: worker_vespa.local

worker_llm:
<<: *default-app
hostname: worker_llm.local
@@ -65,7 +54,7 @@ services:
"
postgres:
image: postgres:16.2
image: postgres:16
hostname: postgres.local
volumes:
- postgres_data:/var/lib/postgresql/data
62 changes: 0 additions & 62 deletions compose/docker-compose.dev.yml
@@ -43,14 +43,6 @@ services:
wait-for-it -s postgres.local:5432 -t 60 &&
./manage.py migrate &&
./manage.py collectstatic --no-input &&
if [[ $$OPENSEARCH_ENABLED == true ]]; then
wait-for-it -s opensearch-node1.local:9200 -t 60 &&
./manage.py opensearch --mappings dev
fi
if [[ $$VESPA_ENABLED == true ]]; then
wait-for-it -s vespa.local:19071 -t 60 &&
./manage.py vespa --generate --deploy
fi
./manage.py populate_users_and_groups --users 20 --groups 3 &&
./manage.py populate_reports --report-language de &&
./manage.py runserver 0.0.0.0:8000
@@ -61,12 +53,6 @@ services:
command: |
./manage.py celery_worker -c 1 -Q default_queue --autoreload
worker_vespa:
<<: *default-app
command: |
./manage.py celery_worker -c 1 -Q vespa_queue --autoreload
profiles: ["vespa"]

worker_llm:
<<: *default-app
command: |
@@ -103,55 +89,7 @@ services:
environment:
POSTGRES_PASSWORD: "postgres"

# https://opensearch.org/docs/latest/install-and-configure/install-opensearch/docker/#sample-docker-composeyml
opensearch_node1:
image: opensearchproject/opensearch:latest
hostname: opensearch-node1.local
volumes:
- opensearch_data1:/usr/share/opensearch/data
environment:
discovery.type: single-node # https://github.com/gitpod-io/gitpod/issues/8399
OPENSEARCH_JAVA_OPTS: "-Xms512m -Xmx512m"
DISABLE_SECURITY_PLUGIN: "true"
ports:
- 9200:9200 # REST API
- 9600:9600 # Performance Analyzer
profiles: ["opensearch"]

opensearch_dashboards:
image: opensearchproject/opensearch-dashboards:latest
hostname: opensearch-dashboards.local
ports:
- 5601:5601
environment:
DISABLE_SECURITY_DASHBOARDS_PLUGIN: "true"
OPENSEARCH_HOSTS: '["http://opensearch-node1.local:9200"]'
profiles: ["opensearch"]

vespa:
image: vespaengine/vespa:8
hostname: vespa.local
healthcheck:
test: curl http://localhost:19071/state/v1/health
timeout: 10s
retries: 3
start_period: 40s
volumes:
- vespa_data:/opt/vespa/var
- vespa_logs:/opt/vespa/logs
environment:
# Reduce memory usage of Vespa during development (espcially on Cloud IDEs), see
# https://docs.vespa.ai/en/operations/node-setup.html#memory-settings
VESPA_CONFIGSERVER_JVMARGS: "-Xms32M -Xmx128M"
VESPA_CONFIGPROXY_JVMARGS: "-Xms32M -Xmx32M"
ports:
- 9620:8080
profiles: ["vespa"]

volumes:
vscode-server:
vscode-server-insiders:
models_data:
opensearch_data1:
vespa_data:
vespa_logs:
63 changes: 0 additions & 63 deletions compose/docker-compose.prod.yml
@@ -28,10 +28,6 @@ services:
./manage.py collectstatic --no-input &&
./manage.py create_admin &&
./manage.py generate_cert &&
wait-for-it -s opensearch-node1.local:9200 -t 60 &&
./manage.py opensearch --mappings prod &&
# wait-for-it -s vespa.local:19071 -t 60 &&
# ./manage.py vespa --generate --deploy &&
./manage.py ok_server --host 0.0.0.0 --port 8000
"
deploy:
@@ -60,12 +56,6 @@ services:
deploy:
<<: *deploy

worker_vespa:
<<: *default-app
command: ./manage.py celery_worker -c 1 -Q vespa_queue
deploy:
<<: *deploy

worker_llm:
<<: *default-app
command: ./manage.py celery_worker -c 1 -Q llm_queue
@@ -111,56 +101,6 @@ services:
deploy:
<<: *deploy

opensearch_node1:
image: opensearchproject/opensearch:2
hostname: opensearch-node1.local
environment:
discovery.type: single-node # https://github.com/gitpod-io/gitpod/issues/8399
bootstrap.memory_lock: "true"
OPENSEARCH_JAVA_OPTS: "-Xms4g -Xmx4g"
DISABLE_SECURITY_PLUGIN: "true"
env_file:
- ../.env.prod
ulimits:
memlock:
soft: -1
hard: -1
nofile:
soft: 65536
hard: 65536
volumes:
- opensearch_data1:/usr/share/opensearch/data
ports:
- 9200:9200 # REST API
- 9600:9600 # Performance Analyzer
deploy:
<<: *deploy

opensearch-dashboards:
image: opensearchproject/opensearch-dashboards:2
ports:
- 5601:5601
environment:
OPENSEARCH_HOSTS: '["https://opensearch-node1.local:9200"]'
deploy:
<<: *deploy

vespa:
image: vespaengine/vespa:8
hostname: vespa.local
healthcheck:
test: curl http://localhost:19071/state/v1/health
timeout: 10s
retries: 3
start_period: 40s
volumes:
- vespa_data:/opt/vespa/var
- vespa_logs:/opt/vespa/logs
ports:
- 9620:8080
deploy:
replicas: 0

rabbit:
deploy:
<<: *deploy
@@ -171,6 +111,3 @@ services:

volumes:
models_data:
opensearch_data1:
vespa_data:
vespa_logs:
2 changes: 1 addition & 1 deletion doc/Maintenance.md
@@ -14,7 +14,7 @@ There are different things that can be upgraded:
- Other upgrades (e.g. major versions) must be upgraded by modifying the version range in `packages.json` before calling `npm update`
- After an upgrade make sure the files in `static/vendor` still link to the correct files in `node_modules`1
- Python and Poetry in `Dockerfile` that builds the container where RADIS runs in
- Dependent services in `docker-compose.base.yml`, like PostgreSQL or Vespa database
- Dependent services in `docker-compose.base.yml`, like PostgreSQL
- Gitpod development container dependencies in `.gitpod.Dockerfile`
- Github Codespaces development container dependencies in `.devcontainer/devcontainer.json` and `.devcontainer/Dockerfile`
- Github actions `.github/workflows/ci.yml` dependencies
1 change: 0 additions & 1 deletion example.env
@@ -21,7 +21,6 @@ FORCE_DEBUG_TOOLBAR=false
LLM_MODEL_URL=https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q2_K.gguf
OPENAI_API_KEY=
POSTGRES_PASSWORD=mysecret
OPENSEARCH_INITIAL_ADMIN_PASSWORD=mysecret
SITE_BASE_URL=http://localhost:8000
SITE_DOMAIN=localhost
SITE_NAME=RADIS