vemonet/concept-resolver

A name resolution service for biomedical concepts, using vector databases and similarity search

Problem Statement

Resolving concept labels to standardized identifiers from existing databases is a fundamental requirement when annotating biomedical data. Several annotation services are available, including BioPortal and the Translator Name Resolution service, but most of them rely on straightforward matching mechanisms (mgrep and Solr, respectively). These mechanisms often fall short when a concept label deviates substantially from the standardized names, or when dealing with synonyms.

Approach

We propose to explore the use of vector similarity search to improve the accuracy of concept resolution. We will leverage the extensive dataset gathered by the Translator Babel project, which includes a vast repository of identifiers, labels, and synonyms from the biomedical domain (PubChem, CHEMBL, UniProt, MONDO, OMIM, HGNC, DrugBank, and more).

Objectives

During the Biomedical Linked Annotation Hackathon, our key objectives are as follows:

  1. Choosing a vector database and text embedding model: we will evaluate the available open-source vector databases and text embedding models to choose one of each that fits our needs. We might also choose several and compare their results.
  2. Data ingestion: we will establish a workflow to generate embeddings and ingest the data from the Translator Babel project into a vector database. This database will serve as the foundation for our name resolution service.
  3. Vector similarity search: we will implement a service that enables users to retrieve potential identifiers for a given concept label, along with scores indicating the degree of confidence. This service will use the similarity search implementation of the vector database.
  4. Evaluation: we will look into existing datasets to benchmark the efficiency of our approach and compare it to existing services.
  5. Exploring use cases: in addition to concept resolution, we will explore a range of potential use cases that can benefit from the vector database. These may include synonym discovery, concept mapping, and concept recommendation.
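As a rough illustration of objectives 2 and 3, the sketch below indexes one vector per label and ranks candidates by cosine similarity. It uses only the standard library: `toy_embed` is a hypothetical stand-in (hashed character trigrams) for a real text embedding model, the in-memory `INDEX` stands in for a vector database, and the `EX:` identifiers are made up for the example.

```python
import math
import zlib
from typing import Dict, List, Tuple

DIM = 256  # size of the toy vectors; real embedding models define their own


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for an embedding model: hashed character trigrams, L2-normalized."""
    vec = [0.0] * DIM
    padded = f"  {text.lower()} "
    for i in range(len(padded) - 2):
        # crc32 is used instead of hash() so the vectors are stable across runs
        vec[zlib.crc32(padded[i:i + 3].encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


# Made-up concepts: identifier -> labels and synonyms
CONCEPTS: Dict[str, List[str]] = {
    "EX:1": ["aspirin", "acetylsalicylic acid"],
    "EX:2": ["paracetamol", "acetaminophen"],
}

# One indexed point per label, each pointing back to its concept identifier
INDEX: List[Tuple[str, str, List[float]]] = [
    (curie, label, toy_embed(label))
    for curie, labels in CONCEPTS.items()
    for label in labels
]


def resolve(query: str, limit: int = 3) -> List[dict]:
    """Return the `limit` closest labels; vectors are unit length, so the dot product is the cosine."""
    q = toy_embed(query)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), curie, label)
        for curie, label, vec in INDEX
    ]
    scored.sort(reverse=True)
    return [
        {"curie": curie, "label": label, "score": round(score, 3)}
        for score, curie, label in scored[:limit]
    ]


if __name__ == "__main__":
    for match in resolve("asprin"):  # misspelled on purpose
        print(match)
```

A real deployment would replace `toy_embed` with one of the embedding models evaluated in this project and delegate the ranking to the similarity search of the vector database.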

The name resolution service will be exposed as an OpenAPI-described API that takes a concept label as input and returns a list of matching entities, each represented by a dictionary containing the score, the CURIE identifier, the label, and the synonyms.
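The shape of such a response could look like the following sketch. The field names (`curie`, `label`, `score`, `synonyms`) and the `to_response` helper are assumptions made for illustration, not the actual schema of the service, and the identifiers are made up.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Match:
    """One candidate entity for a queried label (hypothetical field names)."""
    curie: str
    label: str
    score: float
    synonyms: List[str] = field(default_factory=list)


def to_response(matches: List[Match]) -> str:
    """Serialize matches as a JSON list, best score first."""
    ranked = sorted(matches, key=lambda m: m.score, reverse=True)
    return json.dumps([asdict(m) for m in ranked], indent=2)


if __name__ == "__main__":
    print(to_response([
        Match("EX:2", "paracetamol", 0.41, ["acetaminophen"]),
        Match("EX:1", "aspirin", 0.93, ["acetylsalicylic acid"]),
    ]))
```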

Vector databases

| Name | Creation | GitHub stars | Written in | SDK for | Query language/API* | Vector distance functions | Comment |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qdrant | July 2020 | ~14k | Rust | Python, JS, Rust, Go, .NET | OpenAPI, gRPC | cosine, euclid, dot | Can be used as a local standalone tool, in memory or persistent on disk, without deploying a web service |
| Milvus | October 2019 | ~24k | Go | Python, JS, Java, Go | OpenAPI ❓️ | cosine, euclid, inner product | aka. Zilliz cloud |
| Chroma | October 2022 | ~9k | Python | Python, JS | OpenAPI ❓️ | | |
| Weaviate | March 2016 | ~8k | Go | Python, JS, Java, Go | GraphQL API | cosine, euclid | |
| pgvector | April 2021 | ~6.5k | C | Through Postgres SDK ❓️ | SQL | cosine, euclid, inner product, taxicab | Integrated in PostgreSQL |

*Query language/API specifies which type of query language or API can be used to query the information inside the vector database

All of these products are open source, and each offers a simple web UI to explore the vector database.

Most of them have a modern and simple API (apart from pgvector, which lives within PostgreSQL).
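For reference, the distance functions named in the comparison above (cosine, euclid, dot or inner product, taxicab) can be written in a few lines of plain Python. This is only a sketch of the definitions: each database ships its own optimized implementation, and the function names here are chosen for the example.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot (inner) product: larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; ignores vector length."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line (L2) distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def taxicab(a: Sequence[float], b: Sequence[float]) -> float:
    """Taxicab (L1, Manhattan) distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    query = [1.0, 0.0, 1.0]
    candidate = [2.0, 0.0, 2.0]  # same direction as query, twice the length
    print("cosine   ", cosine_similarity(query, candidate))
    print("dot      ", dot(query, candidate))
    print("euclidean", euclidean(query, candidate))
    print("taxicab  ", taxicab(query, candidate))
```

When vectors are normalized to unit length, cosine similarity and the dot product give the same ranking, which is why some databases treat them interchangeably for normalized embeddings.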

Text embedding models

Reference benchmark for text embedding models: https://huggingface.co/blog/mteb

Leaderboard: https://huggingface.co/spaces/mteb/leaderboard

Popular embedding models:

  • FlagEmbedding bge-large-en-v1.5
  • OpenAI text-embedding-ada-002
  • HuggingFace sentence-transformers/all-MiniLM-L6-v2
  • Jina AI jina-embeddings-v2-base-en
  • Cohere embed-english-v3.0

Benchmark dataset

To be defined.

Existing benchmarks for Vector databases:

Biomedical data Benchmark
Mapping issues in Name Resolution service

Preliminary results as of 19/01/2024 (Babel synonyms not fully loaded yet; the files after Drug are still missing: gene, protein, organisms, pathway, umls): most issues seem to be resolved, apart from "Rat" and "acp-044 dose a" (no timeout, but no interesting results).

Run the project

Start services:

docker compose up -d

Get into the workspace container to run the loading scripts.

Download the Babel synonyms and load them in the vectordb:

make load
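The actual loading logic lives in the repository's scripts. As a rough illustration only, the sketch below turns synonym records into one point per label and groups them into batches for embedding. It assumes each input line is a JSON object with `curie` and `names` keys; that layout, the helper names, and the `EX:` identifiers are assumptions for the example, not a description of the Babel files or of what `make load` runs.

```python
import io
import json
from typing import Dict, Iterable, Iterator, List


def iter_points(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one {"curie", "label"} point per name found in each JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for name in record.get("names", []):
            yield {"curie": record["curie"], "label": name}


def batched(items: Iterable[Dict[str, str]], size: int) -> Iterator[List[Dict[str, str]]]:
    """Group items into lists of at most `size`, so embeddings can be computed per batch."""
    batch: List[Dict[str, str]] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Made-up sample standing in for a synonyms file
sample = io.StringIO(
    '{"curie": "EX:1", "names": ["aspirin", "acetylsalicylic acid"]}\n'
    '{"curie": "EX:2", "names": ["paracetamol"]}\n'
)
batches = list(batched(iter_points(sample), size=2))

if __name__ == "__main__":
    for batch in batches:
        print(batch)
```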

(experimental) Load PubDictionaries in pgvector:

python src/pubdict_load.py

Current limitations

  1. Current self-hosted vector databases don't support multiple vectors for a single point. This forces us to create a separate point for each synonym and to deduplicate the results at lookup time, which prevents us from properly using the limit feature of the vectordb: if the first 2 results from the vectordb belong to the same concept, we return only 1 result, which does not match the limit of 2 requested by the user.

Possible solution would be to use postgres and pgvector, with 2 tables (one for embeddings, one for concept infos) but that would make the system much more complex than a JSON store.

Is there any self-hosted vectordb that can support multiple unnamed vectors for a single point? (Qdrant currently only supports multiple named vectors which does not fit our use-case)

  2. For really large datasets such as the Babel synonym dataset, embedding can be quite CPU intensive. It took us ~18 weeks of CPU time to index 14 million labels.
  3. To match the original NameResolution functionalities, more work is needed to improve the order of the results (prefLabel matches should rank higher than matches on synonyms, preference by prefix/biolink types, etc.).
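One possible workaround for the limit problem described in the first limitation is to over-fetch: request more points than the user asked for, keep the best hit per concept, and fetch again with a larger window until enough distinct concepts are found. The sketch below is not what the service currently does. The `search` callable is a hypothetical stand-in for a vector database query that returns hits sorted by decreasing score, and the `EX:` identifiers are made up.

```python
from typing import Any, Callable, Dict, List

Hit = Dict[str, Any]  # expected keys: "curie", "label", "score"


def search_unique_concepts(
    search: Callable[[int], List[Hit]],
    limit: int,
    max_fetch: int = 1000,
) -> List[Hit]:
    """Return up to `limit` hits with distinct CURIEs, over-fetching when synonyms collide."""
    fetch = limit
    while True:
        hits = search(fetch)
        best: Dict[str, Hit] = {}
        for hit in hits:
            curie = hit["curie"]
            # Keep only the highest-scoring label for each concept
            if curie not in best or hit["score"] > best[curie]["score"]:
                best[curie] = hit
        exhausted = len(hits) < fetch  # the index has no more points to offer
        if len(best) >= limit or exhausted or fetch >= max_fetch:
            ranked = sorted(best.values(), key=lambda h: h["score"], reverse=True)
            return ranked[:limit]
        fetch = min(fetch * 2, max_fetch)


# Fake index, already sorted by score: two points share the same concept
POINTS: List[Hit] = [
    {"curie": "EX:1", "label": "aspirin", "score": 0.95},
    {"curie": "EX:1", "label": "acetylsalicylic acid", "score": 0.91},
    {"curie": "EX:2", "label": "paracetamol", "score": 0.80},
]

if __name__ == "__main__":
    for hit in search_unique_concepts(lambda n: POINTS[:n], limit=2):
        print(hit)
```

The trade-off is extra round trips to the database when many synonyms of one concept crowd the top of the results; `max_fetch` bounds that cost.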

Documents

Introduction presentation: https://docs.google.com/presentation/d/1_nTMF-ltHvYbbvfUSDxSdBEb0Wm_yr_BvNNt-IvLKtc/edit

PubDictionaries experiment: https://docs.google.com/document/d/1nipvy2ZhZedmf5bjcUzcbGZIfN22V9KpZfO4eTXL89M/edit

Conclusion presentation: https://docs.google.com/presentation/d/1sJeuo4oegNmaMTrvCAWb0TZJZR9SGnYH-EFwTjf99lg/edit

Preprint biohackrxiv paper: http://preview.biohackrxiv.org/papers/bdda0f94-f526-4f35-8768-8faf62d731fa/paper.pdf

Demo API: https://concept-resolver.137.120.31.102.nip.io
