RDEvidence - Rare Disease Knowledge Platform

Comprehensive rare disease knowledge base integrating multiple authoritative databases with AI-powered semantic literature search.

🎯 Overview

RDEvidence integrates data from:

  • 🧬 Orphanet - Expert rare disease database
  • 🧪 MONDO - Disease ontology
  • 📊 HPO - Human Phenotype Ontology
  • 💊 MAxO - Medical Action Ontology
  • 🔬 ClinVar - Genetic variant database
  • 🏥 ClinicalTrials.gov - Clinical trials
  • 📚 PubMed/PubTator3 - Biomedical literature (85,000+ papers)

✨ Features

  • RAG-Powered Search: Semantic search across 85K+ papers using BioBERT embeddings
  • Entity Annotation: Automatic highlighting of genes, diseases, chemicals, variants via PubTator3
  • Multi-Database Integration: Unified search across rare disease resources
  • Clinical Trials: Find relevant trials by disease, intervention, location
  • Medical Actions: Evidence-based treatment recommendations
  • Variant Analysis: ClinVar variant lookup and interpretation
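As a rough illustration of what the semantic search feature does, the sketch below ranks papers by cosine similarity between a query vector and stored paper vectors. It is a toy under stated assumptions, not the project's implementation: the three-dimensional vectors and paper IDs are made up, and the function names are hypothetical. In the real system the vectors would be BioBERT embeddings held in the vector database.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_papers(query_vec, paper_vecs, top_k=3):
    """Return the top_k (paper_id, score) pairs, most similar first."""
    scored = [(pid, cosine_similarity(query_vec, vec))
              for pid, vec in paper_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Made-up vectors standing in for real embeddings (illustration only).
TOY_PAPERS = {
    "paper_a": [0.9, 0.1, 0.0],
    "paper_b": [0.0, 1.0, 0.0],
    "paper_c": [0.7, 0.7, 0.0],
}

if __name__ == "__main__":
    for pid, score in rank_papers([1.0, 0.0, 0.0], TOY_PAPERS, top_k=2):
        print(pid, round(score, 3))
```

A vector database performs the same kind of nearest-neighbour ranking, typically with an index so that it does not have to compare the query against every stored vector.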

📁 Repository Structure

RDEvidence/
├── frontend/                  # Web interface
│   └── index.html            # Main HTML file (PubTator3 integrated)
├── backend/                   # Python Flask API
│   ├── complete_backend_pubtator.py  # Main backend server
│   └── requirements.txt      # Python dependencies
├── scripts/                   # Database building utilities
│   ├── merge_clinvar_orpha_mondo_hpo_literature.py
│   ├── build_vectordb_from_merged.py
│   ├── test_vectordb.py
│   └── check_rag.py
└── docs/                      # Documentation
    └── setup.md

🚀 Quick Start

Prerequisites

  • Python 3.11+
  • 8GB+ RAM (for vector database)
  • 2GB+ disk space
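The Python version and free disk space can be checked before installing. The following is a minimal sketch using only the standard library; the function names are hypothetical, and the RAM requirement is not checked because the standard library has no portable way to query it.

```python
import shutil
import sys

MIN_PYTHON = (3, 11)   # from the prerequisites list
MIN_FREE_DISK_GB = 2   # from the prerequisites list


def python_ok(version_info=sys.version_info):
    """True if the interpreter's (major, minor) is at least MIN_PYTHON."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def disk_ok(path=".", min_gb=MIN_FREE_DISK_GB):
    """True if the filesystem holding `path` has at least min_gb free."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= min_gb


if __name__ == "__main__":
    print("Python 3.11+:", "ok" if python_ok() else "too old")
    print("2GB+ free disk:", "ok" if disk_ok() else "insufficient")
```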

Installation

  1. Clone the repository:
git clone https://github.com/wangjl99/RDEvidence.git
cd RDEvidence
  2. Install Python dependencies (from the repository root):
cd backend
pip install -r requirements.txt
  3. Set up the database:

⚠️ Note: The vector database (~500MB-1GB) is not included due to GitHub size limits.

Option A: Contact the author for a pre-built database

Option B: Build from source data

cd scripts  # from the repository root

# Step 1: Merge data sources
python merge_clinvar_orpha_mondo_hpo_literature.py

# Step 2: Build vector database (takes 4-6 hours)
python build_vectordb_from_merged.py

# Step 3: Verify
python test_vectordb.py
  4. Start the backend (from the repository root):
cd backend
python complete_backend_pubtator.py

The backend runs at: http://localhost:5000

  5. Open the frontend (from the repository root):
cd frontend
# Open index.html in your browser
# Or serve with: python -m http.server 8000
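Because the vector database build in Option B takes hours, it can help to run the three scripts from one driver that stops at the first failure. The sketch below is one way to do that, not part of the repository: `run_steps` is a hypothetical helper, and it assumes it is launched from the repository root and that each script signals failure with a non-zero exit code.

```python
import subprocess
import sys
from pathlib import Path

# Script names taken from Option B above, in the order they must run.
BUILD_STEPS = [
    "merge_clinvar_orpha_mondo_hpo_literature.py",
    "build_vectordb_from_merged.py",
    "test_vectordb.py",
]


def run_steps(steps, scripts_dir, runner=subprocess.run):
    """Run each script in order and stop at the first non-zero exit code.

    Returns the list of scripts that completed successfully.
    """
    completed = []
    for script in steps:
        # Use the current interpreter so the active virtualenv is respected.
        result = runner([sys.executable, script], cwd=scripts_dir)
        if result.returncode != 0:
            print(f"{script} exited with code {result.returncode}; stopping.")
            break
        completed.append(script)
    return completed


if __name__ == "__main__":
    scripts_dir = Path("scripts")  # assumption: cwd is the repository root
    done = run_steps(BUILD_STEPS, scripts_dir)
    sys.exit(0 if len(done) == len(BUILD_STEPS) else 1)
```

The `runner` parameter exists only so the function can be exercised without actually launching the scripts.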

🔧 Configuration

Backend Configuration

Update paths in backend/complete_backend_pubtator.py if needed:

# Database paths
CHROMA_DB_PATH = "./literature_vectordb"
DATA_DIR = "./data"
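Relative paths such as `./literature_vectordb` are resolved against the directory the server is started from, not the directory the script lives in. If that causes trouble, one option is to anchor them to the script's own location and allow an override from the environment. This is a sketch only: `resolve_path` and the environment variable names `RDE_CHROMA_DB_PATH` and `RDE_DATA_DIR` are hypothetical and do not exist in the backend.

```python
import os
from pathlib import Path


def resolve_path(default, env_var, base_dir):
    """Return an absolute path.

    The value comes from the environment variable `env_var` if it is set,
    otherwise from `default`. Relative values are resolved against base_dir.
    """
    raw = os.environ.get(env_var, default)
    path = Path(raw).expanduser()
    if not path.is_absolute():
        path = Path(base_dir) / path
    return path.resolve()


if __name__ == "__main__":
    # Assumption: this file sits next to complete_backend_pubtator.py.
    backend_dir = Path(__file__).resolve().parent
    chroma_db_path = resolve_path("./literature_vectordb", "RDE_CHROMA_DB_PATH", backend_dir)
    data_dir = resolve_path("./data", "RDE_DATA_DIR", backend_dir)
    print("Vector DB:", chroma_db_path, "(exists)" if chroma_db_path.exists() else "(missing)")
    print("Data dir: ", data_dir, "(exists)" if data_dir.exists() else "(missing)")
```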

Frontend Configuration

Update the API endpoint in frontend/index.html (around line 1380). Keep exactly one of the two declarations active:

// For local development:
const API_BASE = 'http://localhost:5000';

// For production deployment, use this line instead:
// const API_BASE = 'https://your-api-url.com';

📊 Database Information

Current database contains:

  • 85,762 papers with BioBERT embeddings
  • Integrated data from Orphanet, MONDO, HPO, ClinVar
  • Full PubMed abstracts with PubTator3 annotations

The database is built from:

  • master_literature_results_mysql_input.tsv (source data, not in repo)
  • Merged with ClinVar, Orphanet, MONDO, and HPO data

🧪 Testing

# Check if vector database is working
python scripts/check_rag.py

# Test backend endpoints (quote the URL so the shell does not interpret "?")
curl "http://localhost:5000/diseases?query=Bardet-Biedl"
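The same endpoint check can be scripted in Python, which also takes care of URL-encoding disease names that contain spaces. This is a minimal sketch using only the standard library. The `/diseases?query=` route comes from the curl example above; the helper names are hypothetical, and because the response format is not documented here, the body is printed as raw text.

```python
import urllib.error
import urllib.parse
import urllib.request


def build_disease_url(base_url, query):
    """Build the /diseases search URL with a properly encoded query string."""
    params = urllib.parse.urlencode({"query": query})
    return f"{base_url.rstrip('/')}/diseases?{params}"


def fetch(url, timeout=10):
    """Return (status, body). Status is None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return err.code, err.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError) as err:
        # Connection refused, DNS failure, timeout, and similar.
        return None, str(err)


if __name__ == "__main__":
    url = build_disease_url("http://localhost:5000", "Bardet-Biedl")
    status, body = fetch(url)
    print("GET", url)
    print("status:", status)
    print(body[:500])  # print only the start of the body
```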

📝 Citation

If you use RDEvidence in your research, please cite:

@software{rdevidence2024,
  title = {RDEvidence: AI-Powered Rare Disease Knowledge Platform},
  author = {Wang, Jing},
  year = {2024},
  url = {https://github.com/wangjl99/RDEvidence}
}

📄 License

MIT License - see LICENSE file for details

🤝 Contributing

Contributions welcome! Please open an issue first to discuss proposed changes.

📧 Contact

🙏 Acknowledgments

  • Orphanet for rare disease data
  • NCBI for PubMed and PubTator3
  • ClinVar for variant data
  • BioBERT team for embedding model
  • MONDO, HPO, MAxO ontology teams

Last Updated: December 2024
