Comprehensive rare disease knowledge base integrating multiple authoritative databases with AI-powered semantic literature search.
RDEvidence integrates data from:
- 🧬 Orphanet - Expert rare disease database
- 🧪 MONDO - Disease ontology
- 📊 HPO - Human Phenotype Ontology
- 💊 MAxO - Medical Action Ontology
- 🔬 ClinVar - Genetic variant database
- 🏥 ClinicalTrials.gov - Clinical trials
- 📚 PubMed/PubTator3 - Biomedical literature (85,000+ papers)
- RAG-Powered Search: Semantic search across 85K+ papers using BioBERT embeddings
- Entity Annotation: Automatic highlighting of genes, diseases, chemicals, variants via PubTator3
- Multi-Database Integration: Unified search across rare disease resources
- Clinical Trials: Find relevant trials by disease, intervention, location
- Medical Actions: Evidence-based treatment recommendations
- Variant Analysis: ClinVar variant lookup and interpretation
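To illustrate the idea behind the semantic search feature, here is a minimal sketch of ranking papers by cosine similarity between a query vector and paper vectors. It is not the RDEvidence implementation: the function names and the 3-dimensional toy vectors are hypothetical stand-ins for real BioBERT embeddings, and a real deployment would query the vector database instead of a Python dict.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_papers(query_vec, paper_vecs, top_k=2):
    """Return the top_k (paper_id, score) pairs, most similar first."""
    scored = [(pid, cosine_similarity(query_vec, vec))
              for pid, vec in paper_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings (illustrative only).
papers = {
    "paper_a": [0.9, 0.1, 0.0],
    "paper_b": [0.0, 1.0, 0.0],
    "paper_c": [0.7, 0.7, 0.0],
}
query = [1.0, 0.0, 0.0]

top = rank_papers(query, papers)
print([pid for pid, _ in top])
```

Papers whose vectors point in nearly the same direction as the query score close to 1, so `paper_a` ranks ahead of `paper_c`, and `paper_b` (orthogonal to the query) is dropped.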
RDEvidence/
├── frontend/                        # Web interface
│   └── index.html                   # Main HTML file (PubTator3 integrated)
├── backend/                         # Python Flask API
│   ├── complete_backend_pubtator.py # Main backend server
│   └── requirements.txt             # Python dependencies
├── scripts/                         # Database building utilities
│   ├── merge_clinvar_orpha_mondo_hpo_literature.py
│   ├── build_vectordb_from_merged.py
│   ├── test_vectordb.py
│   └── check_rag.py
└── docs/                            # Documentation
    └── setup.md
- Python 3.11+
- 8GB+ RAM (for vector database)
- 2GB+ disk space
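The Python version and free disk space requirements above can be checked before installing. The following is a small sketch using only the standard library; the function names are hypothetical and not part of the repository. The RAM requirement is not checked, because the standard library has no portable way to read total memory.

```python
import shutil
import sys

MIN_PYTHON = (3, 11)      # from the requirements list above
MIN_FREE_DISK_GB = 2      # from the requirements list above


def check_python(version_info=sys.version_info):
    """True if the interpreter is at least Python 3.11."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def check_disk(path=".", min_gb=MIN_FREE_DISK_GB):
    """True if the filesystem holding `path` has at least min_gb free."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= min_gb


if __name__ == "__main__":
    print("Python 3.11+:", "ok" if check_python() else "too old")
    print("2GB+ free disk:", "ok" if check_disk() else "insufficient")
```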
- Clone the repository:
git clone https://github.com/wangjl99/RDEvidence.git
cd RDEvidence
- Install Python dependencies:
cd backend
pip install -r requirements.txt
- Set up the database:
Option A: Contact author for pre-built database
Option B: Build from source data
cd scripts
# Step 1: Merge data sources
python merge_clinvar_orpha_mondo_hpo_literature.py
# Step 2: Build vector database (4-6 hours)
python build_vectordb_from_merged.py
# Step 3: Verify
python test_vectordb.py
- Start the backend:
cd backend
python complete_backend_pubtator.py
Backend will run at: http://localhost:5000
- Open the frontend:
cd frontend
# Open index.html in your browser
# Or serve with: python -m http.server 8000
Update paths in backend/complete_backend_pubtator.py if needed:
# Database paths
CHROMA_DB_PATH = "./literature_vectordb"
DATA_DIR = "./data"
Update API endpoint in frontend/index.html (line ~1380):
// For local development:
const API_BASE = 'http://localhost:5000';
// For production deployment, replace the line above with:
// const API_BASE = 'https://your-api-url.com';
The current database contains:
- 85,762 papers with BioBERT embeddings
- Integrated data from Orphanet, MONDO, HPO, ClinVar
- Full PubMed abstracts with PubTator3 annotations
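As a rough back-of-the-envelope check on these numbers, the raw size of the embeddings can be estimated from the paper count. This sketch rests on three assumptions that the README does not state: one embedding per paper, 768 dimensions (the hidden size of BioBERT-base), and float32 storage. Index structures, metadata, and abstract text stored alongside the vectors are not counted.

```python
NUM_PAPERS = 85_762      # from this README
EMBEDDING_DIM = 768      # assumption: BioBERT-base hidden size
BYTES_PER_VALUE = 4      # assumption: vectors stored as float32


def embedding_bytes(num_vectors, dim, bytes_per_value):
    """Raw storage for the vectors alone, in bytes."""
    return num_vectors * dim * bytes_per_value


total = embedding_bytes(NUM_PAPERS, EMBEDDING_DIM, BYTES_PER_VALUE)
print(f"{total / 1024**2:.1f} MiB")  # raw vectors only
```

Under these assumptions the vectors alone come to roughly 251 MiB, which sits comfortably inside the 2GB+ disk requirement.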
Database is built from:
- master_literature_results_mysql_input.tsv (source data, not in repo)
- Merged with ClinVar, Orphanet, MONDO, HPO ontologies
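Once the backend is running, it can also be queried from Python. The sketch below is not taken from the repository: the helper names are hypothetical, and the `/diseases` path and `query` parameter are taken from the curl example in this README. The response format is not documented here, so the body is returned as plain text instead of being parsed.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "http://localhost:5000"  # local backend address from this README


def build_disease_url(query, api_base=API_BASE):
    """Build the /diseases URL with a properly encoded query string."""
    return f"{api_base}/diseases?{urlencode({'query': query})}"


def fetch_diseases(query, timeout=10):
    """Return the raw response body; requires the backend to be running."""
    with urlopen(build_disease_url(query), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(build_disease_url("Bardet-Biedl"))
```

Building the URL with `urlencode` keeps disease names that contain spaces or other special characters safe to send.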
# Check if vector database is working
python scripts/check_rag.py
# Test backend endpoints
curl "http://localhost:5000/diseases?query=Bardet-Biedl"
If you use RDEvidence in your research, please cite:
@software{rdevidence2024,
title = {RDEvidence: AI-Powered Rare Disease Knowledge Platform},
author = {Wang, Jing},
year = {2024},
url = {https://github.com/wangjl99/RDEvidence}
}
MIT License - see LICENSE file for details
Contributions welcome! Please open an issue first to discuss proposed changes.
- GitHub: @wangjl99
- Repository: https://github.com/wangjl99/RDEvidence
- Orphanet for rare disease data
- NCBI for PubMed and PubTator3
- ClinVar for variant data
- BioBERT team for embedding model
- MONDO, HPO, MAxO ontology teams
Last Updated: December 2024