Disease Pathophysiology Knowledge Base (FOR DEMO PURPOSES)
This repo contains a mostly-automated demo KB of diseases (pathophysiology, treatments, etiology, etc.), generated using DRAGON-AI/CurateGPT.
The KB is created via a cycle:
- A human expert creates one or two seed entries
- New entries are created from the latent knowledge of the LLM
- PubMed is searched for supporting/refuting evidence on a per-assertion basis
- An LLM acts as a critic, guided by the human, to continually refine entries
https://monarch-initiative.github.io/dppkb
Click on "Diseases" to browse the knowledge base. You will see a highly generic rendering of the auto-generated disease entries.
This is an experiment in using CurateGPT for de novo human-driven knowledge base curation.
The general workflow is:
- A human writes some sample YAML files for a few entries (see the sketch after this list)
    - the schema can be invented "on the fly"
- Iterate using claude.ai
    - ask it to suggest other fields
    - use as a template to create more entries
- Save as a .yaml file
- Iterate with curate-gpt
    - the `complete` command will generate a new entry
    - the `citeseek` command will add support/refute evidence from PubMed
    - the `update` command will enrich specific fields
    - the `review` command will use the LLM as a critic and suggest changes
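For illustration, a seed entry might look something like this (field names here are hypothetical; the actual schema is whatever the curator invents on the fly):

```yaml
# Illustrative seed entry; hypothetical field names, not necessarily the dppkb schema
- name: Tuberculosis
  description: Infectious disease caused by Mycobacterium tuberculosis, primarily affecting the lungs
  etiology:
    - Infection with Mycobacterium tuberculosis
  pathophysiology:
    - Inhaled bacilli are phagocytosed by alveolar macrophages
    - Granuloma formation walls off, but does not always eliminate, the infection
  treatments:
    - Isoniazid
    - Rifampin
```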
The main KB lives in `kb/dppkb.yaml`.
Run:

```
make index
```

This should be run periodically; it builds a local ChromaDB index that is used for RAG.
Note: this loads a pre-processed version of the KB that has the evidence removed; we want to hide evidence when doing RAG, to avoid the LLM hallucinating publications.
Run this:

```
make tmp/complete-Tuberculosis.yaml
```
This uses RAG/DRAGON-AI to make a candidate entry. You can then copy this into `kb/dppkb.yaml`, tweak it manually, or ask Claude to tweak it.
The idea is that as the KB is incrementally built up with high-quality examples, there will be less need for manual tweaking; RAG alone will be good enough.
Also recall that entries can be enhanced in later steps.
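Presumably any disease name can be substituted in the target, assuming the Makefile defines a pattern rule for `tmp/complete-%.yaml`:

```
make tmp/complete-Malaria.yaml
```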
NOTE: This step does not use PubMed directly. We are relying on the fact that the LLM has already ingested and compressed the literature, and can do a pretty good first-pass job of re-exporting it in any format we like. It doesn't have to be perfect; subsequent steps are designed to refine it.
Next, run:

```
make tmp/with-evidence.yaml
```

This will run CurateGPT `citeseek` over all assertions; for any assertion without an evidence tag, it will query PubMed for supporting/refuting evidence.
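The exact shape of the output depends on the CurateGPT version; as a rough sketch, an assertion might gain an evidence block along these lines (the structure and PMID are illustrative placeholders, not actual output):

```yaml
pathophysiology:
  - statement: Granuloma formation walls off the infection
    evidence:
      - reference: PMID:12345678  # placeholder, not a real citation
        type: SUPPORT
```

These per-assertion evidence blocks are what gets stripped out before `make index` builds the RAG index.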
It is recommended to periodically inspect the file in the role of lead curator, and to ask for reviews.
Either global reviews:

```
curategpt review --model gpt-4o -p db -c disease "{}" -t patch --primary-key name > tmp/review.patch.yaml
```
Or focused reviews, e.g. if you want the pathophysiology field to be fleshed out:

```
curategpt -vv review --model gpt-4o -p db -c disease "{}" -Z pathophysiology -P name -t patch --primary-key name --rule "include as many mechanisms and molecular steps as you can" > tmp/pathophys-review.yaml
```
The result is a patch file. This can be manually examined, edited, and applied:

```
curategpt apply-patch --patch tmp/patch.yaml --primary-key name kb/dppkb.yaml > tmp/patched.kb.yaml
```
Do a diff, then move the patched file into place.
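For example, a minimal sketch using the filenames from the commands above:

```
diff kb/dppkb.yaml tmp/patched.kb.yaml
mv tmp/patched.kb.yaml kb/dppkb.yaml
```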
There are different ways to write the same YAML. Ensure the KB representation is normalized:

```
make normalize
```
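For example, flow style and block style encode the same data; normalization converges on one canonical form (the exact normal form is whatever `make normalize` emits; this is just an illustration):

```yaml
# Before normalization (flow style):
#   treatments: [Isoniazid, Rifampin]
# After normalization (block style):
treatments:
  - Isoniazid
  - Rifampin
```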
Currently we use labels rather than IDs, as these are easier both for humans reviewing the YAML and for LLMs.
Grounding is expected to be trivial and highly reliable; adding a simple `mappings` field to every entry is a TODO.
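A minimal sketch of what such a field might look like (the MONDO ID shown is illustrative):

```yaml
- name: Tuberculosis
  mappings:
    - MONDO:0018076  # illustrative ontology mapping
```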
Run:

```
make app
```

This will launch a Streamlit app where you can chat with the KB, visualize clusters, etc.
In the app you can ask a question and see the results clustered.