Evaluate LLMs' ability to perform differential diagnosis for rare genetic diseases using medical-vignette-like prompts created with phenopacket2prompt.
To systematically assess an LLM's ability to perform differential diagnostic tasks, we use prompts programmatically created with phenopacket2prompt, thereby avoiding any patient-privacy issues. The underlying data are phenopackets located in phenopacket-store. We also developed a programmatic approach for scoring and grounding results, made possible by the ontological structure of the Mondo Disease Ontology.
Two main analyses are carried out:
- A benchmark of several large language models against Exomiser, the state-of-the-art tool for differential diagnostics. The bottom line: Exomiser clearly outperforms the LLMs, as reported in our preprint; the data are available on this zenodo.
- A comparison of GPT-4o's and Meditron3-70B's ability to carry out differential diagnosis when prompted in 10 different languages; results are described in this medRxiv preprint and on this zenodo (see also this related link for some more data).
```shell
poetry install
```

Note: if you are unfamiliar with Poetry, you can instead run

```shell
pip install .
```

and then activate the environment, omitting `poetry run` from the following instructions.
```shell
export OPENAI_API_KEY=<your key>
poetry run curategpt ontology index --index-fields label,definition,relationships -p stagedb -c ont_mondo -m openai: sqlite:obo:mondo
```

The embedding model is selected via the `-m` argument. For non-OpenAI models, we refer to the CurateGPT documentation.
```shell
cp data/config/default.yaml data/config/<your_model>.yaml
```
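After copying the default configuration, edit your copy to point at the model you want to evaluate. The keys shown below are purely illustrative; the authoritative set of fields is whatever appears in the `default.yaml` shipped in `data/config`, so adjust the values in your copy rather than inventing new keys.

```yaml
# Hypothetical example -- check data/config/default.yaml for the real schema.
model_name: meditron3-70b
```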
```shell
poetry run malco evaluate --config data/config/meditron3-70b.yaml
poetry run malco plot --config data/config/meditron3-70b.yaml
```
```shell
poetry run malco combine --dir data/results --lang ALL
poetry run malco combine --dir data/results
```