Add draft docs for MP HP mappings #963

Open
wants to merge 1 commit into
base: master
126 changes: 126 additions & 0 deletions docs/reference/mappings/mp_hp.md
@@ -0,0 +1,126 @@
## MP-HP mappings

The landscape of MP-HP mappings is somewhat complicated, because the mappings are generated and maintained by multiple resources, such as the [Monarch Initiative](https://monarchinitiative.org/) and [MGI](https://www.informatics.jax.org/). Here, you will learn the different ways in which MP-HP mappings are generated, and how to obtain them.

There are two major classes of MP-HP mappings: Semantic Entity Mappings (SEM) and Semantic Similarity Mappings (SSM).

_Semantic Entity Mappings (SEM)_: SEMs are correspondences between two entities which are qualified by explicit semantic relations (e.g. `skos:exactMatch`, the most typical, or `skos:broadMatch`, or more exotic ones like `semapv:crossSpeciesExactMatch`).
For example, HP:0001640 (Cardiomegaly) and MP:0000274 (enlarged heart) map to each other via `semapv:crossSpeciesExactMatch`.
We typically represent a number of SEMs using an [SSSOM file](https://mapping-commons.github.io/sssom/), for example:

| subject_id | predicate_id | object_id | mapping_justification | subject_label | object_label |
| --- | --- | --- | --- | --- | --- |
| HP:0009124 | skos:exactMatch | MP:0000003 | semapv:LexicalMatching | Abnormal adipose tissue morphology | abnormal adipose tissue morphology |
| HP:0008551 | skos:exactMatch | MP:0000018 | semapv:LexicalMatching | Microtia | small ears |
| HP:0000411 | skos:exactMatch | MP:0000021 | semapv:LexicalMatching | Protruding ear | prominent ears |
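
To make the table above concrete, here is a minimal sketch of reading such a file with the Python standard library. It assumes the usual SSSOM/TSV layout (optional `#`-prefixed metadata lines followed by a tab-separated table); the `mapping_set_id` URL and the helper name `read_sssom` are hypothetical and only serve the illustration.

```python
import csv

# Inline stand-in for an SSSOM/TSV file. The metadata URL is a placeholder.
SSSOM_TSV = "\n".join([
    "# mapping_set_id: https://example.org/hypothetical-hp-mp.sssom.tsv",
    "\t".join(["subject_id", "predicate_id", "object_id", "mapping_justification"]),
    "\t".join(["HP:0009124", "skos:exactMatch", "MP:0000003", "semapv:LexicalMatching"]),
    "\t".join(["HP:0008551", "skos:exactMatch", "MP:0000018", "semapv:LexicalMatching"]),
    "\t".join(["HP:0000411", "skos:exactMatch", "MP:0000021", "semapv:LexicalMatching"]),
])


def read_sssom(text):
    """Parse SSSOM/TSV text into a list of dicts, skipping '#' metadata lines."""
    data_lines = [
        line for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    return list(csv.DictReader(data_lines, delimiter="\t"))


mappings = read_sssom(SSSOM_TSV)

# Index HP -> MP for one predicate only; other predicates would be ignored.
exact = {
    m["subject_id"]: m["object_id"]
    for m in mappings
    if m["predicate_id"] == "skos:exactMatch"
}
print(exact)
```

For real work, a dedicated SSSOM library is likely a better choice than hand-rolled parsing, since it also handles the metadata block and CURIE prefixes.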

_Semantic Similarity Mappings (SSM)_: SSMs are correspondences that are qualified by a similarity score, typically between 0 and 1, rather than by an explicit mapping relation. For example, HP:0001640 (Cardiomegaly) and MP:0000274 (enlarged heart) have, depending on the semantic similarity algorithm used, a semantic similarity of 1.0 (entirely similar).
We typically represent a number of SSMs as a [Semantic Similarity Profile](https://incatools.github.io/ontology-access-kit/datamodels/similarity/index.html), for example:

| subject_id | subject_label | subject_source | object_id | object_label | object_source | ancestor_id | ancestor_label | ancestor_source | object_information_content | subject_information_content | ancestor_information_content | jaccard_similarity | cosine_similarity | dice_similarity | phenodigm_score |
| ---------- | --------------------------- | -------------- | ---------- | ---------------------------------------------------------- | ------------- | -------------- | -------------- | --------------- | -------------------------- | --------------------------- | ---------------------------- | ------------------ | ----------------- | --------------- | --------------- |
| HP:0100624 | Corpus cavernosum sclerosis | | MP:0009103 | abnormal penile bone morphology | | UPHENO:0003055 | | | | | 2.416 | 0.203 | | | 0.701 |
| HP:0100624 | Corpus cavernosum sclerosis | | MP:0011528 | abnormal placental labyrinth villi branching morphogenesis | | UPHENO:0003055 | | | | | 2.416 | 0.214 | | | 0.720 |
| HP:0100624 | Corpus cavernosum sclerosis | | MP:0003205 | testicular atrophy | | UPHENO:0002682 | | | | | 5.394 | 0.191 | | | 1.015 |
| HP:0100624 | Corpus cavernosum sclerosis | | MP:0009256 | enlarged corpus epididymis | | UPHENO:0002523 | | | | | 3.686 | 0.234 | | | 0.930 |
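
The relationship between the score columns can be checked with a short sketch. It assumes that `phenodigm_score` is the geometric mean of `jaccard_similarity` and `ancestor_information_content`, which is consistent with the values in the table above; the function name below is illustrative and not part of any library API.

```python
import math


def phenodigm_score(jaccard_similarity, ancestor_information_content):
    """Geometric mean of Jaccard similarity and ancestor information content.

    Assumption: this is how the phenodigm_score column above was derived.
    """
    return math.sqrt(jaccard_similarity * ancestor_information_content)


# (object_id, jaccard_similarity, ancestor_information_content, reported score)
# Values copied from the example table above.
rows = [
    ("MP:0009103", 0.203, 2.416, 0.701),
    ("MP:0011528", 0.214, 2.416, 0.720),
    ("MP:0003205", 0.191, 5.394, 1.015),
    ("MP:0009256", 0.234, 3.686, 0.930),
]

for object_id, jaccard, ic, reported in rows:
    computed = phenodigm_score(jaccard, ic)
    # Small differences are expected because the table values are rounded.
    print(f"{object_id}: computed={computed:.3f} reported={reported:.3f}")
```

Under this assumption the score is not bounded by 1, because information content can exceed 1 (see the MP:0003205 row).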

Consider a different set of examples, could you find one that had in it an exact or nearly exact match plus more distant matches. Looking at this I would not want to present any of these as options to a user. At least swap out one of the examples for abnormal penis morphology (MP:0005187) which is the closest MP term I could find

Is the assumption here that these are all using 'exact match' as the basis for the scores?

In the following, we describe what kind of MP-HP mappings exist and where to get them.

!!! warning "A Human phenotype is Not a Mammalian phenotype!?!"

    It is critical to understand that the community has made the pragmatic decision not to subsume the HPO (which is about human phenotypes) under the MP ontology (which is scoped as Mammalian), despite the fact that humans are clearly mammals.
    The rationale is that in the context of [uPheno](https://www.biorxiv.org/content/10.1101/2024.09.18.613276v1), the community sought to avoid any two phenotype ontologies "bleeding into each other".
    A full explanation is beyond the scope of this article, but in summary, the problem is that manual phenotype classification has always been fairly idiosyncratic (e.g. some groups subsume morphological abnormalities under physiological ones), and the community wanted to avoid one ontology inheriting the classification philosophy of another.

!!! info "What is a 'matching' phenotype?"

    Before you read on, you should remember that matching phenotypes is in many ways a "dark art".
    What does it mean for two phenotypes to "match"?
    Is it that the same quality (weight, size, amount) is a characteristic of (a) a homologous (anatomical) structure, (b) a "conserved" cellular pathway or physiology, or (c) a functionally equivalent behaviour (or similar)?
    There are good and bad reasons for each of these, and many use cases where each of these could be relevant.
    The phenotype ontology community is driven heavily by genetics concerns, so for them it is crucial that a match has a reasonable chance of corresponding to _shared molecular underpinnings_, i.e. that corresponding phenotypes are caused by the same (orthologous) genes.
    Therefore, the most beloved HP-MP matches are of types (a) and (b), while (c) is a bit of a grey area (_abnormal courting behaviour_ in mice can have a considerably different, even disjoint, genetic profile associated with it compared to _abnormal courting behaviour_ in frogs).

!!! warning "If you are looking for shared genetic underpinnings, most cross-species mappings are wrong."

    One of the most common misunderstandings about phenotype matching is the belief that a match magically reveals candidate genes across species.
    _This is generally wrong_.
    A worm tail is not the same as a mouse tail, even if we call it the same, and the genes involved in letting the mouse tail grow longer are likely not quite the same as those letting the tail of _C. elegans_ grow longer.
    However, it [has been shown](https://pmc.ncbi.nlm.nih.gov/articles/PMC3649640/) that there _is_ indeed a lot of signal, especially between species that are not too distant taxonomically, such as mice and humans, and this signal can be used effectively for a variety of applications, [such as clinical diagnostics](https://pmc.ncbi.nlm.nih.gov/articles/PMC7230372/).
    It is best to understand cross-species phenotype matches as _candidates for shared molecular underpinnings_ - and then expect them, in most cases, to fail as such.

### HP-MP Semantic Entity Mappings

!!! warning "'THE' HP-MP mapping does not exist"

    There is no _single_ HP-MP mapping out there.
    Several groups have created mappings, and there are some places where those mappings are collected and collated, but it is still important to understand their origin.

There are four approaches currently in use for determining MP-HP Semantic Entity Mappings.

1. [Manual curation](#manual): Usually for specific projects curators have been funded to manually link phenotypes in one ontology to the other.
2. [Lexical matching](#lexical): Without considering the ontologies, the labels of the phenotype terms in HP and MP are analysed, preprocessed and, if possible, matched.
3. [Logical matching](#logical): In essence, if two phenotype terms have equivalent logical definitions, they are considered a logical match.
4. [AI-assisted matching](#llm): LLMs have a good "grasp" of language and can bridge the gap between manual curation and lexical matching by their ability to handle lexical variation much better.

Below, we discuss all four mapping types and where to get them.

<a id="manual"></a>

#### Manually curated HP-MP mappings

Currently, most manual HP-MP mapping efforts are, or should be, coordinated in the [Mouse-Human Ontology Mapping Initiative (MHMI)](https://github.com/mapping-commons/mh_mapping_initiative). The two most important manual curation efforts are the MGI and the IMPC HP-MP mappings.

_The MGI HP-MP mappings_: Driven by [Mouse Genome Informatics (MGI)](https://www.informatics.jax.org/) at JAX, these mappings are curated manually to enable search on the MGI webpages ([source](https://doi.org/10.1093/genetics/iyae031)).
There are currently more than 1500 mappings, curated [here](https://github.com/mapping-commons/mh_mapping_initiative/blob/master/mappings/mp_hp_mgi_all.sssom.tsv).

_The IMPC HP-MP mappings_: Driven by the International Mouse Phenotyping Consortium (IMPC), these mappings are manually curated for the purpose of data integration and analysis. An example for the Eye Morphology related mappings can be [found here](https://github.com/mapping-commons/mh_mapping_initiative/blob/master/mappings/mp_hp_eye_impc.sssom.tsv).

<a id="lexical"></a>

#### Lexical matching

Lexical mappings are those that are determined by deterministic lexical rules, for example "exact lexical matches" after applying some explicit preprocessing rules (such as whitespace removal or case normalisation).
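
As a minimal sketch of this idea (not the actual uPheno or OAK implementation), the following matches labels exactly after case and whitespace normalisation. The function names and the choice of preprocessing rules are assumptions made for illustration; the labels are taken from the SSSOM example earlier in this document.

```python
import re


def normalise(label):
    """Hypothetical preprocessing: lower-case and collapse whitespace."""
    return re.sub(r"\s+", " ", label.strip().lower())


def exact_lexical_matches(hp_labels, mp_labels):
    """Return SSSOM-like tuples for labels that are equal after normalisation."""
    mp_index = {}
    for mp_id, label in mp_labels.items():
        mp_index.setdefault(normalise(label), []).append(mp_id)

    matches = []
    for hp_id, label in hp_labels.items():
        for mp_id in mp_index.get(normalise(label), []):
            matches.append(
                (hp_id, "skos:exactMatch", mp_id, "semapv:LexicalMatching")
            )
    return matches


hp_labels = {
    "HP:0009124": "Abnormal adipose tissue morphology",
    "HP:0000411": "Protruding ear",
}
mp_labels = {
    "MP:0000003": "abnormal adipose tissue morphology",
    "MP:0000021": "prominent ears",
}

matches = exact_lexical_matches(hp_labels, mp_labels)
print(matches)
```

With primary labels alone, only the adipose tissue pair matches. Pairs such as "Protruding ear" and "prominent ears" would need synonyms or further preprocessing rules to be found.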

There are two pipelines currently generating uPheno mappings: a custom pipeline implemented in Python (which predates OAK), and an OAK lexmatch pipeline. Both live in the [uPheno pipeline repo](https://github.com/obophenotype/upheno-dev/blob/master/src/ontology/upheno.Makefile), but will probably be moved soon to the [uPheno repo](https://github.com/obophenotype/upheno/blob/master/src/ontology/upheno.Makefile).

_The custom lexical matching pipeline_ is essentially [a Python function (`generate_mapping_files(..)`) that combines a number of preprocessing steps with an SSSOM export](https://github.com/obophenotype/upheno-dev/blob/c12d11d973d622296c9750903b2423791fa40e1e/src/scripts/lib.py#L2056). There are a lot of custom rules for labels in this script, so it is unclear how much of it is covered by the much more standardised OAK pipeline.

_The OAK lexical matching pipeline_ runs with every uPheno release and simply executes OAK lexmatch (see the `$(TMPDIR)/upheno-species-lexical-oak.sssom.tsv` goal in https://github.com/obophenotype/upheno-dev/blob/master/src/ontology/upheno.Makefile).
The matching rules are maintained in a special [rules file](https://github.com/obophenotype/upheno-dev/blob/master/src/ontology/config/upheno-match-rules.yaml).
OAK lexmatch leverages synonyms and preprocessing to compute candidate mappings across all classes in an ontology in an efficient manner.

<a id="logical"></a>

#### Logical matching

The uPheno release system publishes both up-to-date logical and lexical matches.
In the past, the reasoner was used to compute logical matches; at the moment, logical matches are determined through a quick structural check.
Both come down to the same thing:

!!! info "When are two phenotypes considered 'logical matches'?"

    Two phenotypes P1 and P2 are considered logical matches if their EQ logical definitions are equivalent under the cross-species conflation assumption.
    This means, in practical terms: if the EQ logical definitions refer to the same characteristic, and the bearer of P1 corresponds to the bearer of P2 either by virtue of being identical (e.g. the same GO biological process) or of being cross-species exact matches (e.g. two species-specific anatomical classes are exact matches to the same Uberon class), we consider P1 and P2 a "logical match".

Note that you do not need a reasoner for this task: you can just map all bearer entities to their counterparts in the species-independent integration ontologies, and then match the bearer entities exactly using string equivalence.
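
The structural check described above can be sketched as follows. This is an illustration under stated assumptions, not the uPheno implementation: each EQ definition is reduced to a `(quality, bearer)` pair, and all quality and bearer identifiers, as well as the definitions themselves, are hypothetical placeholders rather than the real logical definitions of these terms.

```python
from collections import defaultdict

# Hypothetical, simplified EQ definitions: phenotype -> (quality, bearer).
eq_definitions = {
    "HP:0001640": ("QUALITY:increased_size", "HUMAN_ANATOMY:heart"),
    "MP:0000274": ("QUALITY:increased_size", "MOUSE_ANATOMY:heart"),
    "MP:0000003": ("QUALITY:abnormal_morphology", "MOUSE_ANATOMY:adipose_tissue"),
}

# Hypothetical cross-species exact matches to a species-independent ontology.
bearer_to_integration = {
    "HUMAN_ANATOMY:heart": "UBERON:heart",
    "MOUSE_ANATOMY:heart": "UBERON:heart",
    "MOUSE_ANATOMY:adipose_tissue": "UBERON:adipose_tissue",
}


def logical_matches(eq_definitions, bearer_to_integration):
    """Pair HP and MP terms whose conflated (quality, bearer) keys are equal."""
    by_key = defaultdict(list)
    for phenotype_id, (quality, bearer) in eq_definitions.items():
        # Bearers without a mapping (e.g. shared GO classes) are kept as-is.
        key = (quality, bearer_to_integration.get(bearer, bearer))
        by_key[key].append(phenotype_id)

    matches = []
    for ids in by_key.values():
        hp_ids = sorted(i for i in ids if i.startswith("HP:"))
        mp_ids = sorted(i for i in ids if i.startswith("MP:"))
        matches.extend((hp, mp) for hp in hp_ids for mp in mp_ids)
    return sorted(matches)


result = logical_matches(eq_definitions, bearer_to_integration)
print(result)
```

Grouping by a plain tuple key is what makes a reasoner unnecessary here: equivalence is reduced to string equality after the bearers have been conflated.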

<a id="llm"></a>

#### AI-assisted matching

There are no active pipelines that generate production-ready mappings using LLMs.
We plan to use [OntoGPT's MapperGPT](https://monarch-initiative.github.io/ontogpt/functions/#categorize-mappings) and OAK's [validate-mappings](https://incatools.github.io/ontology-access-kit/cli.html#runoak-validate-mappings) command to scale our mapping efforts in the near future (early 2025).

### Mapping availability

| Mapping set | Maintainers | Format | Description |
| ----------- | ----------- | ------ | ----------- |
| [HP-MP (manually curated)](https://github.com/mapping-commons/mh_mapping_initiative/tree/master/mappings) | Monarch Initiative, MGI, IMPC and others. | [SSSOM](https://mapping-commons.github.io/sssom) | Mapping sets connecting HP to MP terms, curated manually by multiple organisations. |
| [uPheno integration mappings](https://data.monarchinitiative.org/mappings/latest/upheno-species-independent.sssom.tsv) | Monarch Initiative | [SSSOM](https://mapping-commons.github.io/sssom) | This mapping set is generated as part of the uPheno pipeline and links HPO and MP terms (among others) with species-neutral phenotype terms in uPheno. |
| uPheno cross-species mappings | Monarch Initiative | [SSSOM](https://mapping-commons.github.io/sssom) | This mapping set is generated as part of the uPheno pipeline and links HPO and MP terms (among others) to each other based on lexical matching, logical matching and manual curation. |
| uPheno lexical mappings (OAK) | Monarch Initiative | [SSSOM](https://mapping-commons.github.io/sssom) | This mapping set is generated as part of the uPheno pipeline and links HPO and MP terms (among others) to each other based on simple lexical matching using OAK. |
| uPheno lexical mappings (Custom) | Monarch Initiative | [SSSOM](https://mapping-commons.github.io/sssom) | This mapping set is generated as part of the uPheno pipeline and links HPO and MP terms (among others) to each other based on simple lexical matching. |
| uPheno logical mappings | Monarch Initiative | [SSSOM](https://mapping-commons.github.io/sssom) | This mapping set is generated as part of the uPheno pipeline and links HPO and MP terms (among others) to each other based solely on their EQ logical definition. |
| uPheno Mapper GPT mappings | Monarch Initiative | [SSSOM](https://mapping-commons.github.io/sssom) | TBD. |
| [PhenIO semantic similarity matches](https://data.monarchinitiative.org/semantic-similarity/latest/index.html) | Monarch Initiative | [OAK Semantic Similarity](https://w3id.org/oak/similarity) | These are the semantic similarity matches computed over PhenIO, which contains uPheno. |