Duplicate mappings from individual classes to classes in other ontologies #117
Comments
The issue stems from a faulty SPARQL query that returns a paginated list of mappings for a particular ontology (to ALL other ontologies) or a list of mappings from a given class in an ontology to classes in other ontologies:
The problem with this query is that it doesn't restrict results to the latest submission of each ontology (the one with the highest ID and the status RDF). Instead, it queries ALL submissions, resulting in many duplicate or irrelevant mappings. The query below yields the IDs of all the LATEST submissions:
However, combining these two queries isn't trivial.
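To make the "latest submission" rule concrete, here is a minimal Python sketch of the selection logic (highest ID that carries the RDF status, per ontology). It is only an illustration, not the SPARQL used by the API: the `Submission` record, its field names, and `latest_rdf_submissions` are hypothetical, and the sample data is made up.

```python
from typing import Dict, FrozenSet, Iterable, NamedTuple


class Submission(NamedTuple):
    # Hypothetical record shape; the real metadata lives in the triple store.
    ontology: str              # ontology acronym, e.g. "DOID"
    submission_id: int         # numeric submission ID
    statuses: FrozenSet[str]   # status codes attached to the submission


def latest_rdf_submissions(submissions: Iterable[Submission]) -> Dict[str, int]:
    """Return, per ontology, the highest submission ID that has the RDF status."""
    latest: Dict[str, int] = {}
    for sub in submissions:
        if "RDF" not in sub.statuses:
            continue  # submissions without the RDF status are ignored
        if sub.submission_id > latest.get(sub.ontology, -1):
            latest[sub.ontology] = sub.submission_id
    return latest


# Made-up sample data: DOID submission 12 has no RDF status, so 11 wins.
sample = [
    Submission("DOID", 10, frozenset({"RDF"})),
    Submission("DOID", 12, frozenset({"UPLOADED"})),
    Submission("DOID", 11, frozenset({"RDF"})),
    Submission("UBERON", 3, frozenset({"RDF"})),
]
print(latest_rdf_submissions(sample))  # {'DOID': 11, 'UBERON': 3}
```

The hard part described above is expressing this same restriction inside the mappings query itself without enumerating every ID.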
Alternate solutions explored:
Both of these work, but they slow the original query almost to a halt: over 1200 IDs end up inside the filter.
Here is a version of the original query corrected with
This is the method that generates the faulty query:
OK, here's another approach. Add an attribute to every graph (well, probably for every graph, putting it in the metadata graph or similar) that indicates whether it is the most recent submission for that ontology.
Now, when running the main query, you don't have to filter every WHERE evaluation with a FILTER against 1200 graphs. Instead, you just test whether the attribute is true. And you can perform that test before the mapping query runs (would that be an outer WHERE clause?), so the mapping query only runs against the most recent graphs (instead of having its results filtered out afterwards). That should make the main query extremely fast to run on the fly, since it runs far fewer mapping queries. The submission graph attributes could all be maintained in an entirely separate graph, if we want to avoid adding an attribute to each graph. (It is more like metadata than content, so maybe it needs to be in a metadata graph.) But there has to be one entry for every submission that's in the triple store.
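A rough Python sketch of that idea, just to show the order of operations: keep a small metadata index of which graph is flagged as the latest per ontology, pick the graphs first, and only then read mappings from them. Everything here is hypothetical (the `LatestSubmissionIndex` class, the graph ID strings, and the in-memory `store` standing in for the triple store); it is not how the API or 4store is actually structured.

```python
from typing import Dict, List, Set, Tuple

# (source_class, target_class): hypothetical flattened mapping row
Mapping = Tuple[str, str]


class LatestSubmissionIndex:
    """Metadata kept apart from content: one 'latest' graph per ontology."""

    def __init__(self) -> None:
        self._latest_graph: Dict[str, str] = {}

    def register(self, ontology: str, graph_id: str) -> None:
        # Assumes submissions are registered in order, so a newer submission
        # simply takes the flag away from the previous graph.
        self._latest_graph[ontology] = graph_id

    def latest_graphs(self) -> Set[str]:
        return set(self._latest_graph.values())


def mappings_from_latest(
    store: Dict[str, List[Mapping]], index: LatestSubmissionIndex
) -> List[Mapping]:
    """Select the flagged graphs first, then read mappings only from those."""
    results: List[Mapping] = []
    for graph_id in sorted(index.latest_graphs()):
        results.extend(store.get(graph_id, []))
    return results


# Made-up data: the older DOID graph still holds a stale copy of the mapping.
store = {
    "DOID/submissions/1": [("DOID:1", "UBERON:1")],
    "DOID/submissions/2": [("DOID:1", "UBERON:1")],
    "UBERON/submissions/5": [("UBERON:1", "DOID:1")],
}
index = LatestSubmissionIndex()
index.register("DOID", "DOID/submissions/1")
index.register("DOID", "DOID/submissions/2")
index.register("UBERON", "UBERON/submissions/5")
print(mappings_from_latest(store, index))
```

The older DOID graph is never read, so its stale mapping never has to be filtered out after the fact. The cost is that the index must be updated every time a submission is added or removed.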
Another side effect of this issue is that because mappings from older submissions are returned, if a term had been removed between an earlier and later submission, the mappings to that term are still materialized, resulting in broken links leading to the term in question. This issue was reported in a separate ticket: ncbo/ontologies_api#85.
As a documentation point, the original BioPortal design assumed that only the latest submission graphs would be stored in the triple store, so the original mappings code was not written to filter out multiple submissions. As the size of the data grew, we discovered that deleting previous submissions' graphs from 4store was highly resource intensive, and the CRON job responsible for those deletions was paused. When this bug was discovered, I made a number of attempts to modify the underlying SPARQL query to filter out orphan data. Unfortunately, none of my experiments (see above) yielded a performant result. Once we move to AllegroGraph, we hope that its scalable backend will allow us to resume the job of deleting the orphan submission graphs, which should resolve this issue automatically.
Issue #115 addressed the mapping counts between ontologies being reported higher than the actual counts. An issue still remains, where mappings from individual classes in an ontology to classes in other ontologies appear with multiple duplicate entries.
In the BioPortal UI, this behavior is evident when browsing individual class mappings (vs. the global “Mappings” tab for each ontology).
The bug affects these cases:
a) mappings from an individual ontology to ALL other ontologies
b) mappings from a class within an ontology to the mapped classes from ALL other ontologies
For example:
https://bioportal.bioontology.org/ontologies/DOID?p=classes&conceptid=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0001062#mappings
Click on the “Class Mappings (158)” tab for the “anatomy” class and scroll all the way down. You will see a number of duplicate entries there. For example: “Mapping of Drug Names, ICD-11 and MeSH 2021” or “Intelligence Task Ontology” or four identical mappings to “Mapping of Epilepsy Ontologies”.