Duplicate mappings from individual classes to classes in other ontologies #117

mdorf · 2021-04-30T23:51:47Z

Issue #115 addressed the mapping counts between ontologies being reported higher than the actual counts. An issue still remains, where mappings from individual classes in an ontology to classes in other ontologies appear with multiple duplicate entries.

In the BioPortal UI, this behavior is evident when browsing the mappings of an individual class (as opposed to the global “Mappings” tab for each ontology).

The bug affects these cases:

a) mappings from an individual ontology to ALL other ontologies
b) mappings from a class within an ontology to the mapped classes from ALL other ontologies

For example:

https://bioportal.bioontology.org/ontologies/DOID?p=classes&conceptid=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0001062#mappings

Click on the “Class Mappings (158)” tab for the “anatomy” class and scroll all the way down. You will see a number of duplicate entries there. For example: “Mapping of Drug Names, ICD-11 and MeSH 2021” or “Intelligence Task Ontology” or four identical mappings to “Mapping of Epilepsy Ontologies”.

mdorf · 2021-04-30T23:58:51Z

The issue stems from a faulty SPARQL query that returns a paginated list of mappings for a particular ontology (to ALL other ontologies) or a list of mappings from a given class in an ontology to classes in other ontologies:

SELECT DISTINCT ?s1 ?s2 ?g ?source ?o
WHERE {
  {
    GRAPH <http://data.bioontology.org/ontologies/MONDO/submissions/41> {
        ?s1 <http://bioportal.bioontology.org/ontologies/umls/cui> ?o .
    }
    GRAPH ?g {
        ?s2 <http://bioportal.bioontology.org/ontologies/umls/cui> ?o .
    }
    BIND ('CUI' AS ?source)
  }
  UNION
  {
    GRAPH <http://data.bioontology.org/ontologies/MONDO/submissions/41> {
        ?s1 <http://data.bioontology.org/metadata/def/mappingSameURI> ?o .
    }
    GRAPH ?g {
        ?s2 <http://data.bioontology.org/metadata/def/mappingSameURI> ?o .
    }
    BIND ('SAME_URI' AS ?source)
  }
  UNION
  {
    GRAPH <http://data.bioontology.org/ontologies/MONDO/submissions/41> {
        ?s1 <http://data.bioontology.org/metadata/def/mappingLoom> ?o .
    }
    GRAPH ?g {
        ?s2 <http://data.bioontology.org/metadata/def/mappingLoom> ?o .
    }
    BIND ('LOOM' AS ?source)
  }
  UNION
  {
    GRAPH <http://data.bioontology.org/ontologies/MONDO/submissions/41> {
        ?s1 <http://data.bioontology.org/metadata/def/mappingRest> ?o .
    }
    GRAPH ?g {
        ?s2 <http://data.bioontology.org/metadata/def/mappingRest> ?o .
    }
    BIND ('REST' AS ?source)
  }
  FILTER ((?s1 != ?s2) || (?source = 'SAME_URI'))
  FILTER (!STRSTARTS(str(?g),'http://data.bioontology.org/ontologies/MONDO'))
} 
OFFSET 20 LIMIT 20

The problem with this query is that it doesn’t restrict the results to the latest submission of each ontology (the one with the highest ID and the status RDF). Instead, it queries ALL stored submissions, resulting in many duplicate/irrelevant mappings.

The query below yields the IDs of all the LATEST submissions:

SELECT (CONCAT(?ontology, "/submissions/", (MAX(?submissionId))) as ?id)
WHERE { 
  ?id <http://data.bioontology.org/metadata/ontology> ?ontology .
  ?id <http://data.bioontology.org/metadata/submissionId> ?submissionId .
  ?id <http://data.bioontology.org/metadata/submissionStatus> ?submissionStatus .
  ?submissionStatus <http://data.bioontology.org/metadata/code> "RDF" . 
  OPTIONAL { 
    ?id <http://data.bioontology.org/metadata/ontology> ?ontJoin .  
  } 
  OPTIONAL { 
    ?ontJoin <http://data.bioontology.org/metadata/viewOf> ?viewOf .  
  } 
  FILTER(!BOUND(?viewOf)) 
}
GROUP BY ?ontology

However, combining these two queries isn't trivial.
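For reference, the selection that the second query performs (highest submission ID with status RDF per ontology, views excluded) can be sketched in plain Python. This is only an illustration of the grouping logic; `SubmissionRow` and `latest_rdf_submissions` are hypothetical names and are not part of the BioPortal codebase.

```python
from typing import Iterable, NamedTuple, Optional


class SubmissionRow(NamedTuple):
    """One submission record (hypothetical shape, for illustration only)."""
    ontology: str                  # ontology IRI, e.g. ".../ontologies/ICO"
    submission_id: int
    status_codes: frozenset        # e.g. frozenset({"UPLOADED", "RDF"})
    view_of: Optional[str] = None  # set when the ontology is a view


def latest_rdf_submissions(rows: Iterable[SubmissionRow]) -> dict:
    """Map each non-view ontology to the graph IRI of its highest RDF submission."""
    latest = {}
    for row in rows:
        # Mirror the query: require the RDF status and skip views.
        if "RDF" not in row.status_codes or row.view_of is not None:
            continue
        if row.submission_id > latest.get(row.ontology, -1):
            latest[row.ontology] = row.submission_id
    return {ont: f"{ont}/submissions/{sid}" for ont, sid in latest.items()}
```

The result is the set of graph IRIs that the mappings query would ideally be restricted to.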

mdorf · 2021-05-01T00:01:28Z

Alternate solutions explored:

1. Running the second query separately in code, and then adding a large FILTER (?g IN (...)) or FILTER (... || ...) block to the first query:

FILTER(?g in (<http://data.bioontology.org/ontologies/ICO/submissions/16> , <http://data.bioontology.org/ontologies/DRPSNPTO/submissions/1>, ...))
OR
FILTER (?g = <http://data.bioontology.org/ontologies/ICO/submissions/16> || ?g = <http://data.bioontology.org/ontologies/DRPSNPTO/submissions/1> || ?g = ...)

Both of these work, but they slow the original query to a crawl. Over 1200 IDs are added inside this filter.

2. Running the original query as is and then filtering out the mappings from old submissions in code. This performs well but breaks the pagination, which is done in SPARQL itself.
3. Combining the two SPARQL queries as I would in SQL. 4store returns an error: SubSELECTs are not implemented.
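The pagination problem with filtering in code can be illustrated with a small sketch. The data below is made up, and `page` and `filter_latest` are hypothetical helpers standing in for the OFFSET/LIMIT done in SPARQL and the post-filter done in application code.

```python
def page(rows, offset, limit):
    """Stand-in for the OFFSET/LIMIT applied inside the SPARQL query."""
    return rows[offset:offset + limit]


def filter_latest(rows, latest_graphs):
    """Post-filter in code: keep only mappings found in a latest submission graph."""
    return [row for row in rows if row[1] in latest_graphs]


# Made-up result set: ICO has three stored submissions, TEO has one.
raw = [
    ("ico:A", "ICO/submissions/14"),
    ("ico:A", "ICO/submissions/15"),
    ("ico:A", "ICO/submissions/16"),
    ("teo:B", "TEO/submissions/4"),
]
latest = {"ICO/submissions/16", "TEO/submissions/4"}

# Page first (as SPARQL does), then filter each page in code.
page_sizes = [len(filter_latest(page(raw, off, 2), latest)) for off in (0, 2)]
print(page_sizes)
```

With a page size of 2, the first page comes back empty and the second holds both valid mappings, so page boundaries and totals no longer match what the client asked for.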

mdorf · 2021-05-01T00:03:18Z

Here is a version of the original query corrected with FILTER clauses, which produces the correct results but is extremely slow:

SELECT DISTINCT ?s1 ?s2 ?g ?source ?o
WHERE {
  {
    GRAPH <http://data.bioontology.org/ontologies/MONDO/submissions/41> {
        ?s1 <http://data.bioontology.org/metadata/def/mappingLoom> ?o .
    }
    GRAPH ?g {
        ?s2 <http://data.bioontology.org/metadata/def/mappingLoom> ?o .
    }
    BIND ('LOOM' AS ?source)
  }
  FILTER ((?s1 != ?s2) || (?source = 'SAME_URI'))
  FILTER (!STRSTARTS(str(?g),'http://data.bioontology.org/ontologies/MONDO'))
  FILTER (?g = <http://data.bioontology.org/ontologies/ICO/submissions/16> || ?g = <http://data.bioontology.org/ontologies/DRPSNPTO/submissions/1> || ?g = <http://data.bioontology.org/ontologies/GEOSPECIES/submissions/2> || ?g = <http://data.bioontology.org/ontologies/TEO/submissions/4> || ?g = <http://data.bioontology.org/ontologies/OMV/submissions/1> || ?g = <http://data.bioontology.org/ontologies/TMO/submissions/13> || ?g = <http://data.bioontology.org/ontologies/OPMI/submissions/16> || ?g = <http://data.bioontology.org/ontologies/OFSMR/submissions/19> || ?g = <http://data.bioontology.org/ontologies/MOOCCUADO/submissions/2> || ?g = <http://data.bioontology.org/ontologies/DISTEST/submissions/2> || ?g = <http://data.bioontology.org/ontologies/LIFO/submissions/1> || ?g = <http://data.bioontology.org/ontologies/CORON/submissions/30> || ?g = <http://data.bioontology.org/ontologies/MATRCOMPOUND/submissions/1> || ?g = <http://data.bioontology.org/ontologies/AGRO/submissions/3> || ?g = <http://data.bioontology.org/ontologies/HEIO/submissions/17> || ?g = <http://data.bioontology.org/ontologies/GAMUTS/submissions/23> || ?g = <http://data.bioontology.org/ontologies/EGO/submissions/1> || ?g = <http://data.bioontology.org/ontologies/CIDIO_V1/submissions/2> || ?g = <http://data.bioontology.org/ontologies/ISO19115ROLES/submissions/6> || ?g = <http://data.bioontology.org/ontologies/IDO/submissions/13> || ?g = <http://data.bioontology.org/ontologies/MARC-RELATORS/submissions/1> || ?g = <http://data.bioontology.org/ontologies/CDPEO/submissions/1> || ?g = <http://data.bioontology.org/ontologies/ICD10-CN/submissions/6> || ?g = <http://data.bioontology.org/ontologies/FB-CV/submissions/29> || ?g = <http://data.bioontology.org/ontologies/ILLNESSINJURY/submissions/1> || ?g = <http://data.bioontology.org/ontologies/NIFDYS/submissions/16> || ?g = <http://data.bioontology.org/ontologies/RCTV2/submissions/1> || ?g = <http://data.bioontology.org/ontologies/EMAPA/submissions/41> || ?g = 
<http://data.bioontology.org/ontologies/ONTOAD/submissions/2> || ?g = <http://data.bioontology.org/ontologies/TMA/submissions/1> || ?g = <http://data.bioontology.org/ontologies/HIVMT/submissions/6> || ?g = <http://data.bioontology.org/ontologies/HIVO004/submissions/27> || ?g = <http://data.bioontology.org/ontologies/ONTOBIOTOPE51/submissions/2> || ?g = <http://data.bioontology.org/ontologies/READMISSIONDIAB/submissions/1> || ?g = <http://data.bioontology.org/ontologies/SIO/submissions/86> || ?g = <http://data.bioontology.org/ontologies/PSIMOD/submissions/22>)   
}
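A FILTER like the one above would normally be generated in code from the list of latest submission IDs. The following is a minimal Python sketch of that step, using the IN form; `build_graph_filter` is a hypothetical helper, not the method in mappings.rb (which is Ruby), and it does nothing to address the slowdown described in this thread.

```python
from typing import Sequence

# Characters that may not appear inside a SPARQL IRI reference (<...>).
_FORBIDDEN = set('<>"{}|^`\\ ')


def build_graph_filter(graph_iris: Sequence[str], var: str = "?g") -> str:
    """Return a SPARQL FILTER restricting `var` to the given graph IRIs."""
    if not graph_iris:
        raise ValueError("at least one graph IRI is required")
    for iri in graph_iris:
        if _FORBIDDEN & set(iri):
            raise ValueError(f"invalid character in IRI: {iri!r}")
    members = ", ".join(f"<{iri}>" for iri in graph_iris)
    return f"FILTER ({var} IN ({members}))"
```

For example, `build_graph_filter(["http://data.bioontology.org/ontologies/ICO/submissions/16"])` returns `FILTER (?g IN (<http://data.bioontology.org/ontologies/ICO/submissions/16>))`.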

mdorf · 2021-05-01T00:19:43Z

This is the method that generates the faulty query:
https://github.com/ncbo/ontologies_linked_data/blob/master/lib/ontologies_linked_data/mappings/mappings.rb#L131

graybeal · 2021-05-01T00:43:17Z

OK, here's another approach.

Add an attribute to every graph (well, probably for every graph, putting it in the metadata graph or similar) that indicates whether it is the most recent submission for that ontology.

This can be set/reset via a script, using the 'latest ID of all submissions' query to find those graphs. (To reset the less recent attributes, reset any graph that is not in the list of most recent submissions or, more efficiently, for each ontology, clear the attribute on all submissions that match that ontology but aren't the latest.) I prefer this process, as it could be run daily, so only manually submitted ontologies would get old mappings, and only for the rest of that day.
Or it can be set/reset every time an ontology submission is processed—reset the attribute of previous submission(s) to false, set the attribute of the current submission to true. (Even if you occasionally miss resetting a previous attribute, we can spot it when we see a doubled mapping and delete it manually.)

Now, when running the main query, you don't have to filter every WHERE evaluation with a FILTER against 1200 graphs. Instead, you just test whether the attribute is true. And you can perform that test before the mapping query is performed (would that be an outer WHERE clause?), so the mapping query only gets performed against the most recent graphs (instead of getting filtered out after running the query).

That should make the main query extremely fast to run on the fly, since it runs far fewer mapping queries.

The submission graph attributes could all be maintained in an entirely separate graph, if we want to avoid adding an attribute to each graph. (It is more like metadata than content, so maybe it needs to be in a metadata graph.) But there has to be one entry for every submission that's in the triple store.
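The bookkeeping behind this proposal can be sketched as follows. This is an in-memory Python stand-in, assuming one boolean "is latest" attribute per submission; the class and method names are hypothetical, and a real implementation would store the attribute as triples in a metadata graph rather than in a dict.

```python
class LatestFlags:
    """In-memory stand-in for an 'is latest' attribute kept per submission graph."""

    def __init__(self):
        self.flags = {}  # (ontology acronym, submission id) -> bool

    def submission_processed(self, ontology, submission_id):
        """Register a processed submission and recompute the flags for its ontology."""
        self.flags[(ontology, submission_id)] = False
        newest = max(sid for (ont, sid) in self.flags if ont == ontology)
        # Only values of existing keys change here, so iterating the dict is safe.
        for (ont, sid) in self.flags:
            if ont == ontology:
                self.flags[(ont, sid)] = (sid == newest)

    def latest(self):
        """Submissions currently flagged as latest, sorted for stable output."""
        return sorted(key for key, is_latest in self.flags.items() if is_latest)
```

Recomputing the flag from the highest ID, rather than blindly marking the most recently processed submission, keeps the flags correct if an older submission is reprocessed.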

mdorf · 2022-05-18T23:07:51Z

Another side effect of this issue is that because mappings from older submissions are returned, if a term had been removed between an earlier and later submission, the mappings to that term are still materialized, resulting in broken links leading to the term in question. This issue was reported in a separate ticket: ncbo/ontologies_api#85.

mdorf · 2022-05-19T00:01:58Z

As a documentation point, the original BioPortal design assumed that only the latest submission graphs would be stored in the triple store, so the original mappings code was not written to filter out multiple submissions. As the size of the data grew, we discovered that deleting previous submissions' graphs from 4store was highly resource intensive, and the CRON job responsible for those deletions was paused.

When this bug was discovered, I made a number of attempts to modify the underlying SPARQL query to filter out orphan data. Unfortunately, none of my experiments (see above) yielded a performant result. Once we move to AllegroGraph, we hope that its scalable backend will allow us to resume the job of deleting the orphan submission graphs, which will automatically alleviate this issue.

mdorf added bug ready labels Apr 30, 2021

mdorf self-assigned this Apr 30, 2021

graybeal added KB Aim1 Maintenance KB1 Mx:Docs/Ilities labels Nov 23, 2021

graybeal mentioned this issue Nov 23, 2021

Duplicate LOOM mappings returned from /classes/:cls/mappings endpoint ncbo/ontologies_api#67

Open

pkalita-lbl mentioned this issue Apr 14, 2022

Add get_sssom_mappings_by_curie to BioportalImplementation class INCATools/ontology-access-kit#17

Merged

mdorf mentioned this issue May 18, 2022

Broken links in objects returned by mappings endpoint ncbo/ontologies_api#85

Closed

galviset referenced this issue in EarthPortal/ontologies_linked_data Mar 14, 2024

update submission_processed notification to add invalidate_cache para…

e98b884
