Use CURIEs to link Germline to Repertoire/Rearrangement #553

Open
bcorrie opened this issue Sep 22, 2021 · 19 comments · May be fixed by #770
Assignees
Labels
GLDB/germlines Germline database WG and Germline schema
Milestone

Comments

@bcorrie
Contributor

bcorrie commented Sep 22, 2021

Should resolve Task 3 in #157

  • link repertoire/rearrangement to germline. The link is currently at the rearrangement level but might want to reconsider this, maybe at the repertoire level with software processing and annotation.

See #157 (comment) for discussion.

@schristley
Member

That task item is old. It says the germline database link is at the rearrangement level, but this is no longer true: it was moved into DataProcessing a while ago. Unless this is referring to something else...

@bcorrie
Contributor Author

bcorrie commented Sep 22, 2021

For me this is addressing something else. The issue that this is capturing is formalizing that we will use CURIEs to refer to Germline objects, either Germline sets or Germline Genes - currently we have no formal indication as to what format fields like germline_set_id and germline_database take on and the formats we do have aren't really "computable". A generic string designation is "unsatisfactory" and CURIEs solve this problem 8-)

From #157 (comment) (with some edits) I suggested we could use CURIE nomenclature for Germline IDs as follows:

  • In our current DataProcessing object, we have germline_database. This could either be changed from a free-text string to a CURIE that points to a GermlineSet (e.g. germline_database = "OGRDB_GERMLINESET:42"), or be treated like an ontology term if we want both a text description of the germline database used and an identifier for it (e.g. germline_database.label = "OGRDB Germline, downloaded 2021-09-17" and germline_database.id = "OGRDB_GERMLINESET:42"). Similarly, this could be used to provide a more concise description when an IMGT germline set is used (e.g. germline_database = "IMGT:201736-4"). Presumably even if we re-engineer DataProcessing, something similar would still exist. DataProcessing objects will need to refer to the GermlineSets that were used to perform the data processing...
  • subject.genotype already refers to GermlineSets, so no work needs to be done there to reference GermlineSet but we need to specify a more rigorous mechanism for specifying a germline_set_id.
  • If we wanted to go even further, we could add rearrangement fields that store gene IDs for VDJC calls. v_call = "IGHV1-1*01", v_call_id = "OGRDB_GENE:44". This of course could be derived from going back to the Repertoire and presumably looking up v_call in the germline_database stored in the DataProcessing object so would be redundant.
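
The two representations suggested in the first bullet can be sketched roughly as follows. This is only an illustration in Python: the `GermlineDatabaseRef` class and the `as_plain_curie` helper are hypothetical names, not part of the AIRR schema, and the values are the examples from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GermlineDatabaseRef:
    """Hypothetical ontology-style reference: a CURIE plus a readable label."""
    id: str     # CURIE, e.g. "OGRDB_GERMLINESET:42"
    label: str  # free-text description of the germline database


def as_plain_curie(ref: GermlineDatabaseRef) -> str:
    """First option from the list: store only the CURIE string."""
    return ref.id


# Second option: keep both the identifier and the description.
ref = GermlineDatabaseRef(
    id="OGRDB_GERMLINESET:42",
    label="OGRDB Germline, downloaded 2021-09-17",
)

# First option: collapse to the CURIE alone.
germline_database = as_plain_curie(ref)
```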

@bcorrie bcorrie added this to the AIRR v2.0.0 milestone Jan 17, 2022
@bcorrie
Contributor Author

bcorrie commented Mar 29, 2022

@williamdlees a question regarding connecting OGRDB Germline Sets to the AIRR Schema. I was just looking at OGRDB's germline sets, and was wondering if I understood how things were working.

On OGRDB the ID of the Germline Set is referenced in the URL as such: https://ogrdb.airr-community.org/germline_set/3

On the History tab for this germline set it says this is G00003.

If I wanted to refer to this Germline Set in the AIRR Schema how would I go about doing that? There seem to be two places where this might occur:

Genotype.documented_alleles.germline_set_ref
GermlineSet.germline_set_id

If we used the AIRR CURIE schema, we could have an OGRDB CURIEMap as follows:

  OGRDB_GERMLINESET:
    type: catalog
    default:
      map: OGRDB
      provider: OGRDB
    map:
      OGRDB:
        iri_prefix: "https://ogrdb.airr-community.org/germline_set/"

If I then set Genotype.documented_alleles.germline_set_ref = "OGRDB_GERMLINESET:3" then that CURIE would resolve to:

https://ogrdb.airr-community.org/germline_set/3
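
The resolution step above can be sketched in a few lines of Python. This is a minimal sketch, assuming resolution means looking up the prefix, following its default map, and appending the reference to `iri_prefix`; `CURIE_MAP` and `resolve_curie` are hypothetical names, and the dict simply mirrors the CURIEMap fragment quoted in this comment.

```python
# Plain-dict mirror of the CURIEMap fragment quoted above (assumption: only
# the fields shown there are needed for resolution).
CURIE_MAP = {
    "OGRDB_GERMLINESET": {
        "type": "catalog",
        "default": {"map": "OGRDB", "provider": "OGRDB"},
        "map": {
            "OGRDB": {
                "iri_prefix": "https://ogrdb.airr-community.org/germline_set/",
            },
        },
    },
}


def resolve_curie(curie: str, curie_map: dict = CURIE_MAP) -> str:
    """Expand 'PREFIX:reference' using the prefix's default map entry."""
    # Split on the first colon only: everything after it is the reference.
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in curie_map:
        raise ValueError(f"cannot resolve CURIE: {curie!r}")
    entry = curie_map[prefix]
    default_map = entry["default"]["map"]
    return entry["map"][default_map]["iri_prefix"] + reference


print(resolve_curie("OGRDB_GERMLINESET:3"))
```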

Now this is different than what you have in the AIRR Spec description, as the description for Genotype.documented_alleles.germline_set_ref says:

        germline_set_ref:
            type: string
            description: Unique identifier of the germline set and version, in standardized form (Repo:Label:Version)
            example: OGRDB:Human_IGH:2021.11
            x-airr:
                nullable: false

What are your thoughts on using the CURIE mechanism above to resolve this field? If you look at the versions tab on OGRDB for this Germline Set it has all the above information:

[BALB/c IGH](https://ogrdb.airr-community.org/germline_set/3)	Mouse	BALB/c	IGH	1	2022-02-28
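
For comparison, the `Repo:Label:Version` form in the current spec description is easy to split even though it is not a CURIE. A minimal Python sketch, assuming exactly three non-empty colon-separated parts; `GermlineSetRef` and `parse_germline_set_ref` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class GermlineSetRef(NamedTuple):
    repo: str
    label: str
    version: str


def parse_germline_set_ref(ref: str) -> GermlineSetRef:
    """Split a 'Repo:Label:Version' string into its three parts."""
    parts = ref.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected Repo:Label:Version, got {ref!r}")
    return GermlineSetRef(*parts)


parsed = parse_germline_set_ref("OGRDB:Human_IGH:2021.11")
print(parsed.repo, parsed.label, parsed.version)
```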

@williamdlees
Contributor

Pasting this here as the mail reply didn't make it in to the thread

It’s probably best to get the set from the REST API at https://ogrdb.airr-community.org/api/ rather than the UI. Sorry, I could publish this a bit better, I will put some details on the Germline Sets page for a start.

The germline set will always have an identifier G followed by a number and the identifier will not change between versions.

From the API you’d retrieve the set as, for example, https://ogrdb.airr-community.org/api/germline/set/G00003/1 . It sounds as though this would map quite nicely – maybe OGRDB_GERMLINESET:G00003:1 ??

If that’s ok I can change the examples
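
The mapping described here, from a set identifier and version to the API URL and to the proposed identifier, might look like this in Python. The URL pattern is taken from the single example above rather than from the API documentation, and `api_url` and `proposed_id` are hypothetical helper names.

```python
# Base path taken from the example URL in this comment (assumption: all
# germline sets follow the same <base>/<set id>/<version> pattern).
API_BASE = "https://ogrdb.airr-community.org/api/germline/set/"


def api_url(set_id: str, version: int) -> str:
    """Build the REST API URL for one version of a germline set."""
    return f"{API_BASE}{set_id}/{version}"


def proposed_id(set_id: str, version: int) -> str:
    """Build the identifier form floated in this comment."""
    return f"OGRDB_GERMLINESET:{set_id}:{version}"


print(api_url("G00003", 1))
print(proposed_id("G00003", 1))
```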


@bcorrie
Contributor Author

bcorrie commented Feb 6, 2024

@williamdlees has this been resolved? I am triaging AIRR v2.0 issues 8-)

@bcorrie
Contributor Author

bcorrie commented Feb 6, 2024

Currently it doesn't look like germline_set_ref is a CURIE mappable entity as there are two levels to the reference (e.g. OGRDB:Human_IGH:2021.11) so maybe not. But if that is the case for v2.0 we should remove this issue from v2.0 and perhaps close it if that field can't be mapped with a CURIE.

@williamdlees
Contributor

williamdlees commented Feb 7, 2024 via email

@bcorrie
Contributor Author

bcorrie commented Feb 7, 2024

@williamdlees I am almost 100% sure that CURIEs only have a single IRI tag followed by a single identifier. So something like "OGRDB_GERMLINESET:G00003:1" mapping to "https://ogrdb.airr-community.org/api/germline/set/G00003/1" would not be valid CURIE processing/parsing.

Don't get me wrong, the ID "OGRDB_GERMLINESET:G00003:1" is easily parsed as an ID, but it does not fit the CURIE format. If that was a CURIE and the IRI tag "OGRDB_GERMLINESET" was mapped to "https://ogrdb.airr-community.org/api/germline/set/" then this would resolve to:

https://ogrdb.airr-community.org/api/germline/set/G00003:1

I think it is fine to have the ID as you have it defined if that fits your needs. It just isn't CURIE parseable, and it can't go into the CURIEMap object in the spec.

So we could consider this resolved as is. We have decided that CURIEs don't fit the needs of germline set IDs. Therefore we don't need to change your ID definition and we don't need to update the CURIEMap. I think that is the most simple path forward. This can always be revisited later...
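
The mismatch described in this comment can be reproduced with a naive prefix-substitution resolver. This is a sketch under the assumption that resolution splits on the first colon and concatenates; `IRI_PREFIXES` and `expand` are hypothetical names, not the AIRR library's API.

```python
IRI_PREFIXES = {
    "OGRDB_GERMLINESET": "https://ogrdb.airr-community.org/api/germline/set/",
}


def expand(curie: str) -> str:
    """Replace the prefix (text before the first colon) with its IRI prefix."""
    prefix, _, reference = curie.partition(":")
    return IRI_PREFIXES[prefix] + reference


expanded = expand("OGRDB_GERMLINESET:G00003:1")
wanted = "https://ogrdb.airr-community.org/api/germline/set/G00003/1"

print(expanded)            # the second colon stays in the reference
print(expanded == wanted)  # the API path uses '/', so the two differ
```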

@schristley
Member

And this is somewhat of an aside, but as part of the AKC work, the OGRDB API needs to be reviewed and updated to bring it more in compliance as well as add missing functionality. It might be more efficient to do all that together instead of piecemeal.

Nevertheless, my opinion is that germline_set_ref should be a CURIE mappable entity.

@williamdlees
Contributor

williamdlees commented Feb 8, 2024 via email

@bussec
Member

bussec commented Feb 9, 2024

@williamdlees The relevant documentation can be found here:

In a nutshell: You can have a : as part of the reference, but not of the prefix. To avoid potential confusion (and overly simplified parsing routines), it would be best to avoid having more than a single colon. . and / should be safe.

@bcorrie
Contributor Author

bcorrie commented Feb 9, 2024

> In a nutshell: You can have a : as part of the reference, but not of the prefix. To avoid potential confusion (and overly simplified parsing routines), it would be best to avoid having more than a single colon. . and / should be safe.

@bussec is that correct? Would '/' not cause problems? CURIEs rely on IRIs and '/' is a special character in IRI space. If you have a '/' in the CURIE reference it would be interpreted as a '/' in the IRI and interpreted as an IRI path, no?

Now I suppose if you had OGRDB_GERMLINESET:G00003/1 and "OGRDB_GERMLINESET" was mapped to "https://ogrdb.airr-community.org/api/germline/set/" then this would resolve to an IRI as:

https://ogrdb.airr-community.org/api/germline/set/G00003/1

That is what @williamdlees is looking for, and it would work I suppose, but encoding IRI path in the CURIE reference doesn't seem to be how CURIEs were intended to be used???

@bcorrie
Contributor Author

bcorrie commented Feb 9, 2024

> It’s not a question of fitting my needs, the choice of : as a delimiter between the germline set and version was arbitrary. Is there a convention for that delimiter in the CURIE world?

@williamdlees What I am unclear on is whether there are two name/ID spaces, each with its own set of identifiers ("germline set" and "version"), or whether there can be one name/ID space ("versioned germline set"). With one namespace you could have:

OGRDB_GERMLINESET:G00003-1 with "OGRDB_GERMLINESET=https://ogrdb.airr-community.org/api/germline/set/"

Would give you: https://ogrdb.airr-community.org/api/germline/set/G00003-1

Or

OGRDB_GERMLINESET:G00003-1 with "OGRDB_GERMLINESET=https://ogrdb.airr-community.org/api/germline/set?"

Which would give you a GET query: https://ogrdb.airr-community.org/api/germline/set?G00003-1

The query approach might be the better one, as then you can parse the ID in whatever way you want. You could encode whatever you wanted in the ID and the query would parse it and return the correct information for that ID.

Note if you needed both API interfaces, you could make it such that:

https://ogrdb.airr-community.org/api/germline/set?G00003-1
https://ogrdb.airr-community.org/api/germline/set/G00003/1

gave the same information, the first being the one that was used for CURIE resolution.
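
On the resolving side, the query approach could parse the single identifier back into a set and a version. The sketch below assumes the identifier has the form `<set>-<version>` and splits on the last hyphen; `split_versioned_id` and `path_style_url` are hypothetical helpers, not part of the OGRDB API.

```python
from typing import Tuple
from urllib.parse import urlsplit

PATH_BASE = "https://ogrdb.airr-community.org/api/germline/set/"


def split_versioned_id(versioned_id: str) -> Tuple[str, str]:
    """Split 'G00003-1' into ('G00003', '1') on the last hyphen."""
    set_id, sep, version = versioned_id.rpartition("-")
    if not sep or not set_id or not version:
        raise ValueError(f"expected <set>-<version>, got {versioned_id!r}")
    return set_id, version


def path_style_url(query_url: str) -> str:
    """Map the query-style URL to the equivalent path-style URL."""
    versioned_id = urlsplit(query_url).query  # text after the '?'
    set_id, version = split_versioned_id(versioned_id)
    return f"{PATH_BASE}{set_id}/{version}"


print(path_style_url("https://ogrdb.airr-community.org/api/germline/set?G00003-1"))
```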

@williamdlees
Copy link
Contributor

williamdlees commented Feb 18, 2024 via email

@bussec
Member

bussec commented Feb 18, 2024

@bcorrie After some meditation on the sacred scripture of RFC3987 and the epiphany that their notation is indentation-sensitive, one correction and one comment to my previous statement:

  • Correction: RFC3987 does indeed allow colons in the irelative-ref, but not in the first segment (Page 7)
  • Comment: The slash / is a special character in so far as it separates segments of the combined path. However, this should not be relevant to the construction of an IRI from a CURIE (which was the original question), it only becomes relevant on the side of the resolving service.
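
A rough check of the rule as stated in this comment could look like the following. It is a deliberate simplification and not an RFC3987 parser: it only verifies that a prefix and a reference are present and that the first path segment of the reference contains no colon. `check_curie` is a hypothetical name.

```python
def check_curie(curie: str) -> bool:
    """Return True if the CURIE passes the simplified colon rule."""
    # The prefix is everything before the first colon, so it cannot
    # contain a colon by construction.
    prefix, sep, reference = curie.partition(":")
    if not sep or not prefix or not reference:
        return False
    # Only the first segment (text before the first '/') is restricted.
    first_segment = reference.split("/", 1)[0]
    return ":" not in first_segment


for candidate in (
    "OGRDB_GERMLINESET:G00003/1",
    "OGRDB_GERMLINESET:G00003-1",
    "OGRDB_GERMLINESET:G00003:1",
):
    print(candidate, check_curie(candidate))
```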

@bcorrie
Contributor Author

bcorrie commented Feb 19, 2024

@williamdlees I think that would work. I think all we would need to do is update the descriptions for the three instances of germline_set_ref in the spec, is that correct?

@javh javh added this to Germlines Sep 9, 2024
@github-project-automation github-project-automation bot moved this to To do in Germlines Sep 9, 2024
@javh javh added GLDB GLDB/germlines Germline database WG and Germline schema and removed GLDB labels Sep 9, 2024
@javh
Contributor

javh commented Sep 9, 2024

@williamdlees, can we remove this from the AIRR v2.0 milestone and move it to the AKC milestone? Do we need this functionality in the AIRR schema for v2.0? Do you think we can resolve the questions about the prefix and target URI?

@williamdlees
Contributor

I'm happy to work with whatever milestone makes sense to the group.

You may be aware that, with substantial input from Scott, we have drafted and implemented a revised API for OGRDB which is OpenAPI 3 compatible. Details here: https://ogrdb.airr-community.org/api_v2/swagger/.

I'm afraid I don't feel confident to draft a CURIE definition that will pass muster with the group, as the history on this thread shows, but if someone else can propose one that is satisfactory, and complies with our API schema, I am very happy to implement it in OGRDB.

Projects
Status: To do
5 participants