We need a more formal, fully-qualified identifiers for repository objects #347
Just to clarify, are we talking about a formal DOI, as in https://www.doi.org/ ? I think at a minimum an AIRR compliant repository should have a formal DOI. Beyond that, I am not sure how far down the DOI path we should go... It makes some sense to me that the study data for a specific study in a specific repository could use a DOI, but given that most studies, through their publication, would already have a DOI, this might be overkill. If we added a study_doi field to the metadata (for the publication DOI), that might cover it. If referring to the data in a specific study in the AIRR Data Commons, the combination of the Repository DOI and the Study DOI (findable in the study_doi field) would suffice. My gut feeling is that going down the DOI path much further than that might be overkill (formal DOI generation requires a DOI provider), but certainly we could and possibly should use UUIDs as per https://tools.ietf.org/html/rfc4122.html for internal objects to provide a uniqueness criterion. They are easy to generate, in that many languages have libraries that generate them... I would also note that the study_id field can be considered a unique identifier if the definition in the spec is followed as per |
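As a minimal sketch of the UUID suggestion above, Python's standard `uuid` module can generate RFC 4122 identifiers. The helper name `new_object_id` is hypothetical and not part of the AIRR schema; the choice of version 4 (random) UUIDs is also an assumption.

```python
import uuid


def new_object_id() -> str:
    """Return a random (version 4) UUID in its canonical string form.

    Hypothetical helper for illustration; not part of any AIRR library.
    """
    return str(uuid.uuid4())


oid = new_object_id()
parsed = uuid.UUID(oid)  # round-trips through the canonical form
print(oid, parsed.version)
```

Such a value gives uniqueness without a registry, but it says nothing about where the object is stored, so it is not resolvable on its own.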
No, just DOI in the context of the FAIR standard, which doesn't require the doi.org service to be used. The FAIR paper defines DOI this way:
So I think using a URL (https://vdjserver.org/airr/v1/repertoire/abc) to access the data object itself would be sufficient.
I think this might be worthwhile, but we should probably lump this into the discussion about the "registry" which the CRWG hasn't really defined yet...
Me too, so this isn't about digging a deeper hole. It's about how we are going to ensure that when you get an AIRR file (off the web, sent in email, supplemental file with an article, etc.), you can get back to the original object in the data repository. I think of this as a provenance issue, but it is also a practical issue. I may give you a |
To my understanding there is only one type of DOI and that's the one governed by doi.org. I agree that Wilkinson et al. describe it as if it were a generic term, but IMO it's not. Out of curiosity I just checked on the costs, and at 0.06 USD per DOI it would be feasible to create DOIs at least for study objects (fees can be found at https://www.crossref.org/fees/ ). The advantage of a DOI vs a UUID is that it is clear how to resolve it. However, I don't know whether the record it resolves to is clearly defined. IMO it would not hurt if the data of a study has a separate DOI than the publication located at a publisher's site. But I agree that we don't want to create DOIs for each single |
Is this something we need to resolve for ADC API v1? |
Summarizing a discussion that @schristley, @bcorrie and I had via mail. Will probably not require any direct action, just putting it here for future reference: The generic term for the feature we are looking for is "persistent identifier" (PID), of which the DOI would be a specific implementation. EOSC has its own sub-working group to address PID usage, which recently published a document on it [DOI:10.5281/zenodo.3574203]. In the document a PID is defined as:
The question is whether we really need all these features for all AIRR objects, i.e., how far would we go with PIDs, when would (non-resolvable) UUIDs come in handy, and where do we only need local uniqueness? The four main levels that a PID could be applied to are:
|
Probably not; we mainly need it for the new (experimental) objects like Clone, Cell, etc., so we can resolve it in concert with their release. |
I've reviewed the W3C standard for decentralized identifiers, and it looks like it will work quite well for our purposes. I'm considering this standard just for the identifiers in the AIRR Data Model used to reference AIRR objects; external identifiers outside our control are handled with #464. A decentralized identifier (DID) has a simple syntax consisting of three parts separated by colons:
where The DID method is the key part. It is somewhat analogous to the first part of a CURIE. It's creating the unique namespace for the identifiers. Also, according to the spec, "a DID method defines how implementers can realize the features described by this specification". We need to define a DID method and SHOULD register it with DID Registry. So in an odd twist, creating a decentralized identifier suggests registering in a central repository namespace... though it's not mandatory. Anyways, my suggestion is we define and register the
The DID spec talks a lot about verification, security, etc., but all of those capabilities are optional. The DID method must implement a number of functions for DID resolution and URL dereferencing, though the spec leaves it almost completely open for how the DID method does that. Conceptually I find this very similar to how we are resolving CURIE identifiers, and I believe we can implement much the same for DIDs. What's left for us to consider is how to define the DID method-specific identifier. There is no requirement in the DID spec that the TYPE of resource, which the DID references, must be the same. So we could do something simple with just numbers, like this, but this doesn't provide us enough flexibility as we want identifiers to resolve to different repositories in the ADC.
With the DID method as
Then DIDs look like this:
but even this isn't quite complete, because does
But it's equally valid to combine those two sub-namespaces together into one like so; we have complete control over the format.
Currently, I prefer the first option with two namespaces. Hopefully now we can see how DIDs can be implemented. Like CURIEs, we have a resolution table in the AIRR schema that defines how
|
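To make the three-part DID syntax described above concrete, here is a minimal parsing sketch based on the generic W3C grammar (`did:` + method name + method-specific identifier). The method name `example` and the `repo1:abc123` layout of the method-specific identifier are placeholders chosen for illustration; they are not the method or sub-namespace format proposed in this thread.

```python
import re
from typing import NamedTuple

# Characters allowed in a method-specific identifier segment (W3C DID Core ABNF).
_IDCHAR = r"(?:[A-Za-z0-9._-]|%[0-9A-Fa-f]{2})"
# "did:" method-name ":" method-specific-id, where the id may contain colons.
_DID_RE = re.compile(
    rf"did:(?P<method>[a-z0-9]+):(?P<msid>(?:{_IDCHAR}*:)*{_IDCHAR}+)"
)


class DID(NamedTuple):
    method: str
    method_specific_id: str


def parse_did(value: str) -> DID:
    """Split a DID into its method and method-specific identifier."""
    m = _DID_RE.fullmatch(value)
    if m is None:
        raise ValueError(f"not a valid DID: {value!r}")
    return DID(m.group("method"), m.group("msid"))


# Placeholder DID: "example" method, hypothetical repository/object sub-namespaces.
did = parse_did("did:example:repo1:abc123")
print(did.method)                         # example
print(did.method_specific_id.split(":"))  # ['repo1', 'abc123']
```

Because colons are legal inside the method-specific identifier, a DID method is free to define sub-namespaces (for instance repository and object) or to fold them into a single segment; the parser above does not care which.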
In case we decide against decentralized identifiers, the URN service run by GEANT might be a potential way to coin PIDs without having to run the registry: https://tools.ietf.org/html/rfc4926 |
The dual usage/requirements for identifiers that link/reference AIRR objects within the AIRR Data Model continue to bite us. The dual usage being:
For An easy idea is to separate the fields, i.e. have Today, tools assume that One could make the argument, like with Another idea is to not have I don't see a solution that doesn't require having separate fields if we want to store both the original identifier value and the ADC PID. Any other ideas? To summarize:
|
My 2 cents on this:
|
Can I throw out a 4th option? What about some sort of provenance object to store these relationships? It'd be essentially the same thing as |
The problem with the |
I think the question of having globally unique identifiers for objects in ADC repositories and managing provenance and how such globally unique objects are related to each other are two different topics, no? |
BTW, I like this structure because the |
I'm seeing two potential solutions:
We've mainly been considering 1 but
Any other pros/cons? Regardless of 1 or 2, the ADC needs the ability to overwrite any local values assigned by tools when data is loaded into the ADC. IMO, I'm leaning toward 1 at the moment. The main CON is it requires re-assigning identifier values in the ADC, but I think the flexibility of a CURIE-like resolver is a significant PRO. |
😱 |
But we already do this for In any case, what you're describing seems to be a "backend" ADC feature/use, so I don't think it should complicate end user-facing |
It also seems bizarre and bewildering to me that we are so adamant that we throw this information away! Why is there such a reluctance to having an extra field that captures this info as part of the standard? There is a very strong data curation use case to keep it, so I am also bewildered... 8-) The standard isn't just about analysis, but data reusability and data curation. |
Yep, and I argued strongly against that one too - but caved in because it was only one field... |
Because there is no "information" there that is being discarded! And trying to preserve the original value of the field by adding a new field pollutes the schema without adding any analysis benefit in the ways that @javh and I have been arguing through (apparently) two entire threads now :-) |
I don't agree - throwing away information that an annotation tool provides has nothing to do with the ADC. This is 100% a curation process issue. |
This implies that you are also storing the entire dataset in its original format somewhere accessible-but-outside-of-the-ADC?!? But isn't the point of the ADC to be the copy of record so that the original becomes irrelevant? Do you really have 2 copies of everything in iReceptor? |
It's not "information." Metadata, perhaps. And if curation isn't part of the ADC, then who are we doing this for? It's not part of the end-user data reuse process... |
I think having |
Call them javh edit: Sorry @schristley, I accidentally edited this instead of quoting (I don't know how). Should be restored now. |
Nope, but we want to support reproducibility wherever we can... So no data in the pipeline is ever really irrelevant. The point of the ADC is data sharing, data reuse, and reproducibility. I would argue that is also the point of the AIRR Standard as well. The AIRR Standard points to source records of information throughout. SRA files ( Curation is part of this entire process - it is not specific to the ADC. If you describe a study using the AIRR Standard, you are curating data according to the AIRR Standard. If you want to be truly reproducible, at any point in the processing pipeline, I should be able to use the AIRR Standard to go from one processing step to another processing step, and be able to reproduce where a piece of data came from. Here is my curator use case. I don't need to be using the ADC for this, this could be using studies curated for analysis and stored completely on disk using the AIRR format files for repertoire, rearrangement, cell, clone, etc. As a data curator if I want to confirm that data in my AIRR files (or my ADC repository) is correct, I SHOULD be able to go back to my source files and confirm this is indeed the case. When I lost sequence_id, I lost the ability to do that for the original fastq files - damn, but hey it is only the sequence that we are talking about, and we have millions 8-) But now we are talking about cells, which have complicated linkages across rearrangements, clones, cells, and gex data. In the case of annotation tools, these linkages are across many files. So when I process some 10X studies (N samples from one study and M samples from another study) generating AIRR compliant files in preparation for analysis, I replace the source 10X cell_id with a unique AIRR cell_id to make sure Now I want to confirm that the data I just processed for a certain 10X cell_id (TACGGATGTACACCGC-1) from a single subject in my source data is correct across the data I am going to use for my analysis. I can't...
Similarly, if I want to look at an AIRR unique cell_id in my processed data and then find the source information in the original 10X produced data files. Again, I can't... So we have broken the link between the data in the AIRR compliant files to the original source data - data/"information" can no longer be mapped between the two... Now if you truly trust the tools that do all of that processing, then maybe you don't want to do any provenance or reproducibility checks... But that is not how I would do things 8-) Here is an example of what you get from a repository with our current implementation. If I maintain the annotation tool cell_id in some form, I can cross check the validity of the data I loaded with the original 10X files. If I don't, I can't... If you are a data steward maintaining an ADC repository, this is an important step... Basically I want to be able to ensure that
|
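The curation use case above can be sketched as follows: when source `cell_id` values are replaced with unique ones, a side table records the mapping so the link back to the original 10X files is not lost. This is only an illustration under stated assumptions; the function name, the `source` label and the `original_cell_id` key are hypothetical and are not fields of the AIRR schema.

```python
import uuid


def assign_unique_cell_ids(records, source_label):
    """Replace each record's cell_id with a UUID and keep a provenance table.

    Records sharing an original cell_id (e.g. paired chains) receive the same
    new cell_id, so linkage between them is preserved.
    """
    mapping = {}  # original cell_id -> new unique cell_id, for this source only
    new_records = []
    for rec in records:
        original = rec["cell_id"]
        if original not in mapping:
            mapping[original] = str(uuid.uuid4())
        new_records.append({**rec, "cell_id": mapping[original]})
    # Hypothetical provenance rows; "original_cell_id" is not an AIRR field.
    provenance = [
        {"source": source_label, "original_cell_id": old, "cell_id": new}
        for old, new in mapping.items()
    ]
    return new_records, provenance


rows = [
    {"sequence_id": "s1", "cell_id": "TACGGATGTACACCGC-1"},
    {"sequence_id": "s2", "cell_id": "TACGGATGTACACCGC-1"},
]
new_rows, prov = assign_unique_cell_ids(rows, "study1/sample1")

# Cross-check in either direction using the provenance table.
by_new = {p["cell_id"]: p["original_cell_id"] for p in prov}
print(by_new[new_rows[0]["cell_id"]])  # TACGGATGTACACCGC-1
```

Whether such a table lives in an extra field, a separate provenance object, or outside the standard entirely is exactly the open question in this thread; the sketch only shows what information would need to survive.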
@bcorrie We don't have to keep going round and round this in this issue. I brought up the issue initially, and I was happy with doing a custom solution, but you'd like something more formal, which is fine. That's been recognized with #589 and we can discuss solutions there. Let's get this issue back onto its main track of FAIR for ADC objects. |
I don't think it helps. At least, not as I'm interpreting it. The I guess the question is whether that's a problem. |
I'm not sure what you mean by "foreign". If you are thinking "foreign key", that's not what is meant. I also don't understand how "update the foreign record" matters. This is persistent access to a read-only object. According to FAIR, (meta)data are assigned a globally unique and persistent identifier. There isn't the requirement that these two attributes are satisfied by a single field. For example, IEDB splits them into two fields, one which is the identifier (which doesn't look globally unique but is because IEDB is a central database), and another which is the IRI for persistence.
|
@schristley, Ah, I see... maybe. I'm getting my signals crossed here. I was thinking of The |
Right. Sorry, I was mentioning IMO,
|
I just thought of another major CON for doing 2 instead of 1.
For example, say I had a rearrangement record that references a |
Thinking about the actual content of the identifier, if we go with a CURIE-like structure, where we need a resolver, we can support decentralized identifiers later on, if we want. It would just involve extending the resolver code. We can support both and repositories can pick the one they want to implement. The other thing is whether a type is needed as part of the identifier:
But this maybe isn't needed? The reason is the field, Another point is that the complete value is the identifier value, so an ADC API call for that specific
Likewise, when sending a POST query
If this wasn't the case, that is, if just the trailing code (or number) was the identifier, users would have to constantly parse the value to pull out the appropriate bits. This also means that our CURIE-like resolver cannot manipulate the identifier in any way, which is done for some ontology fields. If the identifier values change, for queries, for data returned from the ADC, etc., then it fails at being an identifier and objects cannot be linked. |
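A minimal sketch of the CURIE-like resolver idea discussed above: the prefix selects a repository from a resolution table, while the complete, unmodified value stays the identifier that is sent to that repository. The prefixes, base URLs and URL layout below are assumptions made up for illustration, and the object type is passed separately on the assumption that the field holding the identifier already implies it.

```python
from urllib.parse import quote

# Hypothetical resolution table: CURIE-like prefix -> repository base URL.
RESOLVER_TABLE = {
    "repo1": "https://repo1.example.org/airr/v1",
    "repo2": "https://repo2.example.org/airr/v1",
}


def resolve(identifier: str, object_type: str) -> str:
    """Return the URL for an object, given its full prefixed identifier.

    The prefix is read only to pick the repository; the identifier itself is
    passed through whole (percent-encoded for use in a URL path).
    """
    prefix, sep, local = identifier.partition(":")
    if not sep or not prefix or not local:
        raise ValueError(f"expected 'prefix:local', got {identifier!r}")
    if prefix not in RESOLVER_TABLE:
        raise ValueError(f"unknown prefix: {prefix!r}")
    base = RESOLVER_TABLE[prefix]
    return f"{base}/{object_type}/{quote(identifier, safe='')}"


print(resolve("repo1:abc", "repertoire"))
# https://repo1.example.org/airr/v1/repertoire/repo1%3Aabc
```

Supporting another scheme later (for example DIDs) would then mean extending the table lookup, not changing the identifier values stored in the data.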
I am thinking that this issue is probably not going to be resolved in v2.0 (and doesn't need to be resolved in 2.0). Moving this to v2.1. |
@bcorrie In some sense, I think we are making this issue more complicated than it needs to be, at least in the context of the ADC. All we need to do is make these identifiers (in the ADC) be CURIEs. The prefix part points to the global service, i.e. the ADC repository, and the local identifier part can be whatever that is interpreted by the ADC repository. I think that James' presentation of LinkML and his discussion of CURIEs shows that it works quite well for creating globally unique identifiers that can be resolved and be FAIR. AKC is going to need them. The question is do we implement them first in the data integration scripts (ADC --> AKC) as a test then port them back into the ADC, or just put them in the ADC first? |
This came up in side discussions here and here. Creating a separate issue as those other issues are becoming overloaded with multiple topics.
The id fields we are defining in the AIRR Data Model aren't complete digital object identifiers required by FAIR when taken in context of the AIRR Data Commons because they don't indicate where that object is stored, i.e. they are missing the (F)indable attribute. Here's what I believe are the key issues and requirements: