We need a more formal, fully-qualified identifiers for repository objects #347

Open
schristley opened this issue Mar 11, 2020 · 86 comments
Labels
ADC API V2 AIRR Data Commons API V2
Milestone

Comments

@schristley
Member

schristley commented Mar 11, 2020

This came up in side discussions here and here. I'm creating a separate issue because those other issues are becoming overloaded with multiple topics.

The id fields we are defining in the AIRR Data Model aren't complete digital object identifiers required by FAIR when taken in context of the AIRR Data Commons because they don't indicate where that object is stored, i.e. they are missing the (F)indable attribute.

Here's what I believe are the key issues and requirements:

  • There is a key set of identifier fields for linking AIRR objects in the AIRR Data Model
  • There are two primary scopes for AIRR objects: 1) local analysis scope, and 2) ADC
  • We would like to define uniqueness criteria for these identifiers so tools can use data from both scopes without requiring special coding to handle those scopes.
  • For the local analysis scope, tools often aren't concerned with (or aware of) the larger context and might assign identifiers that are only unique in the local scope.
  • We would like the uniqueness criteria for objects in the ADC to be such that 1) there is no conflict in identifiers across different repositories and 2) the identifier can be used to resolve back to the specific object in the data repository.
  • F in FAIR says that (meta)data are assigned a globally unique and persistent identifier.
  • We can specify rules that apply uniformly to both scopes, or we can specify rules specific to each scope.
@schristley schristley added the ADC API V1 AIRR Data Commons API V1 label Mar 11, 2020
@schristley schristley changed the title We need a more formal, fully-qualified DOI for repository objects We need a more formal, fully-qualified DOI for repository AIRR Data Model objects Mar 11, 2020
@schristley schristley changed the title We need a more formal, fully-qualified DOI for repository AIRR Data Model objects We need a more formal, fully-qualified DOI for repository objects Mar 11, 2020
@bcorrie
Contributor

bcorrie commented Mar 11, 2020

Just to clarify, are we talking formal DOI as in: https://www.doi.org/

I think at a minimum an AIRR compliant repository should have a formal DOI.

Beyond that, I am not sure how far down the DOI path we should go... It makes some sense to me that the study data for a specific study in a specific repository could use a DOI, but given that most studies, through their publication, would already have a DOI, this might be overkill. If we added a study_doi field to the metadata (for the publication DOI), that might cover it. If referring to the data in a specific study in the AIRR Data Commons, the combination of the Repository DOI and the Study DOI (findable in the study_doi field) would suffice.

My gut feeling is that going down the DOI path much further than that might be overkill (formal DOI generation requires a DOI provider), but certainly we could, and possibly should, use UUIDs as per https://tools.ietf.org/html/rfc4122.html for internal objects to provide a uniqueness criterion. They are easy to generate, since many languages have libraries that generate them...
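
As a rough illustration of how cheap UUID generation is, here is a minimal Python sketch using the standard-library uuid module. The helper name new_object_id is made up for this example, and the choice of version 4 (random) UUIDs is an assumption; nothing in the AIRR spec mandates it.

```python
import uuid


def new_object_id() -> str:
    """Return a random (version 4) UUID in its canonical string form."""
    return str(uuid.uuid4())


# Hypothetical usage: assign an identifier to a locally created object.
repertoire_id = new_object_id()
print(repertoire_id)
```

A UUID like this gives uniqueness but says nothing about where the object lives, so it is not resolvable on its own.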

I would also note that the study_id field can be considered a unique identifier if the definition in the spec is followed as per Unique ID assigned by study registry assuming the study registries assign non-overlapping IDs.

@schristley
Member Author

Just to clarify, are we talking formal DOI as in: https://www.doi.org/

No, just DOI in the context of the FAIR standard, which doesn't require the doi.org service to be used. The FAIR paper defines DOI this way:

DOI—Digital Object Identifier; a code used to permanently and stably identify (usually digital) objects. DOIs provide a standard mechanism for retrieval of metadata about the object, and generally a means to access the data object itself.

So I think using a URL (https://vdjserver.org/airr/v1/repertoire/abc) to access the data object itself would be sufficient.

I think at a minimum an AIRR compliant repository should have a formal DOI.

I think this might be worthwhile, but we should probably lump this into the discussion about the "registry" which the CRWG hasn't really defined yet...

My gut feeling is that going down the DOI path much further than that might be overkill.

Me too, so this isn't about digging a deeper hole. It's about how we are going to ensure that when you get an AIRR file (off the web, sent in email, supplemental file with an article, etc.), you can get back to the original object in the data repository.

I think of this as a provenance issue, but it is also a practical issue. I may give you a Cell file but say, use the DOIs in that file to download the rearrangements. Right now, the standard _ids aren't sufficient to do an HTTP request and download the data.

@bussec
Member

bussec commented Mar 17, 2020

No, just DOI in the context of the FAIR standard, which doesn't require the doi.org service to be used.

To my understanding there is only one type of DOI and that's the one governed by doi.org. I agree that Wilkinson et al. describe it as if it were a generic term, but IMO it's not. Out of curiosity I just checked on the costs, and at 0.06 USD per DOI it would be feasible to create DOIs at least for study objects (fees can be found at https://www.crossref.org/fees/ ).

The advantage of a DOI vs a UUID is that it is clear how to resolve it. However, I don't know whether the record it resolves to is clearly defined. IMO it would not hurt if the data of a study has a DOI separate from that of the publication located at a publisher's site.

But I agree that we don't want to create DOIs for each single Rearrangement :-)

@bcorrie
Contributor

bcorrie commented Apr 5, 2020

Is this something we need to resolve for ADC API v1?

@bussec
Member

bussec commented Apr 5, 2020

Summarizing a discussion that @schristley, @bcorrie and I had via mail. It will probably not require any direct action; I'm just putting it here for future reference:

The generic term for the feature we are looking for is "persistent identifier" (PID), of which the DOI would be a specific implementation. EOSC has its own sub-working group to address PID usage, which recently published a document on it [DOI:10.5281/zenodo.3574203]. In the document a PID is defined as:

  • globally unique
  • persistent
  • resolvable

The question is whether we really need all these features for all AIRR objects, i.e., how far would we go with PIDs, when would (non-resolvable) UUIDs come in handy, and where do we only need local uniqueness?

The four main levels that a PID could be applied to are:

  • Repository: It was already suggested by @bcorrie that AIRR-compliant repositories should have PIDs. iReceptor Public Archive and VDJServer already have DOIs through fairsharing.org. Whether this is a recommendation (to enhance citability) or a requirement (mandatory ID within a standard) is up for discussion. This will also depend on how (and whether) the subsequent PIDs will be resolved.
  • Study: In most cases there will be a DOI assigned to the related publication, however we need to keep in mind that this refers to a different object class, i.e., scholarly communication instead of the actual data set. This is usually sufficient for a human curator to quickly find the associated data sets, but this might not be true for a computer. BioProject IDs referring to a study can be considered to be PIDs (as the resolver is broadly known), but will refer to a record at INSDC, not in an AIRR repository.
  • Repertoire: While this is also a good candidate for a PID there are a couple of points to consider:
    1. We clearly need PIDs for data sets of a given study. However, whether we need them at the level of Sample or on the level of Repertoire needs further discussion.
    2. Repertoire currently does not have a strict definition (also see Cell and Repertoire definition #361) and Repertoire objects can be generated dynamically during queries. This could lead to inflation of PIDs and it is questionable whether there is any added value to this.
    3. As this is an AIRR-specific object, we need to find ways to mint and administer these PIDs.
  • Rearrangement: Most certainly not, as PIDs usually refer to a data set (e.g., a table), not the individual datum (e.g. a line within the table). Furthermore, it is currently hard to see a use case for this as long as we have the possibility to create arbitrary and non-overlapping sets of rearrangements, i.e., repertoires.

@schristley
Member Author

Is this something we need to resolve for ADC API v1?

Probably not. We mainly need it for the new (experimental) objects like Clone, Cell, etc., so we can resolve it in concert with their release.

@bcorrie bcorrie added ADC API V2 AIRR Data Commons API V2 and removed ADC API V1 AIRR Data Commons API V1 labels Apr 8, 2020
@schristley schristley changed the title We need a more formal, fully-qualified DOI for repository objects We need a more formal, fully-qualified identifiers for repository objects Jun 15, 2021
@schristley
Member Author

I've reviewed the W3C standard for decentralized identifiers, and it looks like it will work quite well for our purposes. I'm considering this standard just for the identifiers in the AIRR Data Model used to reference AIRR objects; external identifiers outside our control are handled with #464.

A decentralized identifier (DID) has a simple syntax consisting of three parts separated by colons:

did:example:identifier

where did is static and defines this as a decentralized identifier, example is called the DID method, and identifier is called the DID method-specific identifier. The DID spec places few limitations on the identifier part; we can even have additional colons in it if we want.
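
To make the three-part syntax concrete, here is a minimal Python sketch that splits a DID into its method and method-specific identifier. The names ParsedDID and parse_did are made up for this example, and the check is deliberately loose: it does not validate the full grammar from the DID spec, only the overall shape.

```python
from typing import NamedTuple


class ParsedDID(NamedTuple):
    method: str       # e.g. "example"
    identifier: str   # method-specific identifier, may contain colons


def parse_did(did: str) -> ParsedDID:
    # Split on the first two colons only, so that any further colons
    # stay inside the method-specific identifier.
    parts = did.split(":", 2)
    if len(parts) != 3 or parts[0] != "did" or not parts[1] or not parts[2]:
        raise ValueError(f"not a valid DID: {did!r}")
    return ParsedDID(method=parts[1], identifier=parts[2])


print(parse_did("did:example:identifier"))
print(parse_did("did:example:a:b:c"))
```

Splitting with a maximum of two splits is what allows additional colons in the identifier part.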

The DID method is the key part. It is somewhat analogous to the first part of a CURIE. It's creating the unique namespace for the identifiers. Also, according to the spec, "a DID method defines how implementers can realize the features described by this specification". We need to define a DID method and SHOULD register it with the DID Registry. So in an odd twist, creating a decentralized identifier suggests registering in a central repository namespace... though it's not mandatory.

Anyways, my suggestion is we define and register the airr DID method. That is, all AIRR DIDs look like this:

did:airr:identifier

The DID spec talks a lot about verification, security, etc., but all of those capabilities are optional. The DID method must implement a number of functions for DID resolution and URL dereferencing, though the spec leaves it almost completely open for how the DID method does that. Conceptually I find this very similar to how we are resolving CURIE identifiers, and I believe we can implement much the same for DIDs.

What's left for us to consider is how to define the DID method-specific identifier. There is no requirement in the DID spec that the TYPE of resource, which the DID references, must be the same. So we could do something simple with just numbers, like this, but this doesn't provide us enough flexibility, as we want identifiers to resolve to different repositories in the ADC.

did:airr:123
did:airr:124
did:airr:567

With the DID method as airr, that provides a global AIRR namespace, and it's up to us if we want to impose additional structure and sub-namespaces to it. My suggestion is that we define an additional repository sub-namespace:

did:airr:repository:identifier

Then DIDs look like this:

did:airr:ipa:123
did:airr:vdjserver:124
did:airr:ogrdb:567

but even this isn't quite complete, because does did:airr:vdjserver:124 refer to a Repertoire, a Rearrangement, or another AIRR object? If we want to do full URL dereferencing, like we are doing with CURIE, we need to know the type to know the proper ADC API end point to hit. Here's where I think we have options, one simple idea is to add another sub-namespace level that defines the type.

did:airr:repository:type:identifier

did:airr:ipa:repertoire:123
did:airr:vdjserver:repertoire:124
did:airr:vdjserver:germline_set:124
did:airr:ogrdb:germline_set:567

But it's equally valid to combine those two sub-namespaces together into one, like so; we have complete control over the format.

did:airr:repository_and_type:identifier

did:airr:ipa_repertoire:123
did:airr:vdjserver_repertoire:124
did:airr:vdjserver_germline_set:124
did:airr:ogrdb_germline_set:567

Currently, I prefer the first option with two namespaces.

Hopefully now we can see how DIDs can be implemented. Like CURIEs, we have a resolution table in the AIRR schema that defines how ipa, vdjserver and ogrdb can be de-referenced into URLs.

did:airr:vdjserver:repertoire:124
==>
https://vdjserver.org/airr/v1/repertoire/124
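
The dereferencing step for the two-namespace format can be sketched in Python as below. This is only a sketch under assumptions: the table name RESOLUTION_TABLE and the function resolve_airr_did are hypothetical, only the vdjserver base URL comes from this thread, and mapping the type directly onto the API path segment is an assumption about how end points would be named.

```python
# Hypothetical resolution table: repository name -> ADC API base URL.
# Only the vdjserver entry is taken from the example above; a real table
# would live in the AIRR schema (or a registry) and list every repository.
RESOLUTION_TABLE = {
    "vdjserver": "https://vdjserver.org/airr/v1",
}


def resolve_airr_did(did: str) -> str:
    """Turn did:airr:<repository>:<type>:<identifier> into a URL."""
    parts = did.split(":", 4)
    if len(parts) != 5 or parts[:2] != ["did", "airr"]:
        raise ValueError(f"not an AIRR DID: {did!r}")
    _, _, repository, object_type, identifier = parts
    if repository not in RESOLUTION_TABLE:
        raise ValueError(f"unknown repository: {repository!r}")
    # Assumption: the object type is also the API path segment.
    return f"{RESOLUTION_TABLE[repository]}/{object_type}/{identifier}"


print(resolve_airr_did("did:airr:vdjserver:repertoire:124"))
# -> https://vdjserver.org/airr/v1/repertoire/124
```

If a repository changes its hostname, only the table entry needs updating; the DIDs stored in files stay the same.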

@bussec
Member

bussec commented Oct 7, 2021

In case we decide against decentralized identifiers, the URN service run by GEANT might be a potential way to coin PIDs without having to run the registry:

https://tools.ietf.org/html/rfc4926
https://wiki.geant.org/display/URN/Registry+Home

@schristley
Member Author

The dual usage/requirements for identifiers that link/reference AIRR objects within the AIRR Data Model continue to bite us. The dual usage being:

  • user-defined (or tool-assigned) identifier values that are locally consistent within files for running analysis tools and such.
  • globally unique identifier values for ADC objects that are FAIR.

For sequence_id, we agreed that the ADC can overwrite the identifier value with their own to provide a PID for the rearrangement record in the repository. This really isn't optimal because there are efforts around analysis reproducibility where we'd like to backtrack to the original sequence in a raw data file, and thus want that original sequence_id. This becomes more problematic with fields like cell_id and clone_id, where annotation tools generate those ids and use them throughout multiple files/records. Overwriting the identifier in the ADC loses any linkage with data stored outside the ADC. Thus, I believe we really need to consider maintaining original identifier values (like we do with subject_id, sample_id) as separate from ADC PIDs.

An easy idea is to separate the fields, i.e. have *_pid fields which hold the ADC PID while the *_id fields hold the original user-defined (tool assigned) value. Yes, it creates more fields but at least the semantics for each are clear and precise. This seems like it can work, except for one key problem. What do tools do?

Today, tools assume that *_id are unique within the local context of data files, but there is a scenario where that breaks down: downloading multiple studies from the ADC, and then combining the multiple study data together. There's no guarantee that a clone_id or *_id from one study doesn't conflict with the *_id values from another study. However, if the tools used the *_pid fields instead, they would get uniqueness. But that complicates tool logic: every time they want to use an identifier, they have to decide whether to use the *_pid or the *_id field.

One could make the argument, like with subject_id and sample_id, that the uniqueness is only guaranteed within the study and/or within the Repertoire, and thus tools need to use compound keys, e.g. study_id, repertoire_id, data_processing_id, clone_id. Unfortunately, this doesn't completely solve the problem as the top-level objects (Repertoire, RepertoireGroup, DataProcessing) can still have conflicts. That is, if a tool assigns repertoire_id for local files, and the ADC assigns repertoire_pid, there's no guarantee repertoire_id is unique across studies, so we are back to the same problem, though maybe now there are fewer fields to consider, i.e. only the top-level AIRR objects.
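
To illustrate the collision and the compound-key workaround, here is a small Python sketch. The two records and all their values are made up; the only thing taken from the discussion is the set of key fields (study_id, repertoire_id, data_processing_id, clone_id).

```python
# Two hypothetical studies whose tools both numbered clones starting at 1.
records = [
    {"study_id": "study_A", "repertoire_id": "rep1",
     "data_processing_id": "dp1", "clone_id": "1"},
    {"study_id": "study_B", "repertoire_id": "rep1",
     "data_processing_id": "dp1", "clone_id": "1"},
]


def compound_key(record: dict) -> tuple:
    """Build the compound key suggested above from a record."""
    return (record["study_id"], record["repertoire_id"],
            record["data_processing_id"], record["clone_id"])


# Indexing by clone_id alone silently merges the two clones ...
by_clone_id = {r["clone_id"]: r for r in records}
# ... while the compound key keeps them apart.
by_compound_key = {compound_key(r): r for r in records}

print(len(by_clone_id), len(by_compound_key))  # prints "1 2"
```

Note that the key only works here because study_id differs; the tool-assigned repertoire_id values are identical across the two studies, which is exactly the top-level conflict described above.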

Another idea is to not have *_pid fields, but instead when the data is loaded into the ADC, the *_id fields are assigned the PID, and the original user-defined value is put in a *_original_id field. So it's still the idea of having separate fields. In this case though, tools can continue to assume *_id are unique, and the problem of combining data from multiple studies from the ADC is solved. The exception is if tools want to link ADC data with external data; they will have to know to use the *_original_id fields.

I don't see a solution that doesn't require having separate fields if we want to store both the original identifier value and the ADC PID. Any other ideas?

To summarize:

  • Have separate *_id and *_pid fields. Tools will need logic to use one or the other to do lookups.
  • Have separate *_id and *_pid fields, but rely upon the compound nature of the *_id to define scope, e.g. clone_id is only unique within a repertoire, study and data processing. Tools would need to use that compound nature when doing lookups. There would still be potential conflict for top-level AIRR objects, so tools will need logic to use *_id or *_pid for them.
  • Have separate *_id and *_original_id fields. ADC can overwrite *_id with PID and puts the original value in *_original_id. Tools can assume *_id is unique and links ADC objects, when linking to external non-ADC data, the *_original_id fields need to be used.
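
For the third option, the load step might look roughly like the Python sketch below. Everything here is an assumption for illustration: the function name load_into_adc, the list of fields handled, and the "repository:uuid" shape of the assigned PID are not defined anywhere in the AIRR schema.

```python
import uuid

# Hypothetical subset of identifier fields that the repository rewrites.
ID_FIELDS = ("cell_id", "clone_id")


def load_into_adc(record: dict, repository: str) -> dict:
    """Return a copy of record with *_id replaced by a repository PID
    and the tool-assigned value preserved in *_original_id."""
    loaded = dict(record)
    for field in ID_FIELDS:
        if field in loaded:
            original_field = field[: -len("_id")] + "_original_id"
            loaded[original_field] = loaded[field]
            # Assumed PID shape: "<repository>:<random UUID>".
            loaded[field] = f"{repository}:{uuid.uuid4()}"
    return loaded


local = {"cell_id": "TACGGATGTACACCGC-1", "clone_id": "7"}
stored = load_into_adc(local, "vdjserver")
print(stored["cell_original_id"], stored["cell_id"])
```

Tools that only read *_id keep working unchanged; only code that links back to files outside the ADC needs to know about *_original_id.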

@bussec
Member

bussec commented Jan 24, 2022

My 2 cents on this:

  • I agree with the general idea of having separate fields
  • To maintain the original value is an important but less frequent use case. In addition, using the original identifiers for linkage to non-ADC data might be subject to further ambiguity (see below). Therefore I would give preference to an *_id/*_original_id solution.
  • Compound IDs that rely on understanding the structure of the AIRR Schema seem complex and potentially error-prone to me.
  • Are we sure that there will always be only a single original ID to store?

@javh
Contributor

javh commented Jan 24, 2022

*_id and *_original_id makes the most sense to me as well. But, this seems like a pretty specific use case that might be a job for a custom field.

Can I throw out a 4th option? What about some sort of provenance object to store these relationships? It'd be essentially the same thing as *_original_id, but stored in a separate table. It would also lend itself to lists of original identifiers, links to DataProcessing for how the change was made, etc.
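
A provenance record along these lines could be sketched as follows. This is purely hypothetical: the class name IdentifierProvenance and all of its field names are invented for this example and are not part of any AIRR schema draft.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IdentifierProvenance:
    """One row of a hypothetical provenance table for a rewritten identifier."""
    field_name: str                       # which *_id field was rewritten
    current_id: str                       # value now stored in the repository
    original_ids: List[str] = field(default_factory=list)  # zero or more source values
    data_processing_id: Optional[str] = None  # how the change was made, if known


entry = IdentifierProvenance(
    field_name="cell_id",
    current_id="vdjserver:124",
    original_ids=["TACGGATGTACACCGC-1"],
    data_processing_id="dp1",
)
print(entry)
```

Keeping original_ids as a list covers the case where there is more than one original identifier to store.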

@bcorrie
Contributor

bcorrie commented Jan 25, 2022

The problem with the _original_id concept is that the field names in the source files will be of the _id form. 10X produces data with clone_id and cell_id in their files (same with Immcantation etc., no?), and these naturally map to the field names in the spec. It seems more natural to me to maintain those fields as they are in the original data from the annotation tool and to have a new, specific field that more explicitly states what it is. For example, if it is truly a persistent ID as per the FAIR PID definition, then maybe it should be _pid, but if it is "just" a CURIE that turns the ID in the repository into something that is globally unique, maybe it should be a _gid for global identifier?

@bcorrie
Contributor

bcorrie commented Jan 25, 2022

I think the question of having globally unique identifiers for objects in ADC repositories and managing provenance and how such globally unique objects are related to each other are two different topics, no?

@bcorrie
Contributor

bcorrie commented Jan 25, 2022

did:airr:repository:type:identifier
did:airr:vdjserver:repertoire:124
==>
https://vdjserver.org/airr/v1/repertoire/124

BTW, I like this structure because the vdjserver part maps directly to the servers objects in the OpenAPI 3.0 spec. In that way, the DID structure and the OpenAPI server and path objects map nicely.

@schristley
Member Author

I'm seeing two potential solutions:

  1. _id fields satisfy both global uniqueness and persistence. This implies some CURIE-like value that provides both properties.
  2. separate global uniqueness and persistence. The _id fields have global uniqueness, and separate _ref fields contain a persistent reference.

We've mainly been considering 1 but GermlineSet uses 2 in its draft. Here are pros/cons that I can think of:

  • PRO: 1 has fewer fields.
  • PRO: the CURIE-like structure of 1 almost guarantees global uniqueness.
  • CON: 1 requires a resolver, to interpret the value and translate into a URL. CURIE prefixes need to be stored in Schema; we would need to update whenever a new data repository (new prefix) is added to ADC. We might get around this by having an ADC registry.
  • PRO: 2 could use a resolver but it could simply be the direct URL, e.g., https://vdjserver.org/airr/v1/repertoire/1159043104164212245-242ac114-0001-012
  • CON: If 2 uses a CURIE-like resolver, it seems redundant; might as well just use 1.
  • PRO: Using a resolver allows flexibility in the data repository, i.e. hostnames can change, resolvers can be updated with new features, etc.
  • CON: Having a fixed URL in 2 provides less flexibility, in order to be persistent that host/API must always be available.
  • CON: 1 requires re-assigning the identifier values in the ADC, for example, a repertoire_id might be vdjserver:123. This could be a heavy burden on data repositories as they might need to update all the records in the database (they could also do some translation on the fields during input/output).
  • CON: For 1, tools have no way to assign the persistent value, so the ADC would always need to overwrite the _id values.
  • PRO: For 2, tools that assign UUIDs, those UUIDs could potentially be kept when loading into the ADC.
  • CON: For 2, tools that don't assign UUIDs, the ADC would need to overwrite the _id values.
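
The "translation on the fields during input/output" mentioned in the CON for 1 could be as small as the Python sketch below, so that stored records keep their bare identifiers. The function names and the hard-coded PREFIX are assumptions for illustration; the vdjserver:123 shape is the one used in the list above.

```python
# Assumed CURIE-like prefix for this repository.
PREFIX = "vdjserver"


def to_external(local_id: str) -> str:
    """Add the repository prefix when sending an identifier out via the API."""
    return f"{PREFIX}:{local_id}"


def to_internal(external_id: str) -> str:
    """Strip the repository prefix from an identifier received in a query."""
    prefix, sep, local_id = external_id.partition(":")
    if sep != ":" or prefix != PREFIX or not local_id:
        raise ValueError(f"not an identifier of this repository: {external_id!r}")
    return local_id


print(to_external("123"))            # vdjserver:123
print(to_internal("vdjserver:123"))  # 123
```

With this approach the database itself never needs to be rewritten; the cost moves to every API input and output path instead.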

Any other pros/cons?

Regardless of 1 or 2, the ADC needs the ability to overwrite any local values assigned by tools when data is loaded into the ADC.

IMO, I'm leaning toward 1 at the moment. The main CON is it requires re-assigning identifier values in the ADC, but I think the flexibility of a CURIE-like resolver is a significant PRO.

@scharch
Contributor

scharch commented Feb 22, 2022

Groundhog Day thread (#340)

😱

@scharch
Contributor

scharch commented Feb 22, 2022

My argument is that since the cell_id is a field that is produced by many pipelines that process Cell data, maybe it isn't a great idea for us to use that field name as the field that contains a PID for a cell (which is what the ADC requires). We absolutely need a PID field, but I think it is a mistake to throw the tool generated linking field across these files away!

But we already do this for sequence_id, and I can't see how cell_id (or clone_id or data_processing_id or...) is any different. I do understand the desire to be able to trace data back to its source, but the nature of the schema already limits this: sequence_ids can't really be traced back to raw fastqs without a lot of work to re-execute the DataProcessing, and even that assumes that the DataProcessing is actually complete/fully specified and the link out to SRA/etc is stable and correct. And in some cases (especially Tree generation), even a complete DataProcessing may not be deterministic...

In any case, what you're describing seems to be a "backend" ADC feature/use, so I don't think it should complicate end user-facing *_id fields. We've talked in the past about ADC-specific extensions to the schema, and a Provenance object seems like a good fit for that category...

@bcorrie
Contributor

bcorrie commented Feb 22, 2022

@bcorrie This thought is completely bizarre and bewildering to me.

It also seems bizarre and bewildering to me that we are so adamant that we throw this information away! Why is there such a reluctance to having an extra field that captures this info as part of the standard? There is a very strong data curation use case to keep it, so I am also bewildered... 8-) The standard isn't just about analysis, but data reusability and data curation.

@bcorrie
Contributor

bcorrie commented Feb 22, 2022

But we already do this for sequence_id, and I can't see how cell_id (or clone_id or data_processing_id or...) is any different.

Yep, and I argued strongly against that one too - but caved in because it was only one field...

@scharch
Contributor

scharch commented Feb 22, 2022

It also seems bizarre and bewildering to me that we are so adamant that we throw this information away! Why is there such a reluctance to having an extra field that captures this info as part of the standard? There is a very strong data curation use case to keep it, so I am also bewildered... 8-) The standard isn't just about analysis, but data reusability and data curation.

Because there is no "information" there that is being discarded! And trying to preserve the original value of the field by adding a new field pollutes the schema without adding any analysis benefit in the ways that @javh and I have been arguing through (apparently) two entire threads now :-)

@bcorrie
Contributor

bcorrie commented Feb 22, 2022

In any case, what you're describing seems to be a "backend" ADC feature/use, so I don't think it should complicate end user-facing *_id fields.

I don't agree - throwing away information that an annotation tool provides has nothing to do with the ADC. This is 100% a curation process issue.

@scharch
Contributor

scharch commented Feb 22, 2022

So if I ever want to go back and try to understand something about the data in my repository by looking at the original data, I can. That is only possible if the tool produced cell_id is stored in the repository and can be referenced in the original data.

This implies that you are also storing the entire dataset in its original format somewhere accessible-but-outside-of-the-ADC?!? But isn't the point of the ADC to be the copy of record so that the original becomes irrelevant? Do you really have 2 copies of everything in iReceptor?

@scharch
Contributor

scharch commented Feb 22, 2022

In any case, what you're describing seems to be a "backend" ADC feature/use, so I don't think it should complicate end user-facing *_id fields.

I don't agree - throwing away information that an annotation tool provides has nothing to do with the ADC. This is 100% a curation process issue.

It's not "information." Metadata, perhaps. And if curation isn't part of the ADC, then who are we doing this for? It's not part of the end-user data reuse process...

@javh
Contributor

javh commented Feb 22, 2022

The _id fields have global uniqueness, and separate _ref fields contain a persistent reference.

I think having _ref fields is fine, but I don't see them as a solution here. If a _ref is a foreign key / citation when uploaded, there's no guarantee that it's going to remain a valid reference in the future and it can't be trusted as a linking identifier in the ADC. If I'm understanding the _ref field correctly, then it's really just a more formal comment string.

@schristley
Member Author

schristley commented Feb 22, 2022

The _id fields have global uniqueness, and separate _ref fields contain a persistent reference.

I think having _ref fields is fine, but I don't see them as a solution here. If a _ref is a foreign key / citation when uploaded, there's no guarantee that it's going to remain a valid reference in the future and it can't be trusted as a linking identifier in the ADC. If I'm understanding the _ref field correctly, then it's really just a more formal comment string.

Call them _pid fields if it helps; they should contain whatever is necessary for persistent access to the object. The point being that the persistence attribute is separated from the global uniqueness attribute. Regardless, I don't think they need to be separated, but I offered it as an alternative solution in case somebody thought of some PROs.

javh edit: Sorry @schristley, I accidentally edited this instead of quoting (I don't know how). Should be restored now.

@bcorrie
Contributor

bcorrie commented Feb 22, 2022

So if I ever want to go back and try to understand something about the data in my repository by looking at the original data, I can. That is only possible if the tool produced cell_id is stored in the repository and can be referenced in the original data.

This implies that you are also storing the entire dataset in its original format somewhere accessible-but-outside-of-the-ADC?!? But isn't the point of the ADC to be the copy of record so that the original becomes irrelevant? Do you really have 2 copies of everything in iReceptor?

Nope, but we want to support reproducibility wherever we can... So no data in the pipeline is ever really irrelevant.

The point of the ADC is data sharing, data reuse, and reproducibility. I would argue that is also the point of the AIRR Standard as well. The AIRR Standard points to source records of information throughout. SRA files (RawSequenceData object), INSDC Bioproject information (study_id), the files in which the annotated data came from (data_processing_files), etc. These are critical for reproducibility. We don't store everything, but we try to make it possible to reproduce everything...

Curation is part of this entire process - it is not specific to the ADC. If you describe a study using the AIRR Standard, you are curating data according to the AIRR Standard.

If you want to be truly reproducible, at any point in the processing pipeline, I should be able to use the AIRR Standard to go from one processing step to another processing step, and be able to reproduce where a piece of data came from.

Here is my curator use case. I don't need to be using the ADC for this, this could be using studies curated for analysis and stored completely on disk using the AIRR format files for repertoire, rearrangement, cell, clone, etc.

As a data curator, if I want to confirm that data in my AIRR files (or my ADC repository) is correct, I SHOULD be able to go back to my source files and confirm this is indeed the case. When I lost sequence_id, I lost the ability to do that for the original fastq files - damn, but hey it is only the sequence that we are talking about, and we have millions 8-)

But now we are talking about cells, which have complicated linkages across rearrangements, clones, cells, and gex data. In the case of annotation tools, these linkages are across many files. So when I process some 10X studies (N samples from one study and M samples from another study) generating AIRR compliant files in preparation for analysis, I replace the source 10X cell_id with a unique AIRR cell_id to make sure cell_id is unique across my analysis of interest.

Now I want to confirm that the data I just processed for a certain 10X cell_id (TACGGATGTACACCGC-1) from a single subject in my source data is correct across the data I am going to use for my analysis. I can't...

Similarly, if I want to look at an AIRR unique cell_id in my processed data and then find the source information in the original 10X produced data files. Again, I can't...

So we have broken the link between the data in the AIRR compliant files to the original source data - data/"information" can no longer be mapped between the two...
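
A minimal Python sketch of what keeping the link would buy: if the reassigned cell_id is stored next to the tool-produced barcode, lookups work in both directions. The field name cell_id_annotation_tool, the "cell-N" numbering scheme and the second barcode are made up for illustration; only TACGGATGTACACCGC-1 comes from the example above.

```python
# Barcodes as produced by the annotation tool for one sample.
# The second barcode is invented for this example.
source_barcodes = ["TACGGATGTACACCGC-1", "AAACCTGAGAAACCAT-1"]

# Processing step: assign a new, analysis-wide unique cell_id but keep
# the original barcode in a separate (hypothetical) field.
processed = [
    {"cell_id": f"cell-{i}", "cell_id_annotation_tool": barcode}
    for i, barcode in enumerate(source_barcodes, start=1)
]

# With both values stored, the mapping can be followed in either direction.
airr_id_by_barcode = {row["cell_id_annotation_tool"]: row["cell_id"] for row in processed}
barcode_by_airr_id = {row["cell_id"]: row["cell_id_annotation_tool"] for row in processed}

print(airr_id_by_barcode["TACGGATGTACACCGC-1"])  # cell-1
print(barcode_by_airr_id["cell-2"])              # AAACCTGAGAAACCAT-1
```

In practice a barcode is only meaningful together with the sample or file it came from, so a real lookup would likely need that as part of the key.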

Now if you truly trust the tools that do all of that processing, then maybe you don't want to do any provenance or reproducibility checks... But that is not how I would do things 8-)

Here is an example of what you get from a repository with our current implementation. If I maintain the annotation tool cell_id in some form, I can cross check the validity of the data I loaded with the original 10X files. If I don't, I can't... If you are a data steward maintaining an ADC repository, this is an important step...

Basically I want to be able to ensure that cell_id_annotation_tool = TACGGATGTACACCGC-1 links the correct data in the original 10X files (ERS1-TRA.tsv, ERS1-vdj_t_gex.json, ERS1-vdj_t-cells.json) that I as the data curator have maintained...

 curl -d '{"fields":["study.study_id","sample.sample_id", "sample.sequencing_files.filename", "data_processing.data_processing_files"]}' http://single-cell.ireceptor.org/airr/v1/repertoire

[Some stuff deleted/edited]

{
    "study": {
        "study_id": "PRJCA002413"
    },
    "sample": [
        {
            "sample_id": "ERS1",
            "sequencing_files": {
                "filename": "CRR126571_f1.fastq.gz, CRR126572_f1.fastq.gz, CRR126573_f1.fastq.gz, CRR126574_f1.fastq.gz"
            }
        }
    ],
    "data_processing": [
        {
            "data_processing_files": [
                "ERS1-TRA.tsv"
            ]
        }
    ]
},
{
    "study": {
        "study_id": "PRJCA002413"
    },
    "sample": [
        {
            "sample_id": "ERS1",
            "sequencing_files": {
                "filename": "CRR126563_f1.fastq.gz, CRR126564_f1.fastq.gz, CRR126565_f1.fastq.gz, CRR126566_f1.fastq.gz"
            }
        }
    ],
    "data_processing": [
        {
            "data_processing_files": [
                "ERS1-vdj_b_gex.json",
                "ERS1-vdj_b-cells.json",
                "ERS1-vdj_t_gex.json",
                "ERS1-vdj_t-cells.json"
            ]
        }
    ]
}
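As a rough sketch of how a curator might use a response like this, the Python below gathers the processed files listed for each sample, which gives the set of source files to check a given cell_id against. It assumes the repertoire objects have already been pulled out of the response into a list; the records are trimmed copies of the two shown above, and files_by_sample is a hypothetical helper name.

```python
def files_by_sample(repertoires):
    """Map each sample_id to the processed files listed for it."""
    files = {}
    for rep in repertoires:
        processed = []
        for step in rep.get("data_processing", []):
            processed.extend(step.get("data_processing_files", []))
        for sample in rep.get("sample", []):
            files.setdefault(sample["sample_id"], []).extend(processed)
    return files


# Trimmed copies of the two repertoire objects in the response above.
repertoires = [
    {
        "sample": [{"sample_id": "ERS1"}],
        "data_processing": [{"data_processing_files": ["ERS1-TRA.tsv"]}],
    },
    {
        "sample": [{"sample_id": "ERS1"}],
        "data_processing": [
            {
                "data_processing_files": [
                    "ERS1-vdj_b_gex.json",
                    "ERS1-vdj_b-cells.json",
                    "ERS1-vdj_t_gex.json",
                    "ERS1-vdj_t-cells.json",
                ]
            }
        ],
    },
]

lookup = files_by_sample(repertoires)
for name in lookup["ERS1"]:
    print(name)
```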
 

@scharch
Contributor

scharch commented Feb 22, 2022

@bcorrie I am happy to stipulate to the importance of being able to trace the provenance of a piece of data. But I am going to respond to the rest in the new Provenance object thread (#589) so that we don't crush all of @javh's hopes and dreams...

@schristley
Member Author

In any case, what you're describing seems to be a "backend" ADC feature/use, so I don't think it should complicate end user-facing *_id fields.

I don't agree - throwing away information that an annotation tool provides has nothing to do with the ADC. This is 100% a curation process issue.

@bcorrie We don't have to keep going round and round this in this issue. I brought up the issue initially, and I was happy with doing a custom solution, but you'd like something more formal, which is fine. That's been recognized with #589 and we can discuss solutions there. Let's get this issue back onto its main track of FAIR for ADC objects.

@javh
Contributor

javh commented Feb 22, 2022

@schristley

Call them _pid fields if it helps; they should contain whatever is necessary for persistent access to the object.

I don't think it helps. At least, not as I'm interpreting it. The _ref being foreign is the rub. Which, I think is fine as metadata, but won't work as an ID in the ADC because you can't update the foreign record (eg, to fix v_call, remove sequencing adapters, or whatever).

I guess the question is whether that's a problem.

@schristley
Member Author

@schristley

Call them _pid fields if it helps; they should contain whatever is necessary for persistent access to the object.

I don't think it helps. At least, not as I'm interpreting it. The _ref being foreign is the rub. Which, I think is fine as metadata, but won't work as an ID in the ADC because you can't update the foreign record (eg, to fix v_call, remove sequencing adapters, or whatever).

I'm not sure what you mean by "foreign". If you are thinking "foreign key", that's not what is meant. I also don't understand how "update the foreign record" matters. This is persistent access to a read-only object.

According to FAIR, (meta)data are assigned a globally unique and persistent identifier. There isn't the requirement that these two attributes are satisfied by a single field. For example, IEDB splits them into two fields, one which is the identifier (which doesn't look globally unique but is because IEDB is a central database), and another which is the IRI for persistence.

Reference ID | Reference IRI | Epitope ID | Epitope IRI
-- | -- | -- | --
1004580 | http://www.iedb.org/reference/1004580 | 16878 | http://www.iedb.org/epitope/16878
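The two-field arrangement can be sketched in a few lines of Python. The URL pattern here is inferred only from the two example rows in the table, not from any documented IEDB rule, and iedb_record and its field names are hypothetical.

```python
# Base taken from the example IRIs in the table; treat it as an assumption.
IEDB_BASE = "http://www.iedb.org"


def iedb_record(kind, local_id):
    """Return the two-field form: a local identifier plus its IRI."""
    return {
        f"{kind}_id": str(local_id),  # unique only within the central database
        f"{kind}_iri": f"{IEDB_BASE}/{kind}/{local_id}",  # persistent and global
    }


reference = iedb_record("reference", 1004580)
epitope = iedb_record("epitope", 16878)
print(reference["reference_iri"])
print(epitope["epitope_iri"])
```

The identifier on its own is only meaningful to someone who already knows it belongs to IEDB; the IRI carries that context with it.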

@javh
Contributor

javh commented Feb 22, 2022

@schristley, Ah, I see... maybe. I'm getting my signals crossed here. I was thinking of _ref as described in the Germline schema and discussed in the last call. Which is, for example, the GenBank accession providing evidence for a novel allele, so, yes, a foreign key.

The _ref you're describing is the _pid field we've been discussing in this thread, except that it is not being used as the ADC linking identifier. Correct?

@schristley
Member Author

@schristley, Ah, I see... maybe. I'm getting my signals crossed here. I was thinking of _ref as described in the Germline schema and discussed in the last call. Which is, for example, the GenBank accession providing evidence for a novel allele, so, yes, a foreign key.

The _ref you're describing is the _pid field we've been discussing in this thread, except that it is not being used as the ADC identifier. Correct?

Right. Sorry, I was mentioning _ref in terms of germline_set_ref which is essentially a persistent IRI that is separate from the identifier germline_set_id, and not the references to foreign records.

IMO, germline_set_ref satisfies both the global uniqueness and persistence, so there really isn't a need for two fields...

  • CON: If 2 uses a CURIE-like resolver, it seems redundant; might as well just use 1.

@schristley
Member Author

I just thought of another major CON for doing 2 instead of 1.

  • CON: With 2, all references to an object must include both fields because the _id isn't sufficient to resolve the object.

For example, say I had a rearrangement record that references a clone_id, but the Clone data is not provided as part of the data set. The clone_id is insufficient to get the clone data, I would also need clone_pid (or clone_ref) so that I could resolve and download the object. This implies that in all of the AIRR objects, we would need both fields, creating a lot of additional fields to be maintained.

@schristley
Member Author

Thinking about the actual content of the identifier, if we go with a CURIE-like structure, where we need a resolver, we can support decentralized identifiers later on, if we want. It would just involve extending the resolver code. We can support both and repositories can pick the one they want to implement.

The other thing is whether a type is needed as part of the identifier:

repository:type:code

vdjserver:repertoire:124
vdjserver:germline_set:145
vdjserver:clone:567

But this maybe isn't needed? The reason is the field, repertoire_id, germline_set_id, clone_id, etc., is essentially defining the type. If the identifier is in repertoire_id then we know it's a repertoire, if it is in clone_id then we know it's a clone, and so on. In the AIRR schema, we don't mix and match identifier types in the same field, nor do we have generic fields. Though this means resolving requires knowing the context (field) of the identifier: if somebody just gave you a value vdjserver:124, it couldn't be resolved properly by itself. Maybe this goes against the identifier being "self-contained"?
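A minimal Python sketch of what field-dependent resolution might look like follows. Everything in it is an assumption for illustration: the REPOSITORIES registry, the FIELD_ENDPOINTS table (the clone endpoint in particular is hypothetical), and the resolve function are not part of any existing AIRR or ADC code.

```python
# Hypothetical registry: identifier prefix -> ADC API base URL.
REPOSITORIES = {"vdjserver": "https://vdjserver.org/airr/v1"}

# The field holding the identifier implies the object type.
# The "clone" endpoint is a placeholder, not a confirmed ADC route.
FIELD_ENDPOINTS = {
    "repertoire_id": "repertoire",
    "clone_id": "clone",
}


def resolve(field, identifier):
    """Turn a prefix:code identifier into a URL, using the field as the type."""
    prefix, sep, local = identifier.partition(":")
    if not sep or not prefix or not local:
        raise ValueError(f"not a prefix:code identifier: {identifier!r}")
    base = REPOSITORIES[prefix]        # KeyError if the repository is unknown
    endpoint = FIELD_ENDPOINTS[field]  # KeyError if the field is not an id field
    # The full identifier, prefix included, is passed through unchanged.
    return f"{base}/{endpoint}/{identifier}"


print(resolve("repertoire_id", "vdjserver:124"))
print(resolve("clone_id", "vdjserver:124"))
```

The same value resolves to two different objects depending on the field, which is exactly why vdjserver:124 cannot be resolved without that context.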

Another point is that the complete value is the identifier value, so an ADC API call for that specific repertoire_id would be

https://vdjserver.org/airr/v1/repertoire/vdjserver:124

Likewise, when sending a POST query

{
    "filters":{
                "op":"in",
                "content": {
                    "field":"repertoire_id",
                    "value":[
                        "vdjserver:2366080924918616551-242ac11c-0001-012",
                        "vdjserver:2541616238306136551-242ac11c-0001-012",
                        "vdjserver:1993707260355416551-242ac11c-0001-012",
                        "vdjserver:1841923116114776551-242ac11c-0001-012"
                    ]
                }
    }
}

If this wasn't the case, that is, if just the trailing code (or number) was the identifier, users would have to constantly parse the value to pull out the appropriate bits.

This also means that our CURIE-like resolver cannot manipulate the identifier in any way, which is done for some ontology fields. If the identifier values change, for queries, for data returned from the ADC, etc., then it fails at being an identifier and objects cannot be linked.
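To illustrate the point about using the complete value, here is a small Python sketch that builds the POST body shown above from a list of identifiers, passing each one through untouched. The function name repertoire_id_filter is hypothetical; the filter structure is copied from the example query.

```python
import json


def repertoire_id_filter(repertoire_ids):
    """Build an "in" filter on repertoire_id without altering the identifiers."""
    return {
        "filters": {
            "op": "in",
            "content": {
                "field": "repertoire_id",
                "value": list(repertoire_ids),
            },
        }
    }


ids = [
    "vdjserver:2366080924918616551-242ac11c-0001-012",
    "vdjserver:2541616238306136551-242ac11c-0001-012",
]
body = json.dumps(repertoire_id_filter(ids), indent=4)
print(body)

# Because nothing is parsed or stripped, a repertoire_id read back from a
# response can be compared directly against the values that were sent.
sent = set(ids)
print("vdjserver:2366080924918616551-242ac11c-0001-012" in sent)
```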

@bcorrie
Contributor

bcorrie commented Feb 6, 2024

I am thinking that this issue is probably not going to be resolved in v2.0 (and doesn't need to be resolved in 2.0). Moving this to v2.1.

@bcorrie bcorrie modified the milestones: AIRR 2.0, AIRR 2.1 Feb 6, 2024
@schristley
Member Author

@bcorrie In some sense, I think we are making this issue more complicated than it needs to be, at least in the context of the ADC. All we need to do is make these identifiers (in the ADC) CURIEs. The prefix part points to the global service, i.e. the ADC repository, and the local identifier part can be whatever the ADC repository chooses, since only that repository has to interpret it. I think that James' presentation of LinkML and his discussion of CURIEs show that this works quite well for creating globally unique identifiers that can be resolved and be FAIR.

AKC is going to need them. The question is do we implement them first in the data integration scripts (ADC --> AKC) as a test then port them back into the ADC, or just put them in the ADC first?
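The uniqueness argument for CURIEs can be sketched briefly in Python. The local identifiers and the second prefix (ireceptor) below are made up for illustration, and to_curie and split_curie are hypothetical helpers; the only rule assumed is that the prefix itself contains no colon, so the local part stays opaque.

```python
def to_curie(prefix, local_id):
    """Join a repository prefix and an opaque local identifier."""
    if not prefix or ":" in prefix:
        raise ValueError(f"invalid prefix: {prefix!r}")
    return f"{prefix}:{local_id}"


def split_curie(curie):
    """Split on the first colon only; the local part is left untouched."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        raise ValueError(f"missing prefix: {curie!r}")
    return prefix, local_id


# Made-up local identifiers: two repositories both use "124".
local_ids = {"vdjserver": ["124", "125"], "ireceptor": ["124"]}

bare = {i for ids in local_ids.values() for i in ids}
curies = {to_curie(repo, i) for repo, ids in local_ids.items() for i in ids}

print(len(bare))    # bare local identifiers collide across repositories
print(len(curies))  # prefixed identifiers stay distinct
```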
