We need more formal, fully-qualified identifiers for repository objects #347

schristley · 2020-03-11T18:19:33Z

This came up in side discussions here and here. I'm creating a separate issue because those other issues are becoming overloaded with multiple topics.

The id fields we are defining in the AIRR Data Model aren't complete digital object identifiers required by FAIR when taken in context of the AIRR Data Commons because they don't indicate where that object is stored, i.e. they are missing the (F)indable attribute.

Here's what I believe are the key issues and requirements:

There is a key set of identifier fields for linking AIRR objects in the AIRR Data Model
There are two primary scopes for AIRR objects: 1) local analysis scope, and 2) ADC
We would like to define uniqueness criteria for these identifiers so tools can use data from both scopes without requiring special coding to handle those scopes.
For the local analysis scope, tools often aren't concerned with (or aware of) the larger context and might assign identifiers that are only unique in the local scope.
We would like the uniqueness criteria for objects in the ADC to be such that 1) there is no conflict in identifiers across different repositories and 2) the identifier can be used to resolve back to the specific object in the data repository.
F in FAIR says that (meta)data are assigned a globally unique and persistent identifier.
We can specify rules that apply uniformly to both scopes, or we can specify rules specific to each scope.

bcorrie · 2020-03-11T21:40:35Z

Just to clarify, are we talking formal DOI as in: https://www.doi.org/

I think at a minimum an AIRR compliant repository should have a formal DOI.

Beyond that, I am not sure how far down the DOI path we should go... It makes some sense to me that the study data for a specific study in a specific repository could use a DOI, but given that most studies, through their publication, would already have a DOI, this might be overkill. If we added a study_doi field to the metadata (for the publication DOI), that might cover it. If referring to the data in a specific study in the AIRR Data Commons, the combination of the Repository DOI and the Study DOI (findable in the study_doi field) would suffice.

My gut feeling is that going down the DOI path much further than that might be overkill (formal DOI generation requires a DOI provider), but certainly we could and possibly should use UUIDs as per https://tools.ietf.org/html/rfc4122.html for internal objects to provide a uniqueness criterion. They are easy to generate, in that many languages have libraries that generate them...
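As a minimal sketch of how cheap UUID generation is, the following uses Python's standard `uuid` module. The choice of Python and the helper name `new_object_id` are illustrative assumptions, not anything the AIRR spec defines.

```python
import uuid


def new_object_id() -> str:
    """Return a random (version 4) UUID, as described in RFC 4122, as a string."""
    return str(uuid.uuid4())


if __name__ == "__main__":
    # Each call yields a fresh identifier; a collision is extremely unlikely.
    print(new_object_id())
```

Note that such a value only addresses uniqueness; on its own it says nothing about where the object is stored.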

I would also note that the study_id field can be considered a unique identifier if the definition in the spec ("Unique ID assigned by study registry") is followed, assuming the study registries assign non-overlapping IDs.

schristley · 2020-03-11T22:05:27Z

Just to clarify, are we talking formal DOI as in: https://www.doi.org/

No, just DOI in the context of the FAIR standard, which doesn't require the doi.org service to be used. The FAIR paper defines DOI this way:

DOI—Digital Object Identifier; a code used to permanently and stably identify (usually digital) objects. DOIs provide a standard mechanism for retrieval of metadata about the object, and generally a means to access the data object itself.

So I think using a URL (https://vdjserver.org/airr/v1/repertoire/abc) to access the data object itself would be sufficient.

I think at a minimum an AIRR compliant repository should have a formal DOI.

I think this might be worthwhile, but we should probably lump this into the discussion about the "registry" which the CRWG hasn't really defined yet...

My gut feeling is that going down the DOI path much further than that might be overkill.

Me too, so this isn't about digging a deeper hole. It's about how we are going to ensure that when you get an AIRR file (off the web, sent in email, supplemental file with an article, etc.), you can get back to the original object in the data repository.

I think of this as a provenance issue, but it is also a practical issue. I may give you a Cell file but say, use the DOIs in that file to download the rearrangements. Right now, the standard _ids aren't sufficient to do an HTTP request and download the data.

bussec · 2020-03-17T01:56:06Z

No, just DOI in the context of the FAIR standard, which doesn't require the doi.org service to be used.

To my understanding there is only one type of DOI and that's the one governed by doi.org. I agree that Wilkinson et al. describe it as if it were a generic term, but IMO it's not. Out of curiosity I just checked on the costs, and at 0.06 USD per DOI it would be feasible to create DOIs at least for study objects (fees can be found at https://www.crossref.org/fees/ ).

The advantage of a DOI vs a UUID is that it is clear how to resolve it. However, I don't know whether the record it resolves to is clearly defined. IMO it would not hurt if the data of a study has a separate DOI than the publication located at a publisher's site.

But I agree that we don't want to create DOIs for each single Rearrangement :-)

bcorrie · 2020-04-05T19:49:30Z

Is this something we need to resolve for ADC API v1?

bussec · 2020-04-05T23:02:20Z

Summarizing a discussion that @schristley, @bcorrie and I had via mail. Will probably not require any direct action, just putting it here for future reference:

The generic term for the feature we are looking for is "persistent identifier" (PID), of which the DOI would be a specific implementation. EOSC has its own sub-working group to address PID usage, which recently published a document on it [DOI:10.5281/zenodo.3574203]. In the document a PID is defined as:

globally unique
persistent
resolvable

The question is whether we really need all these features for all AIRR objects, i.e., how far would we go with PIDs, when would (non-resolvable) UUIDs come in handy, and where do we only need local uniqueness?

The four main levels that a PID could be applied to are:

Repository: It was already suggested by @bcorrie that AIRR-compliant repositories should have PIDs. iReceptor Public Archive and VDJServer already have DOIs through fairsharing.org. Whether this is a recommendation (to enhance citability) or a requirement (mandatory ID within a standard) is up for discussion. This will also depend on how (and whether) the subsequent PIDs will be resolved.
Study: In most cases there will be a DOI assigned to the related publication, however we need to keep in mind that this refers to a different object class, i.e., scholarly communication instead of the actual data set. This is usually sufficient for a human curator to quickly find the associated data sets, but this might not be true for a computer. BioProject IDs referring to a study can be considered to be PIDs (as the resolver is broadly known), but will refer to a record at INSDC, not in an AIRR repository.
Repertoire: While this is also a good candidate for a PID there are a couple of points to consider:
1. We clearly need PIDs for data sets of a given study. However, whether we need them at the level of Sample or on the level of Repertoire needs further discussion.
2. Repertoire currently does not have a strict definition (also see Cell and Repertoire definition #361) and Repertoire objects can be generated dynamically during queries. This could lead to inflation of PIDs and it is questionable whether there is any added value to this.
3. As this is an AIRR-specific object, we need to find ways to mint and administer these PIDs.
Rearrangement: Most certainly not, as PIDs usually refer to a data set (e.g., a table), not the individual datum (e.g., a line within the table). Furthermore, it is currently hard to see a use case for this as long as we have the possibility to create arbitrary and non-overlapping sets of rearrangements, i.e., repertoires.

schristley · 2020-04-06T18:12:24Z

Is this something we need to resolve for ADC API v1?

Probably not; we mainly need it for the new (experimental) objects like Clone, Cell, etc., so we can resolve it in concert with their release.

schristley · 2021-06-15T03:35:48Z

I've reviewed the W3C standard for decentralized identifiers, and it looks like it will work quite well for our purposes. I'm considering this standard just for the identifiers in the AIRR Data Model used to reference AIRR objects; external identifiers outside our control are handled with #464.

A decentralized identifier (DID) has a simple syntax consisting of three parts separated by colons:

did:example:identifier

where did is static and defines this as a decentralized identifier, example is called the DID method, and identifier is called the DID method-specific identifier. The DID spec places few limitations on the identifier part; we can even have additional colons in it if we want.
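The three-part split described above can be sketched as follows. This is a simplified illustration: the names `ParsedDID` and `parse_did` are hypothetical, and the sketch does not validate the character rules that the DID specification imposes on the method name and identifier.

```python
from typing import NamedTuple


class ParsedDID(NamedTuple):
    method: str
    method_specific_id: str


def parse_did(did: str) -> ParsedDID:
    """Split 'did:<method>:<method-specific-id>' into its parts.

    Only the first two colons are structural; any later colons stay
    inside the method-specific identifier.
    """
    parts = did.split(":", 2)
    if len(parts) != 3 or parts[0] != "did" or not parts[1] or not parts[2]:
        raise ValueError(f"not a DID: {did!r}")
    return ParsedDID(method=parts[1], method_specific_id=parts[2])
```

Because the split stops after the second colon, an identifier part that itself contains colons is returned unchanged.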

The DID method is the key part. It is somewhat analogous to the first part of a CURIE. It's creating the unique namespace for the identifiers. Also, according to the spec, "a DID method defines how implementers can realize the features described by this specification". We need to define a DID method and SHOULD register it with the DID Registry. So in an odd twist, creating a decentralized identifier suggests registering in a central repository namespace... though it's not mandatory.

Anyways, my suggestion is we define and register the airr DID method. That is, all AIRR DIDs look like this:

did:airr:identifier

The DID spec talks a lot about verification, security, etc., but all of those capabilities are optional. The DID method must implement a number of functions for DID resolution and URL dereferencing, though the spec leaves it almost completely open for how the DID method does that. Conceptually I find this very similar to how we are resolving CURIE identifiers, and I believe we can implement much the same for DIDs.

What's left for us to consider is how to define the DID method-specific identifier. There is no requirement in the DID spec that the TYPE of resource, which the DID references, must be the same. So we could do something simple with just numbers, like this, but this doesn't provide us enough flexibility as we want identifiers to resolve to different repositories in the ADC.

did:airr:123
did:airr:124
did:airr:567

With the DID method as airr, that provides a global AIRR namespace, and it's up to us if we want to impose additional structure and sub-namespaces to it. My suggestion is that we define an additional repository sub-namespace:

did:airr:repository:identifier

Then DIDs look like this:

did:airr:ipa:123
did:airr:vdjserver:124
did:airr:ogrdb:567

but even this isn't quite complete, because does did:airr:vdjserver:124 refer to a Repertoire, a Rearrangement, or another AIRR object? If we want to do full URL dereferencing, like we are doing with CURIE, we need to know the type to know the proper ADC API end point to hit. Here's where I think we have options, one simple idea is to add another sub-namespace level that defines the type.

did:airr:repository:type:identifier

did:airr:ipa:repertoire:123
did:airr:vdjserver:repertoire:124
did:airr:vdjserver:germline_set:124
did:airr:ogrdb:germline_set:567

But it's equally valid to combine those two sub-namespaces together into one like so; we have complete control over the format.

did:airr:repository_and_type:identifier

did:airr:ipa_repertoire:123
did:airr:vdjserver_repertoire:124
did:airr:vdjserver_germline_set:124
did:airr:ogrdb_germline_set:567

Currently, I prefer the first option with two namespaces.

Hopefully now we can see how DIDs can be implemented. Like CURIEs, we have a resolution table in the AIRR schema that defines how ipa, vdjserver and ogrdb can be de-referenced into URLs.

did:airr:vdjserver:repertoire:124
==>
https://vdjserver.org/airr/v1/repertoire/124
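A minimal sketch of that de-referencing step, assuming the two-namespace form did:airr:repository:type:identifier. The table `REPOSITORY_BASE_URLS` and the function `resolve_airr_did` are hypothetical names; only the vdjserver base URL comes from the example above, and a real table would live in the AIRR schema or a registry.

```python
# Hypothetical resolution table: repository name -> base URL of its ADC API.
REPOSITORY_BASE_URLS = {
    "vdjserver": "https://vdjserver.org/airr/v1",
}


def resolve_airr_did(did: str) -> str:
    """Turn 'did:airr:<repository>:<type>:<identifier>' into a URL."""
    # Split on the first four colons only, so the identifier may contain colons.
    parts = did.split(":", 4)
    if len(parts) != 5 or parts[:2] != ["did", "airr"] or not all(parts):
        raise ValueError(f"not an AIRR DID of the two-namespace form: {did!r}")
    _, _, repository, object_type, identifier = parts
    try:
        base_url = REPOSITORY_BASE_URLS[repository]
    except KeyError:
        raise ValueError(f"unknown repository: {repository!r}") from None
    return f"{base_url}/{object_type}/{identifier}"


if __name__ == "__main__":
    print(resolve_airr_did("did:airr:vdjserver:repertoire:124"))
```

If a repository changes its hostname, only the table entry needs updating; the DIDs stored in files stay the same.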

bussec · 2021-10-07T19:43:40Z

In case we decide against decentralized identifiers, the URN service run by GEANT might be a potential way to coin PIDs without having to run the registry:

https://tools.ietf.org/html/rfc4926
https://wiki.geant.org/display/URN/Registry+Home

schristley · 2022-01-21T18:11:38Z

The dual usage/requirements for identifiers that link/reference AIRR objects within the AIRR Data Model continue to bite us. The dual usage being:

user-defined (or tool-assigned) identifier values that are locally consistent within files for running analysis tools and such.
globally unique identifier values for ADC objects that are FAIR.

For sequence_id, we agreed that the ADC can overwrite the identifier value with their own to provide a PID for the rearrangement record in the repository. This really isn't optimal because there are efforts around analysis reproducibility where we'd like to backtrack to the original sequence in a raw data file, and thus want that original sequence_id. This becomes more problematic with fields like cell_id and clone_id, where annotation tools generate those ids and use them throughout multiple files/records. Overwriting the identifier in the ADC loses any linkage with data stored outside the ADC. Thus, I believe we really need to consider maintaining original identifier values (like we do with subject_id, sample_id) as separate from ADC PIDs.

An easy idea is to separate the fields, i.e. have *_pid fields which hold the ADC PID while the *_id fields hold the original user-defined (tool assigned) value. Yes, it creates more fields but at least the semantics for each are clear and precise. This seems like it can work, except for one key problem. What do tools do?

Today, tools assume that *_id are unique within the local context of data files, but there is a scenario where that breaks down: downloading multiple studies from the ADC and then combining the study data together. There's no guarantee that a clone_id or other *_id from one study doesn't conflict with the *_id values from another study. If the tools used the *_pid fields instead, they would get uniqueness. But that complicates tool logic: every time they want to use an identifier, they have to decide whether to use the *_pid or the *_id field.

One could make the argument, as with subject_id and sample_id, that uniqueness is only guaranteed within the study and/or within the Repertoire, and thus tools need to use compound keys, e.g. study_id, repertoire_id, data_processing_id, clone_id. Unfortunately, this doesn't completely solve the problem, as the top-level objects (Repertoire, RepertoireGroup, DataProcessing) can still have conflicts. That is, if a tool assigns repertoire_id for local files, and the ADC assigns repertoire_pid, there's no guarantee repertoire_id is unique across studies, so we are back to the same problem, though maybe now there are fewer fields to consider, i.e. only those of the top-level AIRR objects.
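To make the collision and the compound-key workaround concrete, here is a toy sketch. All record values are made up, and the flat record layout is an assumption for illustration only. It also assumes the leading key fields are themselves distinct across the combined data, which, as noted above, is not guaranteed for top-level objects.

```python
# Toy records from two studies; every value here is invented for illustration.
study_a = [{"study_id": "A", "repertoire_id": "r1",
            "data_processing_id": "dp1", "clone_id": "1"}]
study_b = [{"study_id": "B", "repertoire_id": "r1",
            "data_processing_id": "dp1", "clone_id": "1"}]

KEY_FIELDS = ("study_id", "repertoire_id", "data_processing_id", "clone_id")


def compound_key(record: dict) -> tuple:
    """Build a lookup key from all fields that together scope a clone_id."""
    return tuple(record[field] for field in KEY_FIELDS)


combined = study_a + study_b

# Keyed on clone_id alone, the second study's record overwrites the first.
by_clone_id = {record["clone_id"]: record for record in combined}

# Keyed on the compound key, both records survive.
by_compound_key = {compound_key(record): record for record in combined}

if __name__ == "__main__":
    print(len(by_clone_id), len(by_compound_key))
```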

Another idea is to not have *_pid fields, but instead when the data is loaded into the ADC, the *_id fields are assigned the PID, and the original user-defined value is put in a *_original_id field. So it's still the idea of having separate fields. In this case though, tools can continue to assume *_id are unique, and the problem of combining data from multiple studies from the ADC is solved. The exception is if tools want to link ADC data with external data; they will have to know to use the *_original_id fields.

I don't see a solution that doesn't require having separate fields if we want to store both the original identifier value and the ADC PID. Any other ideas?

To summarize:

Have separate *_id and *_pid fields. Tools will need logic to use one or the other to do lookups.
Have separate *_id and *_pid fields, but rely upon the compound nature of the *_id to define scope, e.g. clone_id is only unique within a repertoire, study and data processing. Tools would need to use that compound nature when doing lookups. There would still be potential conflict for top-level AIRR objects, so tools will need logic to use *_id or *_pid for them.
Have separate *_id and *_original_id fields. ADC can overwrite *_id with PID and puts the original value in *_original_id. Tools can assume *_id is unique and links ADC objects, when linking to external non-ADC data, the *_original_id fields need to be used.
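The third option can be sketched roughly as below. This is only an illustration under stated assumptions: `ID_FIELDS` and `load_into_repository` are hypothetical names, the records are flat dictionaries, and a random UUID stands in for whatever PID scheme a repository would actually use.

```python
import uuid

ID_FIELDS = ("cell_id", "clone_id")  # hypothetical subset of the *_id fields


def load_into_repository(records: list[dict]) -> list[dict]:
    """Move each *_id value to *_original_id and assign a new *_id.

    Records that shared an original value within one load get the same new
    value, so links between records survive the rewrite.
    """
    mapping: dict[tuple[str, str], str] = {}
    loaded = []
    for record in records:
        new_record = dict(record)
        for field in ID_FIELDS:
            if field not in record:
                continue
            original = record[field]
            key = (field, original)
            if key not in mapping:
                mapping[key] = str(uuid.uuid4())  # stand-in for a real PID
            original_field = field[: -len("_id")] + "_original_id"
            new_record[original_field] = original
            new_record[field] = mapping[key]
        loaded.append(new_record)
    return loaded
```

Calling the function once per study keeps the mapping scoped to that study, so identical original values from different studies end up with different *_id values.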

bussec · 2022-01-24T03:22:35Z

My 2 cents on this:

I agree with the general idea of having separate fields
To maintain the original value is an important but less frequent use case. In addition, using the original identifiers for linkage to non-ADC data might be subject to further ambiguity (see below). Therefore I would give preference to an *_id/*_original_id solution.
Compound IDs that rely on understanding the structure of the AIRR Schema seem complex and potentially error-prone to me.
Are we sure that there will always be only a single original ID to store?

javh · 2022-01-24T17:00:35Z

*_id and *_original_id makes the most sense to me as well. But, this seems like a pretty specific use case that might be a job for a custom field.

Can I throw out a 4th option? What about some sort of provenance object to store these relationships? It'd be essentially the same thing as *_original_id, but stored in a separate table. It would also lend itself to lists of original identifiers, links to DataProcessing for how the change was made, etc.

bcorrie · 2022-01-25T23:03:13Z

The problem with the _original_id concept is that the field names in the source files will be of the _id form. 10X produces data with clone_id and cell_id in their files (same with Immcantation etc., no?), and these naturally map to the field names in the spec. It seems more natural to me to maintain those fields as they are in the original data from the annotation tool and to have a new, specific field that more explicitly states what it is. For example, if it is truly a persistent ID as per the FAIR PID definition, then maybe it should be _pid, but if it is "just" a CURIE that turns the ID in the repository into something that is globally unique, maybe it should be a _gid for global identifier?

bcorrie · 2022-01-25T23:06:52Z

I think the question of having globally unique identifiers for objects in ADC repositories and managing provenance and how such globally unique objects are related to each other are two different topics, no?

bcorrie · 2022-01-25T23:15:09Z

did:airr:repository:type:identifier
did:airr:vdjserver:repertoire:124
==>
https://vdjserver.org/airr/v1/repertoire/124

BTW, I like this structure because the vdjserver part maps directly to the servers objects in the OpenAPI 3.0 spec. In that way, the DID structure and the OpenAPI server and path objects map nicely.

schristley · 2022-02-21T22:41:29Z

I'm seeing two potential solutions:

_id fields satisfy both global uniqueness and persistence. This implies some CURIE-like value that provides both properties.
separate global uniqueness and persistence. The _id fields have global uniqueness, and separate _ref fields contain a persistent reference.

We've mainly been considering 1 but GermlineSet uses 2 in its draft. Here are pros/cons that I can think of:

PRO: 1 has fewer fields.
PRO: the CURIE-like structure of 1 almost guarantees global uniqueness.
CON: 1 requires a resolver, to interpret the value and translate into a URL. CURIE prefixes need to be stored in Schema; we would need to update whenever a new data repository (new prefix) is added to ADC. We might get around this by having an ADC registry.
PRO: 2 could use a resolver but it could simply be the direct URL, e.g., https://vdjserver.org/airr/v1/repertoire/1159043104164212245-242ac114-0001-012
CON: If 2 uses a CURIE-like resolver, it seems redundant; might as well just use 1.
PRO: Using a resolver allows flexibility in the data repository, i.e. hostnames can change, resolvers can be updated with new features, etc.
CON: Having a fixed URL in 2 provides less flexibility, in order to be persistent that host/API must always be available.
CON: 1 requires re-assigning the identifier values in the ADC, for example, a repertoire_id might be vdjserver:123. This could be a heavy burden on data repositories as they might need to update all the records in the database (they could also do some translation on the fields during input/output).
CON: For 1, tools have no way to assign the persistent value, so the ADC would always need to overwrite the _id values.
PRO: For 2, tools that assign UUIDs, those UUIDs could potentially be kept when loading into the ADC.
CON: For 2, tools that don't assign UUIDs, the ADC would need to overwrite the _id values.

Any other pros/cons?

Regardless of 1 or 2, the ADC needs the ability to overwrite any local values assigned by tools when data is loaded into the ADC.

IMO, I'm leaning toward 1 at the moment. The main CON is it requires re-assigning identifier values in the ADC, but I think the flexibility of a CURIE-like resolver is a significant PRO.
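The "translation on the fields during input/output" mentioned in the CON for 1 could look roughly like this, so that a repository keeps its stored values and only adds or strips the prefix at the API boundary. `PREFIX`, `to_external` and `to_internal` are hypothetical names; the vdjserver:123 form is the example from the list above.

```python
PREFIX = "vdjserver"  # CURIE-like prefix, as in the vdjserver:123 example


def to_external(local_id: str) -> str:
    """Add the repository prefix when an identifier leaves the repository."""
    return f"{PREFIX}:{local_id}"


def to_internal(external_id: str) -> str:
    """Strip the repository prefix when an identifier arrives in a query."""
    prefix, separator, local_id = external_id.partition(":")
    if separator != ":" or prefix != PREFIX or not local_id:
        raise ValueError(f"not an identifier of this repository: {external_id!r}")
    return local_id
```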

scharch · 2022-02-22T15:54:05Z

Groundhog Day thread (#340)

😱

scharch · 2022-02-22T16:14:00Z

My argument is that since the cell_id is a field that is produced by many pipelines that process Cell data, maybe it isn't a great idea for us to use that field name as the field that contains a PID for a cell (which is what the ADC requires). We absolutely need a PID field, but I think it is a mistake to throw the tool generated linking field across these files away!

But we already do this for sequence_id, and I can't see how cell_id (or clone_id or data_processing_id or...) is any different. I do understand the desire to be able to trace data back to its source, but the nature of the schema already limits this: sequence_ids can't really be traced back to raw fastqs without a lot of work to re-execute the DataProcessing, and even that assumes that the DataProcessing is actually complete/fully specified and the link out to SRA/etc is stable and correct. And in some cases (especially Tree generation), even a complete DataProcessing may not be deterministic...

In any case, what you're describing seems to be a "backend" ADC feature/use, so I don't think it should complicate end user-facing *_id fields. We've talked in the past about ADC-specific extensions to the schema, and a Provenance object seems like a good fit for that category...

bcorrie · 2022-02-22T16:14:38Z

@bcorrie This thought is completely bizarre and bewildering to me.

It also seems bizarre and bewildering to me that we are so adamant that we throw this information away! Why is there such a reluctance to having an extra field that captures this info as part of the standard? There is a very strong data curation use case to keep it, so I am also bewildered... 8-) The standard isn't just about analysis, but data reusability and data curation.

bcorrie · 2022-02-22T16:15:25Z

But we already do this for sequence_id, and I can't see how cell_id (or clone_id or data_processing_id or...) is any different.

Yep, and I argued strongly against that one too - but caved in because it was only one field...

scharch · 2022-02-22T16:17:53Z

It also seems bizarre and bewildering to me that we are so adamant that we throw this information away! Why is there such a reluctance to having an extra field that captures this info as part of the standard? There is a very strong data curation use case to keep it, so I am also bewildered... 8-) The standard isn't just about analysis, but data reusability and data curation.

Because there is no "information" there that is being discarded! And trying to preserve the original value of the field by adding a new field pollutes the schema without adding any analysis benefit in the ways that @javh and I have been arguing through (apparently) two entire threads now :-)

bcorrie · 2022-02-22T16:19:37Z

In any case, what you're describing seems to be a "backend" ADC feature/use, so I don't think it should complicate end user-facing *_id fields.

I don't agree - throwing away information that an annotation tool provides has nothing to do with the ADC. This is 100% a curation process issue.

scharch · 2022-02-22T16:24:36Z

So if I ever want to go back and try to understand something about the data in my repository by looking at the original data, I can. That is only possible if the tool produced cell_id is stored in the repository and can be referenced in the original data.

This implies that you are also storing the entire dataset in its original format somewhere accessible-but-outside-of-the-ADC?!? But isn't the point of the ADC to be the copy of record so that the original becomes irrelevant? Do you really have 2 copies of everything in iReceptor?

scharch · 2022-02-22T16:26:46Z

In any case, what you're describing seems to be a "backend" ADC feature/use, so I don't think it should complicate end user-facing *_id fields.

I don't agree - throwing away information that an annotation tool provides has nothing to do with the ADC. This is 100% a curation process issue.

It's not "information." Metadata, perhaps. And if curation isn't part of the ADC, then who are we doing this for? It's not part of the end-user data reuse process...

javh · 2022-02-22T17:38:58Z

The _id fields have global uniqueness, and separate _ref fields contain a persistent reference.

I think having _ref fields is fine, but I don't see them as a solution here. If a _ref is a foreign key / citation when uploaded, there's no guarantee that it's going to remain a valid reference in the future and it can't be trusted as a linking identifier in the ADC. If I'm understanding the _ref field correctly, then it's really just a more formal comment string.

schristley · 2022-02-22T17:48:14Z

The _id fields have global uniqueness, and separate _ref fields contain a persistent reference.

I think having _ref fields is fine, but I don't see them as a solution here. If a _ref is a foreign key / citation when uploaded, there's no guarantee that it's going to remain a valid reference in the future and it can't be trusted as a linking identifier in the ADC. If I'm understanding the _ref field correctly, then it's really just a more formal comment string.

Call them _pid fields if it helps; they should contain whatever is necessary for persistent access to the object. The point being that the persistence attribute is separated from the global uniqueness attribute. Regardless, I don't think they need to be separated, but I offered it as an alternative solution in case somebody thought of some PROs.

javh edit: Sorry @schristley, I accidentally edited this instead of quoting (I don't know how). Should be restored now.

bcorrie · 2022-02-22T17:49:06Z

So if I ever want to go back and try to understand something about the data in my repository by looking at the original data, I can. That is only possible if the tool produced cell_id is stored in the repository and can be referenced in the original data.

This implies that you are also storing the entire dataset in its original format somewhere accessible-but-outside-of-the-ADC?!? But isn't the point of the ADC to be the copy of record so that the original becomes irrelevant? Do you really have 2 copies of everything in iReceptor?

Nope, but we want to support reproducibility wherever we can... So no data in the pipeline is ever really irrelevant.

The point of the ADC is data sharing, data reuse, and reproducibility. I would argue that is also the point of the AIRR Standard as well. The AIRR Standard points to source records of information throughout. SRA files (RawSequenceData object), INSDC Bioproject information (study_id), the files in which the annotated data came from (data_processing_files), etc. These are critical for reproducibility. We don't store everything, but we try to make it possible to reproduce everything...

Curation is part of this entire process - it is not specific to the ADC. If you describe a study using the AIRR Standard, you are curating data according to the AIRR Standard.

If you want to be truly reproducible, at any point in the processing pipeline, I should be able to use the AIRR Standard to go from one processing step to another processing step, and be able to reproduce where a piece of data came from.

Here is my curator use case. I don't need to be using the ADC for this, this could be using studies curated for analysis and stored completely on disk using the AIRR format files for repertoire, rearrangement, cell, clone, etc.

As a data curator, if I want to confirm that data in my AIRR files (or my ADC repository) is correct, I SHOULD be able to go back to my source files and confirm this is indeed the case. When I lost sequence_id, I lost the ability to do that for the original fastq files - damn, but hey it is only the sequence that we are talking about, and we have millions 8-)

But now we are talking about cells, which have complicated linkages across rearrangements, clones, cells, and gex data. In the case of annotation tools, these linkages are across many files. So when I process some 10X studies (N samples from one study and M samples from another study) generating AIRR compliant files in preparation for analysis, I replace the source 10X cell_id with a unique AIRR cell_id to make sure cell_id is unique across my analysis of interest.

Now I want to confirm that the data I just processed for a certain 10X cell_id (TACGGATGTACACCGC-1) from a single subject in my source data is correct across the data I am going to use for my analysis. I can't...

Similarly, if I want to look at an AIRR unique cell_id in my processed data and then find the source information in the original 10X produced data files. Again, I can't...

So we have broken the link between the data in the AIRR compliant files to the original source data - data/"information" can no longer be mapped between the two...

Now if you truly trust the tools that do all of that processing, then maybe you don't want to do any provenance or reproducibility checks... But that is not how I would do things 8-)

Here is an example of what you get from a repository with our current implementation. If I maintain the annotation tool cell_id in some form, I can cross check the validity of the data I loaded with the original 10X files. If I don't, I can't... If you are a data steward maintaining an ADC repository, this is an important step...

Basically I want to be able to ensure that cell_id_annotation_tool = TACGGATGTACACCGC-1 links the correct data in the original 10X files (ERS1-TRA.tsv, ERS1-vdj_t_gex.json, ERS1-vdj_t-cells.json) that I as the data curator have maintained...

 curl -d '{"fields":["study.study_id","sample.sample_id", "sample.sequencing_files.filename", "data_processing.data_processing_files"]}' http://single-cell.ireceptor.org/airr/v1/repertoire

[Some stuff deleted/edited]

{
    "study": {
        "study_id": "PRJCA002413"
    },
    "sample": [
        {
            "sample_id": "ERS1",
            "sequencing_files": {
                "filename": "CRR126571_f1.fastq.gz, CRR126572_f1.fastq.gz, CRR126573_f1.fastq.gz, CRR126574_f1.fastq.gz"
            }
        }
    ],
    "data_processing": [
        {
            "data_processing_files": [
                "ERS1-TRA.tsv"
            ]
        }
    ]
},
{
    "study": {
        "study_id": "PRJCA002413"
    },
    "sample": [
        {
            "sample_id": "ERS1",
            "sequencing_files": {
                "filename": "CRR126563_f1.fastq.gz, CRR126564_f1.fastq.gz, CRR126565_f1.fastq.gz, CRR126566_f1.fastq.gz"
            }
        }
    ],
    "data_processing": [
        {
            "data_processing_files": [
                "ERS1-vdj_b_gex.json",
                "ERS1-vdj_b-cells.json",
                "ERS1-vdj_t_gex.json",
                "ERS1-vdj_t-cells.json"
            ]
        }
    ]
}
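
As a rough illustration of the cross-check described above, the Python sketch below indexes a repertoire response by study and sample, so a curator can see which processed files should contain a given annotation-tool cell ID. The function name `files_by_sample` is hypothetical, and the input is assumed to be a list of repertoire objects shaped like the edited excerpt above. It does not contact the repository.

```python
from collections import defaultdict


def files_by_sample(repertoires):
    """Map (study_id, sample_id) to the sorted processed files listed for it."""
    index = defaultdict(set)
    for rep in repertoires:
        study_id = rep["study"]["study_id"]
        # Collect every processed file named in this repertoire.
        files = [
            name
            for dp in rep.get("data_processing", [])
            for name in dp.get("data_processing_files") or []
        ]
        for sample in rep.get("sample", []):
            index[(study_id, sample["sample_id"])].update(files)
    return {key: sorted(names) for key, names in index.items()}


# Trimmed copy of the two repertoire objects shown above.
repertoires = [
    {
        "study": {"study_id": "PRJCA002413"},
        "sample": [{"sample_id": "ERS1"}],
        "data_processing": [{"data_processing_files": ["ERS1-TRA.tsv"]}],
    },
    {
        "study": {"study_id": "PRJCA002413"},
        "sample": [{"sample_id": "ERS1"}],
        "data_processing": [
            {
                "data_processing_files": [
                    "ERS1-vdj_b_gex.json",
                    "ERS1-vdj_b-cells.json",
                    "ERS1-vdj_t_gex.json",
                    "ERS1-vdj_t-cells.json",
                ]
            }
        ],
    },
]

index = files_by_sample(repertoires)
for key, names in index.items():
    print(key, names)
```

With an index like this, the files to open for sample ERS1 are known before looking up a cell ID such as TACGGATGTACACCGC-1. The lookup inside the 10X files themselves is left out because their layout is not shown here.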

scharch · 2022-02-22T18:16:27Z

@bcorrie I am happy to stipulate to the importance of being able to trace the provenance of a piece of data. But I am going to respond to the rest in the new Provenance object thread (#589) so that we don't crush all of @javh's hopes and dreams...

schristley · 2022-02-22T18:28:27Z

In any case, what you're describing seems to be a "backend" ADC feature/use, so I don't think it should complicate end user-facing *_id fields.

I don't agree - throwing away information that an annotation tool provides has nothing to do with the ADC. This is 100% a curation process issue.

@bcorrie We don't have to keep going round and round on this in this issue. I brought up the issue initially, and I was happy with doing a custom solution, but you'd like something more formal, which is fine. That's been recognized with #589 and we can discuss solutions there. Let's get this issue back onto its main track of FAIR for ADC objects.

javh · 2022-02-22T19:29:25Z

@schristley

Call them _pid fields if it helps; they should contain whatever is necessary for persistent access to the object.

I don't think it helps. At least, not as I'm interpreting it. The _ref being foreign is the rub. Which, I think is fine as metadata, but won't work as an ID in the ADC because you can't update the foreign record (eg, to fix v_call, remove sequencing adapters, or whatever).

I guess the question is whether that's a problem.

schristley · 2022-02-22T19:40:57Z

@schristley

Call them _pid fields if it helps; they should contain whatever is necessary for persistent access to the object.

I don't think it helps. At least, not as I'm interpreting it. The _ref being foreign is the rub. Which, I think is fine as metadata, but won't work as an ID in the ADC because you can't update the foreign record (eg, to fix v_call, remove sequencing adapters, or whatever).

I'm not sure what you mean by "foreign". If you are thinking "foreign key", that's not what is meant. I also don't understand how "update the foreign record" matters. This is persistent access to a read-only object.

According to FAIR, (meta)data are assigned a globally unique and persistent identifier. There is no requirement that these two attributes be satisfied by a single field. For example, IEDB splits them into two fields: one is the identifier (which doesn't look globally unique but is, because IEDB is a central database), and the other is the IRI for persistence.

Reference ID | Reference IRI | Epitope ID | Epitope IRI
-- | -- | -- | --
1004580 | http://www.iedb.org/reference/1004580 | 16878 | http://www.iedb.org/epitope/16878

javh · 2022-02-22T19:45:13Z

@schristley, Ah, I see... maybe. I'm getting my signals crossed here. I was thinking of _ref as described in the Germline schema and discussed in the last call. Which is, for example, the GenBank accession providing evidence for a novel allele, so, yes, a foreign key.

The _ref you're describing is the _pid field we've been discussing in this thread, except that it is not being used as the ADC linking identifier. Correct?

schristley · 2022-02-22T19:51:13Z

@schristley, Ah, I see... maybe. I'm getting my signals crossed here. I was thinking of _ref as described in the Germline schema and discussed in the last call. Which is, for example, the GenBank accession providing evidence for a novel allele, so, yes, a foreign key.

The _ref you're describing is the _pid field we've been discussing in this thread, except that it is not being used as the ADC identifier. Correct?

Right. Sorry, I was mentioning _ref in terms of germline_set_ref which is essentially a persistent IRI that is separate from the identifier germline_set_id, and not the references to foreign records.

IMO, germline_set_ref satisfies both global uniqueness and persistence, so there really isn't a need for two fields...

CON: If 2 uses a CURIE-like resolver, it seems redundant; might as well just use 1.

schristley · 2022-02-22T20:01:47Z

I just thought of another major CON for doing 2 instead of 1.

CON: With 2, all references to an object must include both fields because the _id isn't sufficient to resolve the object.

For example, say I had a rearrangement record that references a clone_id, but the Clone data is not provided as part of the data set. The clone_id is insufficient to get the clone data; I would also need clone_pid (or clone_ref) so that I could resolve and download the object. This implies that in all of the AIRR objects, we would need both fields, creating a lot of additional fields to be maintained.

schristley · 2022-02-23T21:44:39Z

Thinking about the actual content of the identifier, if we go with a CURIE-like structure, where we need a resolver, we can support decentralized identifiers later on, if we want. It would just involve extending the resolver code. We can support both and repositories can pick the one they want to implement.

The other thing is whether a type is needed as part of the identifier:

repository:type:code

vdjserver:repertoire:124
vdjserver:germline_set:145
vdjserver:clone:567

But this maybe isn't needed? The reason is that the field (repertoire_id, germline_set_id, clone_id, etc.) is essentially defining the type. If the identifier is in repertoire_id then we know it's a repertoire, if it is in clone_id then we know it's a clone, and so on. In the AIRR schema, we don't mix and match identifier types in the same field, nor do we have generic fields. Though this means resolving requires knowing the context (field) of the identifier; if somebody just gave you a value vdjserver:124, it couldn't be resolved properly by itself. Maybe this goes against the identifier being "self-contained"?
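
To make the context dependence concrete, here is a minimal Python sketch, assuming the two-part `repository:code` form. The helper names `split_identifier` and `describe` and the `FIELD_TYPES` table are hypothetical and not part of the AIRR schema or any existing resolver. The object type is taken from the field name, not from the identifier value.

```python
def split_identifier(value):
    """Split 'repository:code' into (repository, code) at the first colon."""
    prefix, sep, local = value.partition(":")
    if not sep or not prefix or not local:
        raise ValueError(f"not a prefixed identifier: {value!r}")
    return prefix, local


# Hypothetical mapping from AIRR field name to the object type it implies.
FIELD_TYPES = {
    "repertoire_id": "repertoire",
    "germline_set_id": "germline_set",
    "clone_id": "clone",
}


def describe(field, value):
    """Combine the type implied by the field with the parsed identifier."""
    repository, code = split_identifier(value)
    return {"repository": repository, "type": FIELD_TYPES[field], "code": code}


# The same value names different objects depending on the field holding it.
print(describe("repertoire_id", "vdjserver:124"))
print(describe("clone_id", "vdjserver:124"))
```

In this sketch the bare value vdjserver:124 cannot be typed without the field name, which is the trade-off raised above. The three-part `repository:type:code` form would remove that dependence at the cost of longer identifiers.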

Another point is that the complete value is the identifier value, so an ADC API call for that specific repertoire_id would be

https://vdjserver.org/airr/v1/repertoire/vdjserver:124

Likewise, when sending a POST query

{
    "filters": {
        "op": "in",
        "content": {
            "field": "repertoire_id",
            "value": [
                "vdjserver:2366080924918616551-242ac11c-0001-012",
                "vdjserver:2541616238306136551-242ac11c-0001-012",
                "vdjserver:1993707260355416551-242ac11c-0001-012",
                "vdjserver:1841923116114776551-242ac11c-0001-012"
            ]
        }
    }
}

If this wasn't the case, that is, if just the trailing code (or number) was the identifier, users would have to constantly parse the value to pull out the appropriate bits.

This also means that our CURIE-like resolver cannot manipulate the identifier in any way, which is done for some ontology fields. If the identifier values change, for queries, for data returned from the ADC, etc., then it fails at being an identifier and objects cannot be linked.
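
A small Python sketch of this "use the value verbatim" rule follows. The base URL is taken from the example URL above; the helper names `repertoire_url` and `repertoire_filter` are hypothetical. Neither helper strips the prefix or otherwise rewrites the identifier, and nothing is sent over the network.

```python
import json

# Assumed base URL for the repository, copied from the example URL above.
BASE_URL = "https://vdjserver.org/airr/v1"


def repertoire_url(repertoire_id):
    """Return the GET URL for one repertoire, using the identifier verbatim."""
    return f"{BASE_URL}/repertoire/{repertoire_id}"


def repertoire_filter(repertoire_ids):
    """Build a POST body selecting repertoires by their full identifiers."""
    return {
        "filters": {
            "op": "in",
            "content": {"field": "repertoire_id", "value": list(repertoire_ids)},
        }
    }


ids = [
    "vdjserver:2366080924918616551-242ac11c-0001-012",
    "vdjserver:2541616238306136551-242ac11c-0001-012",
]
body = repertoire_filter(ids)
print(repertoire_url("vdjserver:124"))
print(json.dumps(body, indent=4))
```

Because the full value is passed through unchanged, the identifier returned by a query can be used directly in the next query or URL without any parsing.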

bcorrie · 2024-02-06T21:58:22Z

I am thinking that this issue is probably not going to be resolved in v2.0 (and doesn't need to be resolved in 2.0). Moving this to v2.1.

schristley · 2024-03-02T19:02:07Z

@bcorrie In some sense, I think we are making this issue more complicated than it needs to be, at least in the context of the ADC. All we need to do is make these identifiers (in the ADC) be CURIEs. The prefix part points to the global service, i.e. the ADC repository, and the local identifier part can be whatever the ADC repository chooses to interpret. I think that James' presentation of LinkML and his discussion of CURIEs shows that it works quite well for creating globally unique identifiers that can be resolved and be FAIR.

AKC is going to need them. The question is whether we implement them first in the data integration scripts (ADC --> AKC) as a test and then port them back into the ADC, or just put them in the ADC first.

schristley added the ADC API V1 AIRR Data Commons API V1 label Mar 11, 2020

schristley mentioned this issue Mar 11, 2020

merge rearrangement_id -> sequence_id #340

Closed

schristley changed the title ~~We need a more formal, fully-qualified DOI for repository objects~~ We need a more formal, fully-qualified DOI for repository AIRR Data Model objects Mar 11, 2020

schristley changed the title ~~We need a more formal, fully-qualified DOI for repository AIRR Data Model objects~~ We need a more formal, fully-qualified DOI for repository objects Mar 11, 2020

bcorrie mentioned this issue Apr 8, 2020

Uniqueness of _id fields in airr_schema.yaml #246

Closed

bcorrie added ADC API V2 AIRR Data Commons API V2 and removed ADC API V1 AIRR Data Commons API V1 labels Apr 8, 2020

schristley mentioned this issue Jul 11, 2020

Representation of relations in on-disk format #439

Open

schristley changed the title ~~We need a more formal, fully-qualified DOI for repository objects~~ We need a more formal, fully-qualified identifiers for repository objects Jun 15, 2021

schristley added this to the AIRR v1.4.0 milestone Jun 15, 2021

schristley added the DataRep label Jun 15, 2021

schristley mentioned this issue Sep 15, 2021

Have fields refer to Persistent IDs as often as possible #464

Closed

bcorrie mentioned this issue Sep 17, 2021

Addition of a manifest object. #548

Merged

schristley mentioned this issue Jan 18, 2022

Updates to cell object as per #409 #574

Merged

javh mentioned this issue Feb 27, 2022

what's the scope of the identifiers in the germline schema objects? #562

Closed

9 tasks

bussec modified the milestones: AIRR v1.4.0, AIRR v2.0.0 Mar 21, 2022

bcorrie modified the milestones: AIRR 2.0, AIRR 2.1 Feb 6, 2024

schristley mentioned this issue Mar 5, 2024

CURIE conundrum airr-knowledge/issues#32

Closed

javh removed the DataRep label Sep 9, 2024
