Uniqueness of _id fields in airr_schema.yaml #246

Closed
bcorrie opened this issue Sep 11, 2019 · 61 comments
Labels
ADC API V1 AIRR Data Commons API V1 documentation

Comments

@bcorrie
Contributor

bcorrie commented Sep 11, 2019

@schristley I was looking at several of the _id fields in the schema, and I note in the descriptions we do not mention uniqueness criteria for many (any?) of them. I think this is a problem, isn't it??? Am I missing something?

If I go to the rearrangement level, we have several _ids (pair_id, clone_id, cell_id, rearrangement_id, repertoire_id, and data_processing_id). We don't advise or specify at what level something like a clone_id is unique... Or even a repertoire_id or data_processing_id. MiAIRR specifies that study IDs should be unique (typically an INSDC study related identifier) with subject_ids and sample_ids unique within studies.

It is not well defined what the relationship between _ids is from this level down (pair_id, clone_id, cell_id, rearrangement_id, repertoire_id, and data_processing_id)

One can probably infer (if you know the AIRR spec well) that repertoire_id should be unique at least within a study, maybe a subject. The reality is that the repertoire_id should be unique at the repository level (as they are the IDs returned by the /repertoire API endpoint), but that isn't actually stated in the spec unless I am missing something...

data_processing_id should be unique within a repertoire_id at least. It feels like data_processing_id should be unique at the repository level as well, so you can easily identify a set of rearrangements that have been processed with the same data_processing without having to do a combined repertoire_id x data_processing_id query, but again, nothing is explicitly stated in the spec.

pair_id, clone_id, and cell_id should probably be unique at least within a repertoire_id/data_processing_id pair. If data_processing_id is unique within the repository, then it is sufficient to say unique within the data_processing_id.

Finally, rearrangement_id should be unique to the repository as well; that is, it is the repository's internal identifier for a single rearrangement entry. This is the only one that states anything about uniqueness at the moment.

Should we review this?

@schristley
Member

descriptions we do not mention uniqueness criteria for many (any?) of them

The global uniqueness for repertoire_id and uniqueness of data_processing_id within a repertoire is in the documentation for the repertoire schema (look under Linking Data)

http://docs.airr-community.org/en/metadata-docs/datarep/metadata.html

but you are right that most of the other _ids aren't well specified.

@schristley
Member

MiAIRR specifies that study IDs should be unique

I'm not sure this is true, where does it say that?

@javh
Contributor

javh commented Sep 11, 2019

Yeah, they don't seem well documented. For pair_id, clone_id, and cell_id do you mean "unique" in the sense of "a uniquely identifiable clone_id/cell_id/pair_id represents all rows assigned to the same clonal cluster/cell/receptor"? By definition, they won't be unique in the same sense as sequence_id which is a 1-to-1 relationship with id-to-rows, as they are 1-to-many.
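
To make the two senses of "unique" concrete, here is a minimal Python sketch. The rows and ID values are made up for illustration and are not taken from any repository, and scoping clone_id by repertoire_id is only one of the scopes being discussed in this thread. It shows sequence_id mapping 1-to-1 onto rows, clone_id mapping 1-to-many, and a bare clone_id mixing rows from different repertoires unless it is scoped.

```python
from collections import Counter, defaultdict

# Hypothetical rearrangement rows; only the ID columns are shown.
rows = [
    {"sequence_id": "s1", "repertoire_id": "r1", "clone_id": "c1"},
    {"sequence_id": "s2", "repertoire_id": "r1", "clone_id": "c1"},
    {"sequence_id": "s3", "repertoire_id": "r1", "clone_id": "c2"},
    {"sequence_id": "s4", "repertoire_id": "r2", "clone_id": "c1"},
]


def is_one_to_one(rows, field):
    """Return True if every value of `field` appears on exactly one row."""
    counts = Counter(row[field] for row in rows)
    return all(n == 1 for n in counts.values())


def group_rows(rows, *fields):
    """Group sequence_ids by the tuple of values found in `fields`."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[f] for f in fields)].append(row["sequence_id"])
    return dict(groups)


sequence_id_unique = is_one_to_one(rows, "sequence_id")  # 1-to-1 with rows
clone_id_unique = is_one_to_one(rows, "clone_id")        # 1-to-many by design

# Unscoped, clone "c1" pulls in a row from a different repertoire.
by_clone = group_rows(rows, "clone_id")
# Scoped by repertoire_id, the two "c1" clusters stay separate.
by_scoped_clone = group_rows(rows, "repertoire_id", "clone_id")
```

With this toy data, the unscoped grouping puts s1, s2 and s4 under the same clone_id, which is the ambiguity a scoping statement in the description would remove.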

Ie, if we change the wording of cell_id from:

Identifier defining the cell of origin for the query sequence.

To:

Identifier uniquely defining the cell of origin for the query sequence.

Does that address the concern? Do we need to specify within the same rearrangement_id, repertoire_id, or file?

@bussec
Member

bussec commented Sep 11, 2019

MiAIRR specifies that study IDs should be unique

I'm not sure this is true, where does it say that?

It is not stated explicitly for MiAIRR in general, but the NCBI implementation requires mapping of study_id to BioProject's Project/ProjectID/ArchiveID/accession attribute, which is a UID (see here).

@schristley
Member

NCBI implementation requires mapping of study_id to BioProject's Project/ProjectID/ArchiveID/accession attribute

Okay, right, so technically it is unique within a data repository and it could (potentially) be globally unique if those repositories have IDs that don't conflict.

@bcorrie
Contributor Author

bcorrie commented Sep 11, 2019

MiAIRR specifies that study IDs should be unique

I'm not sure this is true, where does it say that?

It is not stated explicitly for MiAIRR in general, but the NCBI implementation requires mapping of study_id to BioProject's Project/ProjectID/ArchiveID/accession attribute, which is a UID (see here).

We have this statement:

1 / study	Study	string	Free text	Unique ID assigned by study registry	PRJNA001	study_id

in:
https://github.com/airr-community/airr-standards/blob/metadata-docs/AIRR_Minimal_Standard_Data_Elements.tsv

Assuming "study registry" is an INSDC repository, then I think we have uniqueness don't we?

@bcorrie
Contributor Author

bcorrie commented Sep 11, 2019

The global uniqueness for repertoire_id and uniqueness of data_processing_id within a repertoire is in the documentation for the repertoire schema (look under Linking Data)

OK, I had missed that... I think that this should probably be mentioned in the "description" of those fields in the spec, no? I have added some of this to the repertoire_id "description" in the spec file. This is quite an important link between the two API entry points, so I think it should be clear...

@schristley
Member

added some of this to the repertoire_id "description" in the spec file

Yeah, that's fine for now. In #219, I mention to @bussec about some fields having really long descriptions and that kinda makes the table look not so great, repertoire_id is kinda on the edge of being obnoxious, but we can probably trim it down to be more concise. I think some of this stuff needs to be put in a Definitions Clarification in the docs, like was done with the Rearrangement schema, versus trying to cram it all into the description.

http://docs.airr-community.org/en/metadata-docs/datarep/metadata.html#repertoire-fields

@bcorrie
Contributor Author

bcorrie commented Sep 11, 2019

data_processing_id should be unique within a repertoire_id at least. It feels like data_processing_id should be unique at the repository level as well, so you can easily identify a set of rearrangements that have been processed with the same data_processing without having to do a combined repertoire_id x data_processing_id query, but again, nothing is explicitly stated in the spec.

So what about this case? From a repository optimization perspective, it would be VERY useful to have data_processing_id be unique at the same level as repertoire_id (unique within a repository). When one queries at the rearrangement level, it would be nice to be able to query directly for just the rearrangements for a repertoire processed with a specific tool (a specific data_processing_id).

In fact, I would argue that the most common rearrangement queries would be at the data_processing_id level, and one would rarely be searching rearrangement data for a specific repertoire_id as a single set of data with different data_processing applied (e.g. MiXCR and igblast with the annotations not separated by data_processing_id). It is more likely you would be asking for a single data_processing_id from within each repertoire that you are interested in. For example, I think common rearrangement query scenarios would be, for a specific set of repertoire_ids that I am interested in:

  1. I want all of the "primary" processed rearrangements for each repertoire. That is, give me all of the rearrangements from all of my repertoires of interest where the "data_processing_id" is the "primary_annotation" for each repertoire. If there is only one data_processing, then that is by default the "primary_annotation"
  2. I want all of the "MiXCR" annotated data. That is, give me all of the rearrangements from all of my repertoires of interest where the "data_processing_id" is the data that has been annotated by MiXCR. That is data_processing.software_versions contains "MiXCR".

These are all building lists of data_processing_ids to search on, and you almost always want to be using a single data_processing object from a repertoire (correct me if I am wrong). The main time you wouldn't want to have one data_processing_id is if you were comparing between data_processing_ids within a single repertoire (comparing the results of MiXCR vs igblast). Even in this case, you would want to split the rearrangement data between the data_processing_ids so you could separate the MiXCR and igblast data for comparison.
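
As a rough illustration of the two scenarios above, here is a Python sketch that builds the search lists from repertoire metadata. The repertoire records, IDs and version strings are invented for the example, and the helper names `primary_pairs` and `software_pairs` are hypothetical. It returns (repertoire_id, data_processing_id) pairs rather than bare data_processing_ids, so it does not assume anything about how widely a data_processing_id is unique.

```python
# Hypothetical repertoire metadata, trimmed to the fields used here.
repertoires = [
    {
        "repertoire_id": "rep-1",
        "data_processing": [
            {"data_processing_id": "dp-1", "primary_annotation": True,
             "software_versions": "IgBLAST 1.14"},
            {"data_processing_id": "dp-2", "primary_annotation": False,
             "software_versions": "MiXCR 3.0"},
        ],
    },
    {
        "repertoire_id": "rep-2",
        "data_processing": [
            # A single object with no explicit flag: treated as primary by default.
            {"data_processing_id": "dp-1", "software_versions": "MiXCR 3.0"},
        ],
    },
]


def primary_pairs(repertoires):
    """Scenario 1: the primary annotation of each repertoire."""
    pairs = []
    for rep in repertoires:
        processing = rep["data_processing"]
        for dp in processing:
            if dp.get("primary_annotation") or len(processing) == 1:
                pairs.append((rep["repertoire_id"], dp["data_processing_id"]))
    return pairs


def software_pairs(repertoires, name):
    """Scenario 2: every data_processing whose software_versions mentions `name`."""
    return [
        (rep["repertoire_id"], dp["data_processing_id"])
        for rep in repertoires
        for dp in rep["data_processing"]
        if name.lower() in (dp.get("software_versions") or "").lower()
    ]


primary = primary_pairs(repertoires)
mixcr = software_pairs(repertoires, "MiXCR")
```

Note that "dp-1" appears in both repertoires with different software behind it, which is allowed when data_processing_id is only unique within a repertoire.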

In the cases where there is only one data_processing object, we state that one should use a repertoire_id rather than a data_processing_id. I think this could get quite cumbersome, as then you are generating queries that have a mix of repertoire_id (if there is only one data_processing object in the repertoire) and data_processing_id (if there is more than one data_processing object in the repertoire).

In most cases it seems to me that using data_processing_id rather than repertoire_id will be the rearrangement query of choice. If that is true, we want to optimize our searches at least as well for data_processing_id as we do for repertoire_id. Having data_processing_id be unique at the repository level would help enormously with this...

@schristley
Member

...

man, that's a lot of words, do I really need to read all that? Did you just have a shot of espresso? ;-D

I'm not against data_processing_id being unique within repository, I guess I'm also okay with it being globally unique like repertoire_id but neither seem to be needed for the common query scenarios that you mention.

Just remember that a data_processing_id won't necessarily get you the rearrangements for all the repertoires in a study; it will only get you the repertoires that were processed the same way. There is nothing preventing users from processing repertoires within a study differently. So from that perspective, you will likely need repertoire_id to include/exclude the proper repertoires, and yeah, use repertoire_id and data_processing_id as a combo key.

From an implementation perspective, the data_processing_id ends up being unique within VDJServer. This is because we store the analysis provenance as an individual object in the database so it gets a uuid. But, for example, VDJServer processes B and T cells differently, so in a combined study like the Florian study, the B cell rearrangements have a different data_processing_id from the T cell rearrangements.

@bcorrie bcorrie added the ADC API V1 AIRR Data Commons API V1 label Sep 12, 2019
@bcorrie
Contributor Author

bcorrie commented Sep 12, 2019

...

man, that's a lot of words, do I really need to read all that? Did you just have a shot of espresso? ;-D

Yeah, sorry, I was challenged to try and capture the problem clearly 8-)

@bcorrie
Contributor Author

bcorrie commented Sep 12, 2019

From an implementation perspective, the data_processing_id ends up being unique within VDJServer.

Same for iReceptor, and this seems to be really useful, and was one of the drivers for my question. In addition, it looks like the iReceptor Gateway will be extracting data_processing_id from Repertoires and generating rearrangement queries using data_processing_id and NOT repertoire_id. Given the above, it seems to me that there are good reasons to make it a "unique within repository" id and not too many against...

@bcorrie
Contributor Author

bcorrie commented Sep 12, 2019

But, for example, VDJServer processes B and T cells differently, so in a combined study like the Florian study, the B cell rearrangements have a different data_processing_id from the T cell rearrangements.

Would these two data_processing objects (one for B cells and one for T cells) be in the same Repertoire in your API response?

Would it be possible for you to generate an example /airr/v1/repertoire response for a single repertoire that would have this structure. I think we understand what this would look like, but having a concrete example for us to work with from a Gateway presentation layer would be very helpful!!!

As far as I have seen, the repertoire responses on the docs pages only have a single data_processing object for each repertoire.

@schristley
Member

Would these two data_processing objects (one for B cells and one for T cells) be in the same Repertoire in your API response?

No.

having a concrete example

look at the florian example data:

https://github.com/airr-community/airr-standards/blob/master/lang/python/examples/florian.airr.yaml

or

the test data set as I've enhanced it somewhat:

https://github.com/airr-community/adc-api-tests/blob/master/datasets/florian/florian.airr.yaml

@bcorrie
Contributor Author

bcorrie commented Sep 12, 2019

Just remember that a data_processing_id won't necessarily get you the rearrangements for all the repertoires in a study; it will only get you the repertoires that were processed the same way. There is nothing preventing users from processing repertoires within a study differently. So from that perspective, you will likely need repertoire_id to include/exclude the proper repertoires, and yeah, use repertoire_id and data_processing_id as a combo key.

I think my main point in my rambling above was that it seemed to me that one would almost never do a search at the rearrangement level for a repertoire_id EXCEPT in the case where there was only one data_processing object.

The reason for this is that if any Repertoire has more than one data_processing object, when looking for rearrangements for that Repertoire you are almost always going to want to be explicit about which rearrangements you are retrieving (how they were processed and therefore which data_processing_id), otherwise the rearrangements returned will be very confusing! In my examples above where a Repertoire has more than one data_processing object, you would almost always want either the rearrangements from the "primary_annotation" or the rearrangements that have been processed in a specific way (e.g. by an explicit tool such as MiXCR).

If you have to search by data_processing_id for some rearrangements from some Repertoires, then it makes sense to be consistent and always search for data_processing_id even when there is only one data_processing object.

@bcorrie
Contributor Author

bcorrie commented Sep 12, 2019

Would these two data_processing objects (one for B cells and one for T cells) be in the same Repertoire in your API response?

No.

OK... Too bad in a way, as we are looking for a concrete example where this would occur in a study...

Currently, as far as I know, all of our data (meaning IPA and VDJServer) has Repertoires with single sample and single data_processing objects. This is easy... The iReceptor Gateway has to handle the situation when a Repertoire can have either an array of sample objects or an array of data processing objects (or both), and it is very unclear to us when this would occur, how this should be presented to the user, and how queries about the rearrangements in such a Repertoire should be generated.

@schristley
Member

I think my main point in my rambling above was that it seemed to me that one would almost never do a search at the rearrangement level for a repertoire_id EXCEPT in the case where there was only one data_processing object.

Incorrect, you will almost ALWAYS want to use a repertoire_id AND a data_processing_id to get the rearrangements that you want. It's only in the special case when the repertoire has just a single data_processing that you can leave data_processing_id out.

The reason for this is that if any Repertoire has more than one data_processing object...

You are latching onto the scenario of multiple data_processing objects, I agree with all your points about that scenario. But in that scenario, you seem to be indicating that the repertoire_id is not relevant, and that's incorrect. So here is a contrived example:

Given a study that has 10 repertoires. 5 healthy control repertoires and 5 cancer repertoires. They all have a single data_processing object.

Now a user comes along, they do a query for all healthy repertoires, they get those 5 out of 10 repertoires from that study (plus presumably repertoires from other studies).

Now if you do a query on the rearrangements using ONLY the data_processing_id, you will get rearrangements for all 10 repertoires, which is wrong. The only way to get the correct rearrangements is to query on those 5 repertoire_ids AND the data_processing_id.

So the repertoire_id is always needed when querying the rearrangements, that's how the API was designed!

This is regardless of whether the data_processing_id is unique or not. The uniqueness doesn't guarantee that you get the proper repertoires.
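
The contrived example can be played out in a few lines of Python. This is only a sketch: the records are invented, there is one illustrative rearrangement per repertoire, and the flat `disease` field is a hypothetical stand-in for whatever metadata the repertoire query actually filtered on.

```python
# Hypothetical study: 10 repertoires that all share one data_processing_id.
repertoires = [
    {
        "repertoire_id": f"rep-{i}",
        "disease": "healthy" if i <= 5 else "cancer",  # stand-in metadata field
        "data_processing_id": "dp-1",
    }
    for i in range(1, 11)
]

# One illustrative rearrangement per repertoire.
rearrangements = [
    {
        "rearrangement_id": f"rearr-{i}",
        "repertoire_id": f"rep-{i}",
        "data_processing_id": "dp-1",
    }
    for i in range(1, 11)
]

# Step 1: the metadata query selects the healthy repertoires.
healthy_ids = {r["repertoire_id"] for r in repertoires if r["disease"] == "healthy"}

# Step 2a: filtering rearrangements on data_processing_id alone.
by_dp_only = [x for x in rearrangements if x["data_processing_id"] == "dp-1"]

# Step 2b: filtering on the selected repertoire_ids AND the data_processing_id.
by_both = [
    x for x in rearrangements
    if x["repertoire_id"] in healthy_ids and x["data_processing_id"] == "dp-1"
]
```

Here `by_dp_only` covers all 10 repertoires, cancer included, while `by_both` covers only the 5 healthy ones, and that holds however widely "dp-1" is unique.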

@schristley
Member

So the repertoire_id is always needed when querying the rearrangements, that's how the API was designed!

That's assuming the standard workflow where you query metadata first to get a list of repertoires, then query rearrangements. Of course, you can also go the other way and query rearrangements first to get a list of repertoires, then lookup their metadata, like if doing a straight CDR3 search.

@schristley
Member

The iReceptor Gateway has to handle the situation when a Repertoire can have either an array of sample objects or an array of data processing objects (or both), and it is very unclear to us when this would occur, how this should be presented to the user, and how queries about the rearrangements in such a Repertoire should be generated.

The array of sample objects is useful for display/query purposes on the repertoire metadata, but becomes irrelevant when querying rearrangements because those samples all collapse into a single repertoire_id.

The array of data processing objects is relevant, and needs to be handled because, in general, when you query a bunch of studies, they are all going to have different data processing. So how are the users going to decide which ones they want?? This gets to one of the fundamental questions we've been debating in iR+, if everything is processed differently...

@bcorrie
Contributor Author

bcorrie commented Sep 12, 2019

Given a study that has 10 repertoires. 5 healthy control repertoires and 5 cancer repertoires. They all have a single data_processing object.

Do you mean that there is:

  • one data_processing object (and therefore one data_processing_id) in the entire study
  • all 10 repertoires have a single data_processing object
  • all 10 repertoires refer to the same data_processing object by referring to the same data_processing_id

In this case, all the rearrangements in this study also have the same data_processing_id.

Correct??? 8-)

@schristley
Member

schristley commented Sep 12, 2019

Do you mean...

Yes to all. I kept it simple. Did you understand my point?

Now if you do a query on the rearrangements using ONLY the data_processing_id, you will get rearrangements for all 10 repertoires, which is wrong. The only way to get the correct rearrangements is to query on those 5 repertoire_ids AND the data_processing_id.

Unless you are going to be pedantic and say "you don't need AND data_processing_id in that case because there is only one" then I would say yes, yes, that isn't the point I was trying to get across.

@bcorrie
Contributor Author

bcorrie commented Sep 13, 2019

Do you mean...

Yes to all. I kept it simple. Did you understand my point?

Yes, but I think this is where my confusion originally stemmed from and is similar to the reason why I was suggesting that we should change it so data_processing_id be unique to the repository. The uniqueness criteria of these _id fields are still very fuzzy.

Your example above, as I described it, uses a single data_processing_id to be referred to by several independent repertoires, which requires a data_processing_id that is unique across the repository. The current spec/docs do not allow for this. It doesn't stop you using the same data_processing_id for multiple repertoires, but it doesn't enforce the fact that they are the same nor does it restrict another repertoire from reusing the same data_processing_id for a completely different data_processing process (http://docs.airr-community.org/en/metadata-docs/datarep/metadata.html):

The data_processing_id is only unique within a Repertoire so repertoire_id should first be used to get the appropriate Repertoire object and then data_processing_id used to acquire the appropriate DataProcessing.

With our current definition of requiring a data_processing_id to be unique within a repertoire, your example above works because it is the repertoire_id, data_processing_id pair that is unique. The fact that the data_processing_id is the same across them all doesn't really have an impact. If this is the case, the argument for making it unique across the repository is probably not that important...

I think what I was looking for in suggesting uniqueness for data_processing_id was a unique repository wide identifier for each repertoire_id, data_processing_id pair. I was looking for a single _id that I could use to get all of the rearrangements for a specific repertoire and a specific data processing as applied to that repertoire VERY efficiently. As you say, that is not what a data_processing_id is!

In hindsight, I think it best to leave that as an internal repository optimization if desired/required. A repository can implement having unique data_processing_ids (I think VDJServer does/will). A specific researcher could build a single data_processing object and reuse it. And a repository could create an internal compound index on repertoire_id, data_processing_id to optimize rearrangement lookups.
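
For what it's worth, the internal compound index idea is small enough to sketch in Python. This is an in-memory illustration only, with made-up rows and a hypothetical `lookup` helper; a real repository would use its database's own compound index instead.

```python
from collections import defaultdict

# Hypothetical rearrangement rows. Note that data_processing_id "1" is reused
# by two different repertoires, which the current spec permits.
rearrangements = [
    {"rearrangement_id": "a", "repertoire_id": "rep-1", "data_processing_id": "1"},
    {"rearrangement_id": "b", "repertoire_id": "rep-1", "data_processing_id": "2"},
    {"rearrangement_id": "c", "repertoire_id": "rep-2", "data_processing_id": "1"},
    {"rearrangement_id": "d", "repertoire_id": "rep-1", "data_processing_id": "1"},
]

# Compound index keyed on the (repertoire_id, data_processing_id) pair.
index = defaultdict(list)
for row in rearrangements:
    key = (row["repertoire_id"], row["data_processing_id"])
    index[key].append(row["rearrangement_id"])


def lookup(repertoire_id, data_processing_id):
    """Return the rearrangement_ids for one repertoire/data_processing pair."""
    # .get avoids creating empty entries for pairs that do not exist.
    return index.get((repertoire_id, data_processing_id), [])
```

The pair is the key, so `lookup("rep-1", "1")` and `lookup("rep-2", "1")` stay separate even though they share a data_processing_id, and nothing about the spec has to change.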

I don't think the spec and the API are the places to enforce any of these. Maybe we don't need to change how data_processing_id is defined.

@bcorrie
Contributor Author

bcorrie commented Sep 13, 2019

The array of sample objects is useful for display/query purposes on the repertoire metadata, but becomes irrelevant when querying rearrangements because those samples all collapse into a single repertoire_id.

Can you give me a concrete example of how this would be used? I don't follow the use case of when you would have multiple samples in a Repertoire and how one would map rearrangements to that repertoire... I understand the use case of multiple data_processing objects in a Repertoire, but not the multiple sample objects in a Repertoire.

@schristley
Member

Yes, but I think this is where my confusion originally stemmed from and is similar to the reason why I was suggesting that we should change it so data_processing_id be unique to the repository.

Okay, good, I was having difficulty coming up with a clear example to explain that it was immaterial whether data_processing_id was unique to the repository or not.

@schristley
Member

because it is the repertoire_id, data_processing_id pair that is unique.

Correct, and that is the case for the other _ids as well: clone_id, cell_id and pair_id.

In some sense, rearrangement_id could be like those as well but because we have an explicit API entrypoint for it, it needs to be unique at the repository level.

@schristley
Member

schristley commented Sep 13, 2019

Can you give me a concrete example of how this would be used?

It's a contrived example though not completely crazy. Let's say a study with one subject where the patient goes through a treatment. Initially a single blood draw which is sequenced and becomes a single pre-treatment sample. So at this point, we have a single repertoire with a single sample.

Some time later the patient is treated, and at that time another blood draw is taken, but also a tissue sample is taken, both are sequenced. In particular, the tissue sample has a disease_state_sample: cancer, while the two blood samples have disease_state_sample: null because the "histopathologic evaluation" indicates the blood is normal.

Now the researcher wants to analyze all three samples together, say to extract common clones, so creates a single repertoire object with three samples. Very concisely the repertoire looks like this:

repertoire:
  repertoire_id: some-id
  sample:
    - sample_id: blood pre-treatment
    - sample_id: blood post-treatment
    - sample_id: tissue post-treatment
      disease_state_sample: cancer
  data_processing:
    - data_processing_id: 1
      primary_annotation: true

The study is published, the data is made public. Now somebody comes along and does a query for repertoire with cancer samples, something like this:

{ "filters": { "op": "=", "content": { "field": "sample.disease_state_sample", "value": "cancer" }}}

So I hope you agree that this repertoire will show up in the query results.

Now if that person looks at the repertoire, the UI will show them it has three samples, and they look at them in detail and say oh, it's two blood samples and one tissue sample combined together for analysis. Then they make some decision on whether they want to use the rearrangements from that repertoire or not. If they do, then they query the rearrangement entrypoint with the repertoire_id and the data_processing_id.

Is this an example you are looking for?
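
A toy version of that workflow in Python, in case it helps. The `matches` and `field_values` helpers are hypothetical and only understand the "=" and "and" operators; they are not the ADC API implementation. The top-level `filters` key and the "and" operator reflect my reading of the ADC API query format and should be checked against the API docs.

```python
import json

# The repertoire from the example above, as a Python dict.
repertoire = {
    "repertoire_id": "some-id",
    "sample": [
        {"sample_id": "blood pre-treatment", "disease_state_sample": None},
        {"sample_id": "blood post-treatment", "disease_state_sample": None},
        {"sample_id": "tissue post-treatment", "disease_state_sample": "cancer"},
    ],
    "data_processing": [{"data_processing_id": "1", "primary_annotation": True}],
}


def field_values(obj, dotted):
    """Collect every value reachable via a dotted path, fanning out over arrays."""
    current = [obj]
    for key in dotted.split("."):
        found = []
        for item in current:
            for candidate in (item if isinstance(item, list) else [item]):
                if isinstance(candidate, dict) and key in candidate:
                    found.append(candidate[key])
        current = found
    values = []
    for value in current:
        values.extend(value if isinstance(value, list) else [value])
    return values


def matches(obj, flt):
    """Evaluate a filter against one object (toy subset: '=' and 'and')."""
    op, content = flt["op"], flt["content"]
    if op == "and":
        return all(matches(obj, sub) for sub in content)
    if op == "=":
        return content["value"] in field_values(obj, content["field"])
    raise ValueError(f"unsupported op: {op}")


# Step 1: the repertoire query. One cancer sample out of three is enough to match.
repertoire_query = {"filters": {"op": "=", "content": {
    "field": "sample.disease_state_sample", "value": "cancer"}}}
hit = matches(repertoire, repertoire_query["filters"])

# Step 2: the follow-up rearrangement query uses both IDs from the repertoire.
rearrangement_query = {"filters": {"op": "and", "content": [
    {"op": "=", "content": {"field": "repertoire_id",
                            "value": repertoire["repertoire_id"]}},
    {"op": "=", "content": {"field": "data_processing_id",
                            "value": repertoire["data_processing"][0]["data_processing_id"]}},
]}}
request_body = json.dumps(rearrangement_query)
```

The second query carries no sample information at all, so the rearrangements it returns cover all three samples together.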

@bcorrie
Contributor Author

bcorrie commented Sep 13, 2019

Yes, that is great... thanks... We are trying to determine what level of data should appear on what we used to call our "samples" page and is now called our "repertoire" page. Currently, we display samples on the repertoire page... This is fine at the moment, because all of our data has one sample per repertoire and one data processing per repertoire.

But there are a bunch of different ways you could handle that in the general schema case...

  • one row per repertoire
  • one row per sample
  • one row per data_processing
  • one row per data_processing/sample pair

Once we get that sorted, we then need to figure out what to do to get the rearrangements for the entities that you decide you are interested in from the above list. Essentially, we need to generate a query with repertoire_ids and data_processing_ids. I think I can see how you would do most combinations above, but...

Let's say I am a researcher and I want data from blood samples where the disease state of the sample is cancer and I want post treatment data only. So I only want the rearrangements from one of the samples in the Repertoire. In this example, I don't see a way to do that by querying the rearrangements API entry point... Even a repertoire_id/data_processing_id pair does not allow me to differentiate the rearrangements between the samples, so I can't get just the rearrangements from the blood post treatment sample...

@schristley
Member

So I only want the rearrangements from one of the samples in the Repertoire. In this example, I don't see a way to do that by querying the rearrangements API entry point... Even a repertoire_id/data_processing_id pair does not allow me to differentiate the rearrangements between the samples, so I can't get just the rearrangements from the blood post treatment sample

Correct, in general that is not possible. That's all the gory details from #181 (if you re-read that, change the old "rearrangement set" terminology to "data_processing_id")

@bcorrie
Contributor Author

bcorrie commented Sep 13, 2019

Nooooooooooo, not #181

Maybe this is why I have been so confused... In jumping to the end of #181 we discuss having a sample_processing_id, and for each rearrangement I suggested having "... (three identifiers, RepertoireID, SampleProcessingID, and SoftwareProcessingID)" This seemed to have pretty general consensus.

In our current spec we have repertoire_id (RepertoireID) and data_processing_id (SoftwareProcessingID). What happened to SampleProcessingID? At the end of the issue you mentioned having a sample_processing_id that would be sufficient for many cases, but we don't have that in our current spec? I think this needs to be added, no?

@bcorrie
Contributor Author

bcorrie commented Feb 26, 2020

Right now, we have a documented mechanism, we can make that more precise. I'm still not sure why you think it is so challenging. Either you are over-complicating it or are being too expansive. The simplest technique (which is what is documented) is to use a repository unique prefix code, like "ipa" or "vdjs" or something, then attach that to repository unique number or code, so "vdjs-1", "vdjs-2" and so on.

I think the problem is we are combining multiple roles for repertoire_id. If we need to differentiate such things in an API response, I would prefer to have a separate field in the ADC API response rather than conflate the repertoire_id to capture two different concepts. In the model you are suggesting you are combining the bioinformatic concept of Repertoire with the technology concept Repository. This seems very messy to me...

The ADC API could just as easily have a separate field in the response that provided this information that looked something like this:

"Repertoire": [
  {"repertoire_id":"4357957907784536551-242ac11c-0001-012","repository_id":"vdjs1", ...}
]

and

  "Rearrangement":
  [
    {
      "rearrangement_id":"5d6fba725dca5569326aa104",
      "repertoire_id":"1841923116114776551-242ac11c-0001-012",
      "repository_id":"vdjs1",
      "... remaining fields":"snipped for space"
    }
  ]

I don't think we want repositories and API responses changing fields in the specification, in particular changing fields that might be provided by a researcher.

For example, think of this from a DataRep perspective. I, as a researcher, want to use AIRR Repertoire JSON and Rearrangement TSV files to document a study (much like you have done for the Florian study). I want to use standards to document my study in an AIRR compliant way, in particular so I can use AIRR compliant tools to process my data. I manually choose repertoire_id names that are meaningful to me as a researcher. They are unique in my study, and allow me to map rearrangements in my Rearrangement TSV files to my repertoire metadata in my Repertoire JSON file.

Using the AIRR formats in this use case scenario doesn't require any change from a researcher. In fact, they can go from this simple use case all the way to loading the data into an ADC repository and operating on federated data transparently, without any of the Repertoire metadata needing to change. The only change required to work on federated data globally is the addition of another field.

In fact, if we really wanted to do this right, we would have a DOI for each AIRR Repository (make that a condition of being AIRR compliant) and then we could have:

"Repertoire": [
  {
      "repertoire_id":"4357957907784536551-242ac11c-0001-012",
      "repository_doi":"https://doi.org/10.25504/FAIRsharing.ekdqe5", ...
  }
]

@schristley
Member

In fact, if we really wanted to do this right, we would have a DOI for each AIRR Repository (make that a condition of being AIRR compliant) and then we could have:

Yeah, after #320 I've started thinking this route too. Though my thought was to provide a DOI for the repertoire versus a DOI for the repository

"Repertoire": [
  {
      "repertoire_id":"4357957907784536551-242ac11c-0001-012",
      "repertoire_doi":"https://vdjserver.org/airr/v1/4357957907784536551-242ac11c-0001-012", ...
  }
]

I'm not sure which is better. The important thing is that the fully qualified URL is available or can be constructed (we would need to document exactly how to do that).

Regardless, we still haven't resolved the issue that repertoires downloaded from two different repositories may have repertoire_ids that conflict.

We should discuss this in the CRWG meeting tomorrow and see if we can come to a solution.

@bcorrie
Contributor Author

bcorrie commented Feb 27, 2020

Yeah, after #320 I've started thinking this route too. Though my thought was to provide a DOI for the repertoire versus a DOI for the repository

That is a lot of DOIs 8-)

@bcorrie
Contributor Author

bcorrie commented Feb 27, 2020

Regardless, we still haven't resolved the issue that repertoires downloaded from two different repositories may have repertoire_ids that conflict.

My thought is that it is OK for repertoire_ids to conflict if we have another field for the AIRR Data Commons that makes a repertoire unique "globally" (at least unique in the ADC). repertoire_id is part of the informatic data model (a "DataRep" thing) and is something you need to make a study describable using the AIRR Standards. In this case, you don't need something globally unique.

If you are working at the AIRR Data Commons level and federating data from all over the place, then repertoire_doi (or whatever we call it) is the ADC thing that is necessary for the ADC to work. My concern is overloading one field to serve both purposes...
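As a rough sketch of the two-field idea (an illustration, not anything in the spec): leave repertoire_id untouched and key federated records on the pair (repository_id, repertoire_id). The repository_id field name is borrowed from the example API response earlier in this thread; the function name, the short repertoire_id values, and the second repository are made up for illustration.

```python
from collections import defaultdict


def index_repertoires(repertoires):
    """Index repertoire records by (repository_id, repertoire_id).

    Returns the index plus the set of bare repertoire_ids that appear
    in more than one repository (i.e. the ones that would conflict
    if repertoire_id alone were used as the key).
    """
    index = {}
    repos_by_id = defaultdict(set)
    for rep in repertoires:
        key = (rep["repository_id"], rep["repertoire_id"])
        if key in index:
            # Within a single repository the id is still expected to be unique.
            raise ValueError(f"duplicate repertoire within a repository: {key}")
        index[key] = rep
        repos_by_id[rep["repertoire_id"]].add(rep["repository_id"])
    conflicts = {rid for rid, repos in repos_by_id.items() if len(repos) > 1}
    return index, conflicts


records = [
    {"repository_id": "vdjs1", "repertoire_id": "rep-1"},
    {"repository_id": "vdjs1", "repertoire_id": "rep-2"},
    # Hypothetical second repository that happens to reuse "rep-1".
    {"repository_id": "other_repo", "repertoire_id": "rep-1"},
]
index, conflicts = index_repertoires(records)
```

Here all three records stay addressable even though "rep-1" is used twice, and `conflicts` tells a client which ids would be ambiguous without the extra field.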

@schristley
Member

That is a lot of DOIs 8-)

haha true! Though I was meaning DOI in the general context of a digital object identifier and not the doi.org service...

So actually then repertoire_doi is semantically confusing as repertoire_id is the actual digital object identifier.

I think the problem is we are combining multiple roles for repertoire_id.

That wasn't my intent. I was just suggesting a scheme to construct a global identifier, similar to how SRA and ENA co-exist. Both accept raw sequence data, SRA prefixes its identifiers with SRP while ENA prefixes with ERP.

My thought is that it is OK for repertoire_ids to conflict if we have another field for the AIRR Data Commons that makes a repertoire unique "globally" (at least unique in the ADC). repertoire_id is part of the informatic data model (a "DataRep" thing) and is something you need to make a study describable using the AIRR Standards. In this case, you don't need something globally unique.

If you are working at the AIRR Data Commons level and federating data from all over the place, then repertoire_doi (or whatever we call it) is the ADC thing that is necessary for the ADC to work. My concern is overloading one field to serve both purposes...

Just to be clear, repertoire_id was devised by CRWG, not DataRep, not MiAIRR. So no, it's not a "DataRep thing". I understand how it seems that way now, and maybe you are right that it's been "taken over" by DataRep and used for a different purpose, but it was initially created as an "ADC thing that is necessary for the ADC to work." But it's not published yet and made into the standard, so CRWG can still decide what its purpose is and make any changes.

Now did we make an initial mistake with repertoire_id by considering it to be just a simple identifier versus a fully qualified DOI? Probably. Maybe you are right and we need two separate fields for two separate purposes. That I'm not so sure about; why not just make repertoire_id a fully qualified name? I never really liked that though, because it seems like a waste of space, especially when talking about rearrangements, but I still think global uniqueness is extremely useful. Even more so, the CRWG recognized that as important, as it put key provisions into the recommendations document for unique identifiers (specifically 8 and 9).

Now IEDB takes the two field approach.

"Epitope ID": 16878
"Epitope IRI": "http://www.iedb.org/epitope/16878"

but their data size is much smaller and they are a centralized database. We need to think a little more carefully about our distributed system as well as the data size.

If every rearrangement record has those two fields, that seems less than ideal.

@javh
Contributor

javh commented Feb 27, 2020

All of the *_id fields are supposed to be a unique key in their given context. What's the issue with just putting a DOI in the repertoire_id field, modifying the value of the field upon import if necessary to make it universally unique? In general, I think proliferation of identifier fields, each with a different scope, is more harmful than helpful. (I have a similar feeling about sequence_id and rearrangement_id being redundant in Rearrangement, but I think that's water under the bridge.)

If you want some sort of descriptive, but not unique, identifier I think it'd be better to add fields for that purpose like those that exist in Study, rather than more identifier fields. Eg, repertoire_description.

@schristley
Member

I have a similar feeling about sequence_id and rearrangement_id being redundant in Rearrangement, but I think that's water under the bridge.

@javh Is it too late to fix this? I don't like it either, but now really is the time to fix it, as it's only going to cause more hassles as we add objects that reference the rearrangements (Node being the first)

@schristley
Member

What's the issue with just putting a doi in the repertoire_id field?

For me it's mostly wasted space in the rearrangement table, having https://vdjserver.org/airr/v1/ prefixed to each repertoire_id is a waste. I'd much prefer that it was in the metadata where maybe a single field repository_url: https://vdjserver.org/airr/v1 might suffice.
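A minimal sketch of what I mean, assuming a single repository_url field in the metadata and a simple "base URL + / + repertoire_id" joining rule. Both the field and the rule are assumptions here (the exact construction would still need to be documented), and the helper name is made up.

```python
def qualified_repertoire_id(repository_url, repertoire_id):
    """Join a repository base URL and a local repertoire_id.

    Assumes the repertoire_id is already URL-safe; a real scheme would
    need to say how to escape ids that are not.
    """
    return repository_url.rstrip("/") + "/" + repertoire_id


# The base URL is stored once in the metadata ...
metadata = {"repository_url": "https://vdjserver.org/airr/v1"}
# ... while each rearrangement row carries only the short local id.
rearrangement = {"repertoire_id": "4357957907784536551-242ac11c-0001-012"}

full_id = qualified_repertoire_id(
    metadata["repository_url"], rearrangement["repertoire_id"]
)
```

The fully qualified form is only built when something actually needs it, so the rearrangement table does not repeat the prefix on every row.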

@javh
Contributor

javh commented Feb 28, 2020

I think it's too late to rename sequence_id to rearrangement_id, but maybe not too late to adjust the definition of sequence_id to be unique within the appropriate context (so it can be a UUID if needed). That's already how sequence_id is supposed to work - we just didn't have the bigger Repertoire context when we defined it. And maybe not too late to add sequence_name/description as an optional field to store non-unique sequence identifiers.

@schristley
Member

I'm fine with eliminating rearrangement_id and keeping sequence_id. The question is, is it okay if a data repository overwrites sequence_id with its own value, thus breaking the link with a sequence record in the raw sequencing files?

@javh
Contributor

javh commented Feb 28, 2020

I think so. Any time you collapse reads you break that association anyway, so that is already happening if you merge reads by removing duplicates, building UMI consensus sequences, or aggregating clonotypes.

You also need to uniquify sequence_id just to concatenate TSV files from different samples. I don't see database import as any different.

@scharch
Contributor

scharch commented Feb 28, 2020

Similar to @javh's comment: SONAR allows multiple fasta/fastq files as input and therefore always automatically renames every sequence with a simple serial number to avoid possible duplicates. For longitudinal (multiple sample) analysis, it adds a sample-designating prefix to the serial number. I do preserve "source_file" and "source_id" as custom columns in the rearrangements TSV, but those presumably aren't useful in the context of uploading to a repository, anyway.
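Roughly, the renaming described above could look like the sketch below. This is not SONAR's actual code or its naming format: the function name, the zero-padded serial format, and the sample labels are hypothetical. Only the idea (sample prefix plus serial number, with the original id kept in a custom source_id column) comes from the description.

```python
def uniquify(samples):
    """Yield rearrangement rows with sample-prefixed serial sequence_ids.

    `samples` maps a sample label to a list of row dicts. The original
    identifier is preserved in a custom `source_id` column so the link
    back to the input reads is not lost.
    """
    for label, rows in samples.items():
        for serial, row in enumerate(rows, start=1):
            new_row = dict(row)  # copy so the input rows are not modified
            new_row["source_id"] = row["sequence_id"]
            new_row["sequence_id"] = f"{label}_{serial:08d}"
            yield new_row


# Two samples that both contain a read called "read1".
samples = {
    "week1": [{"sequence_id": "read1"}, {"sequence_id": "read2"}],
    "week2": [{"sequence_id": "read1"}],
}
merged = list(uniquify(samples))
ids = [row["sequence_id"] for row in merged]
```

After this, the rows from both samples can be concatenated into one TSV without sequence_id collisions.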

@schristley
Member

schristley commented Mar 11, 2020

I was looking through the GDC documentation to see if they had any discussion about the uniqueness of their id fields. I didn't find any discussion, but I did find this statement:

All objects (entities) in the GDC are assigned a unique identifier in the form of a version 4 universally unique identifier (UUID). The UUID uniquely identifies the entity in the GDC, and is stored in the entity's id property.

As GDC is a centralized repository, versus distributed like ADC, it isn't clear why they felt they needed UUIDs when they could have gone with something simpler.
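For reference, minting a version 4 UUID needs no central counter or registry, so the same approach would carry over to independent repositories. A minimal sketch using Python's standard uuid module (the helper name is made up, and this says nothing about how GDC actually generates theirs):

```python
import uuid


def new_object_id():
    """Return a random (version 4) UUID as a 36-character string."""
    return str(uuid.uuid4())


# Two ids minted independently, with no shared state between the calls.
a = new_object_id()
b = new_object_id()
```

The cost is the one already raised in this thread: each id is 36 characters, which adds up when it is repeated on every rearrangement record.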

@bcorrie
Contributor Author

bcorrie commented Apr 8, 2020

In the interest of closing off issues related to AIRR v1.3 and ADC v1.0, where do we think we are at with this issue?

We have #347 for the more global discussion of PID/DOI for repository objects, and I think we agree that that can be deferred until post v1.3/v1.0 (#347 (comment))

We have closed off the sequence_id/rearrangement_id discussion in #340

Discussions around _id fields in Cell and Clone are considered in #320 and #317

And I think we all agree that in general _id fields need to be unique in the context in which they are considered (as per @javh comment here #246 (comment)).

Even though we haven't gone the full UUID and PID/DOI route yet on any of these, we haven't excluded them...

So, my question is, given the above, can we close this issue off?

@javh
Contributor

javh commented Apr 8, 2020

I think so.

@bcorrie
Contributor Author

bcorrie commented Apr 16, 2020

OK, I am officially closing... 8-)
