Uniqueness of _id fields in airr_schema.yaml #246
The global uniqueness for http://docs.airr-community.org/en/metadata-docs/datarep/metadata.html but you are right that most of the other _ids aren't well specified. |
I'm not sure this is true, where does it say that? |
Yeah, they don't seem well documented. For Ie, if we change the wording of
To:
Does that address the concern? Do we need to specify within the same |
It is not stated explicitly for MiAIRR in general, but the NCBI implementation requires mapping of |
Okay, right, so technically it is unique within a data repository and it could (potentially) be globally unique if those repositories have ids that don't conflict. |
We have this statement:
Assuming "study registry" is an INSDC repository, then I think we have uniqueness don't we? |
OK, I had missed that... I think that this should probably be mentioned in the "description" of those fields in the spec, no? I have added some of this to the repertoire_id "description" in the spec file. This is quite an important link between the two API entry points, so I think it should be clear... |
Yeah, that's fine for now. In #219, I mention to @bussec about some fields having really long descriptions and that kinda makes the table look not so great, http://docs.airr-community.org/en/metadata-docs/datarep/metadata.html#repertoire-fields |
So what about this case? From a repository optimization perspective, it would be VERY useful to have data_processing_id be unique at the same level as repertoire_id (unique within a repository). When one queries at the rearrangement level for the set of rearrangements, it would be nice to be able to query directly for just the rearrangements for a repertoire processed with a specific tool (a specific data_processing_id). In fact, I would argue that the most common rearrangement queries would be at the data_processing_id level, and one would rarely be searching rearrangement data for a specific repertoire_id as a single set of data with different data_processing applied (e.g. MiXCR and igblast with the annotations not separated by data_processing_id). It is more likely you would be asking for a single data_processing_id from within each repertoire that you are interested in. For example, I think common rearrangement query scenarios would be, for a specific set of repertoire_ids that I am interested in:
These are all building lists of data_processing_ids to search on, and you almost always want to be using a single data_processing object from a repertoire (correct me if I am wrong). The main time you wouldn't want to have one data_processing_id is if you were comparing between data_processing_ids within a single repertoire (comparing the results of MiXCR vs igblast). Even in this case, you would want to split the rearrangement data between the data_processing_ids so you could separate the MiXCR and igblast data for comparison. In the cases where there is only one data_processing object, we state that one should use a repertoire_id rather than a data_processing_id. I think this could get quite cumbersome, as then you are generating queries that have a mix of repertoire_id (if there is only one data_processing object in the repertoire) and data_processing_id (if there is more than one data_processing object in the repertoire). In most cases it seems to me that using data_processing_id rather than repertoire_id will be the rearrangement query of choice. If that is true, we want to optimize our searches at least as well for data_processing_id as we do for repertoire_id. Having data_processing_id be unique at the repository level would help enormously with this... |
man, that's a lot of words, do I really need to read all that? Did you just have a shot of espresso? ;-D I'm not against Just remember that a From an implementation perspective, the |
Yeah, sorry, I was challenged to try and capture the problem clearly 8-) |
Same for iReceptor, and this seems to be really useful, and was one of the drivers for my question. In addition, it looks like the iReceptor Gateway will be extracting data_processing_id from Repertoires and generating rearrangement queries using data_processing_id and NOT repertoire_id. Given the above, it seems to me that there are good reasons to make it a "unique within repository" id and not too many against... |
Would these two data_processing objects (one for B cells and one for T cells) be in the same Repertoire in your API response? Would it be possible for you to generate an example /airr/v1/repertoire response for a single repertoire that would have this structure. I think we understand what this would look like, but having a concrete example for us to work with from a Gateway presentation layer would be very helpful!!! As far as I have seen, the repertoire responses on the docs pages only have a single data_processing object for each repertoire. |
No.
look at the florian example data: https://github.com/airr-community/airr-standards/blob/master/lang/python/examples/florian.airr.yaml or the test data set as I've enhanced it somewhat: https://github.com/airr-community/adc-api-tests/blob/master/datasets/florian/florian.airr.yaml |
I think my main point in my rambling above was that it seemed to me that one would almost never do a search at the rearrangement level for a repertoire_id EXCEPT in the case where there was only one data_processing object. The reason for this is that if any Repertoire has more than one data_processing object, when looking for rearrangements for that Repertoire you are almost always going to want to be explicit about which rearrangements you are retrieving (how they were processed and therefore which data_processing_id), otherwise the rearrangements returned will be very confusing! In my examples above where a Repertoire has more than one data_processing object, you would almost always want either the rearrangements from the "primary_annotation" or the rearrangements that have been processed in a specific way (e.g. by an explicit tool such as MiXCR). If you have to search by data_processing_id for some rearrangements from some Repertoires, then it makes sense to be consistent and always search for data_processing_id even when there is only one data_processing object. |
OK... Too bad in a way, as we are looking for a concrete example where this would occur in a study... Currently, as far as I know, all of our data (meaning IPA and VDJServer) has Repertoires with single sample and single data_processing objects. This is easy... The iReceptor Gateway has to handle the situation when a Repertoire can have either an array of sample objects or an array of data processing objects (or both), and it is very unclear to us when this would occur, how this should be presented to the user, and how queries about the rearrangements in such a Repertoire should be generated. |
Incorrect, you will almost ALWAYS want to use a repertoire_id AND a data_processing_id to get the rearrangements that you want. It's only in the special case when the repertoire has just a single data_processing that you can leave data_processing_id out.
You are latching onto the scenario of multiple data_processing objects, I agree with all your points about that scenario. But in that scenario, you seem to be indicating that the repertoire_id is not relevant, and that's incorrect. So here is a contrived example: Given a study that has 10 repertoires. 5 healthy control repertoires and 5 cancer repertoires. They all have a single data_processing object. Now a user comes along, they do a query for all healthy repertoires, they get those 5 out of 10 repertoires from that study (plus presumably repertoires from other studies). Now if you do a query on the rearrangements using ONLY the data_processing_id, you will get rearrangements for all 10 repertoires, which is wrong. The only way to get the correct rearrangements is to query on those 5 repertoire_ids AND the data_processing_id. So the repertoire_id is always needed when querying the rearrangements, that's how the API was designed! This is regardless of whether the data_processing_id is unique or not. The uniqueness doesn't guarantee that you get the proper repertoires. |
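The contrived example above can be sketched in a few lines. This is only an illustration over in-memory records, not ADC API code: the IDs (`rep1` to `rep10`, `dp1`) and both helper functions are made up for the sketch.

```python
# Hypothetical rearrangement records: ten repertoires from one study that
# all share the same data_processing_id (every ID here is made up).
rearrangements = [
    {"rearrangement_id": f"r{i}", "repertoire_id": f"rep{i}", "data_processing_id": "dp1"}
    for i in range(1, 11)
]

# Suppose the metadata query returned only the five healthy repertoires.
healthy_repertoire_ids = {"rep1", "rep2", "rep3", "rep4", "rep5"}


def by_data_processing(records, data_processing_id):
    """Filter on data_processing_id alone."""
    return [r for r in records if r["data_processing_id"] == data_processing_id]


def by_repertoire_and_data_processing(records, repertoire_ids, data_processing_id):
    """Filter on repertoire_id AND data_processing_id."""
    return [
        r
        for r in records
        if r["repertoire_id"] in repertoire_ids
        and r["data_processing_id"] == data_processing_id
    ]


too_many = by_data_processing(rearrangements, "dp1")
wanted = by_repertoire_and_data_processing(rearrangements, healthy_repertoire_ids, "dp1")
print(len(too_many), len(wanted))  # 10 5
```

Filtering on `data_processing_id` alone returns rearrangements from all ten repertoires, while adding the `repertoire_id` condition narrows the result to the five healthy ones, whether or not `dp1` is unique in the repository.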
That's assuming the standard workflow where you query metadata first to get a list of repertoires, then query rearrangements. Of course, you can also go the other way and query rearrangements first to get a list of repertoires, then lookup their metadata, like if doing a straight CDR3 search. |
The array of sample objects is useful for display/query purposes on the repertoire metadata, but becomes irrelevant when querying rearrangements because those samples all collapse into a single repertoire_id. The array of data processing objects is relevant, and needs to be handled because, in general, when you query a bunch of studies, they are all going to have different data processing. So how are the users going to decide which ones they want?? This gets to one of the fundamental questions we've been debating in iR+, if everything is processed differently... |
Do you mean that there is:
In this case, all the rearrangements in this study also have the same data_processing_id. Correct??? 8-) |
Yes to all. I kept it simple. Did you understand my point?
Unless you are going to be pedantic and say "you don't need AND data_processing_id in that case because there is only one", then I would say yes, yes, that isn't the point I was trying to get across. |
Yes, but I think this is where my confusion originally stemmed from and is similar to the reason why I was suggesting that we should change it so data_processing_id be unique to the repository. The uniqueness criteria of these _id fields are still very fuzzy. Your example above, as I described it, uses a single data_processing_id to be referred to by several independent repertoires, which requires a data_processing_id that is unique across the repository. The current spec/docs do not allow for this. It doesn't stop you using the same data_processing_id for multiple repertoires, but it doesn't enforce the fact that they are the same nor does it restrict another repertoire from reusing the same data_processing_id for a completely different data_processing process (http://docs.airr-community.org/en/metadata-docs/datarep/metadata.html):
With our current definition of requiring a data_processing_id to be unique within a repertoire, your example above works because it is the repertoire_id, data_processing_id pair that is unique. The fact that the data_processing_id is the same across them all doesn't really have an impact. If this is the case, the argument for making it unique across the repository is probably not that important... I think what I was looking for in suggesting uniqueness for data_processing_id was a unique repository-wide identifier for each repertoire_id, data_processing_id pair. I was looking for a single _id that I could use to get all of the rearrangements for a specific repertoire and a specific data processing as applied to that repertoire VERY efficiently. As you say, that is not what a data_processing_id is! In hindsight, I think it best to leave that as an internal repository optimization if desired/required. A repository can implement having unique data_processing_ids (I think VDJServer does/will). A specific researcher could build a single data_processing object and reuse it. And a repository could create an internal compound index on repertoire_id, data_processing_id to optimize rearrangement lookups. I don't think the spec and the API are the places to enforce any of these. Maybe we don't need to change how data_processing_id is defined. |
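As a rough sketch of the internal compound index mentioned above: a repository could key its rearrangement lookups on the (repertoire_id, data_processing_id) pair without any change to the spec. The dictionary below stands in for whatever index the repository's database would provide; `build_pair_index` and all IDs are hypothetical.

```python
from collections import defaultdict


def build_pair_index(rearrangements):
    """Map each (repertoire_id, data_processing_id) pair to its rearrangement_ids."""
    index = defaultdict(list)
    for record in rearrangements:
        key = (record["repertoire_id"], record["data_processing_id"])
        index[key].append(record["rearrangement_id"])
    return index


# Made-up records: dp1 is reused by repA and repB, and repA has two data_processing objects.
records = [
    {"rearrangement_id": "r1", "repertoire_id": "repA", "data_processing_id": "dp1"},
    {"rearrangement_id": "r2", "repertoire_id": "repA", "data_processing_id": "dp2"},
    {"rearrangement_id": "r3", "repertoire_id": "repB", "data_processing_id": "dp1"},
    {"rearrangement_id": "r4", "repertoire_id": "repA", "data_processing_id": "dp1"},
]

pair_index = build_pair_index(records)
print(pair_index[("repA", "dp1")])  # ['r1', 'r4']
```

Because the key is the pair, reusing `dp1` in another repertoire does not cause a collision, which is why the within-repertoire uniqueness rule is enough for this lookup.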
Can you give me a concrete example of how this would be used? I don't follow the use case of when you would have multiple samples in a Repertoire and how one would map rearrangements to that repertoire... I understand the use case of multiple data_processing objects in a Repertoire, but not the multiple sample objects in a Repertoire. |
Okay, good, I was having difficulty coming up with a clear example to explain that it was immaterial whether data_processing_id was unique to the repository or not. |
Correct, and that is the case for the other _ids as well: clone_id, cell_id and pair_id. In some sense, rearrangement_id could be like those as well but because we have an explicit API entrypoint for it, it needs to be unique at the repository level. |
It's a contrived example though not completely crazy. Let's say a study with one subject where the patient goes through a treatment. Initially a single blood draw which is sequenced and becomes a single pre-treatment sample. So at this point, we have a single repertoire with a single sample. Some time later the patient is treated, and at that time another blood draw is taken, but also a tissue sample is taken, both are sequenced. In particular, the tissue sample has a Now the researcher wants to analyze all three samples together, say to extract common clones, so creates a single repertoire object with three samples. Very concisely the repertoire looks like this:
The study is published, the data is made public. Now somebody comes along and does a query for repertoire with cancer samples, something like this:
So I hope you agree that this repertoire will show up in the query results. Now if that person looks at the repertoire, the UI will show them it has three samples, and they look at them in detail and say oh, it's two blood samples and one tissue sample combined together for analysis. Then they make some decision on whether they want to use the rearrangements from that repertoire or not. If they do, then they query the rearrangement entrypoint with the repertoire_id and the data_processing_id. Is this an example you are looking for? |
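For the last step, a request body for the rearrangement entrypoint might be built roughly as follows. This is a sketch that assumes the ADC API's `op`/`content`/`field`/`value` filter layout; the helper name and the two ID values are hypothetical, and the exact operators should be checked against the ADC API documentation.

```python
import json


def rearrangement_filter(repertoire_id, data_processing_id):
    """Build a filter matching one repertoire_id AND one data_processing_id.

    Assumes the op/content/field/value filter layout; verify against the API docs.
    """
    return {
        "filters": {
            "op": "and",
            "content": [
                {"op": "=", "content": {"field": "repertoire_id", "value": repertoire_id}},
                {"op": "=", "content": {"field": "data_processing_id", "value": data_processing_id}},
            ],
        }
    }


# Made-up identifiers for illustration only.
query = rearrangement_filter("rep-123", "dp-1")
print(json.dumps(query, indent=2))
```

Note that nothing in such a filter can separate the three samples: every rearrangement in the repertoire carries the same repertoire_id and data_processing_id.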
Yes, that is great... thanks... We are trying to determine what level of data should appear on what we used to call our "samples" page and is now called our "repertoire" page. Currently, we display samples on the repertoire page... This is fine at the moment, because all of our data has one sample per repertoire and one data processing per repertoire. But there are a bunch of different ways you could handle that in the general schema case...
Once we get that sorted, we then need to figure out what to do to get the rearrangements for the entities that you decide you are interested in from the above list. Essentially, we need to generate a query with repertoire_ids and data_processing_ids. I think I can see how you would do most combinations above, but... Let's say I am a researcher and I want data from blood samples where the disease state of the sample is cancer and I want post-treatment data only. So I only want the rearrangements from one of the samples in the Repertoire. In this example, I don't see a way to do that by querying the rearrangements API entry point... Even a repertoire_id/data_processing_id pair does not allow me to differentiate the rearrangements between the samples, so I can't get just the rearrangements from the blood post-treatment sample... |
Correct, in general that is not possible. That's all the gory details from #181 (if you re-read that, change the old "rearrangement set" terminology to "data_processing_id") |
Nooooooooooo, not #181 Maybe this is why I have been so confused... In jumping to the end of #181 we discuss having a sample_processing_id, and for each rearrangement I suggested having "... (three identifiers, RepertoireID, SampleProcessingID, and SoftwareProcessingID)" This seemed to have pretty general consensus. In our current spec we have repertoire_id (RepertoireID) and data_processing_id (SoftwareProcessingID). What happened to SampleProcessingID? At the end of the issue you mentioned having a sample_processing_id that would be sufficient for many cases, but we don't have that in our current spec? I think this needs to be added, no? |
I think the problem is we are combining multiple roles for repertoire_id. If we need to differentiate such things in an API response, I would prefer to have a separate field in the ADC API response rather than conflate the repertoire_id to capture two different concepts. In the model you are suggesting you are combining the bioinformatic concept of Repertoire with the technology concept Repository. This seems very messy to me... The ADC API could just as easily have a separate field in the response that provided this information that looked something like this:
and
I don't think we want repositories and API responses changing fields in the specification, in particular changing fields that might be provided by a researcher. For example, think of this from a DataRep perspective. I, as a researcher, want to use AIRR Repertoire JSON and Rearrangement TSV files to document a study (much like you have done for the Florian study). I want to use standards to document my study in an AIRR-compliant way, in particular so I can use AIRR-compliant tools to process my data. I manually choose repertoire_id names that are meaningful to me as a researcher. They are unique in my study, and allow me to map rearrangements in my Rearrangement TSV files to my repertoire metadata in my Repertoire JSON file. Using the AIRR formats in this use case scenario doesn't require any change from a researcher. In fact, they can go from this simple use case all the way to loading the data into an ADC repository and operating on federated data transparently, without any of the Repertoire metadata needing to change. The only change required to be able to work on federated data globally is the addition of another field. In fact, if we really wanted to do this right, we would have a DOI for each AIRR Repository (make that a condition of being AIRR compliant) and then we could have:
|
Yeah, after #320 I've started thinking this route too. Though my thought was to provide a DOI for the repertoire versus a DOI for the repository
I'm not sure which is better. The important thing is that the fully qualified URL is available or can be constructed (we would need to document exactly how to do that). Regardless, we still haven't resolved the issue that repertoires downloaded from two different repositories may have We should discuss this in the CRWG meeting tomorrow and see if we can come to a solution. |
That is a lot of DOIs 8-) |
My thought is that it is OK for repertoire_ids to conflict if we have another field for the AIRR Data Commons that makes a repertoire unique "globally" (at least unique in the ADC). repertoire_id is part of the informatic data model (a "DataRep" thing) and is something you need to make a study describable using the AIRR Standards. In this case, you don't need something globally unique. If you are working at the AIRR Data Commons level and federating data from all over the place, then repertoire_doi (or whatever we call it) is the ADC thing that is necessary for the ADC to work. My concern is overloading one field to serve both purposes... |
haha true! Though I was meaning DOI in the general context of a digital object identifier and not the doi.org service... So actually then
That wasn't my intent. I was just suggesting a scheme to construct a global identifier, similar to how SRA and ENA co-exist. Both accept raw sequence data, SRA prefixes its identifiers with SRP while ENA prefixes with ERP.
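A prefix scheme of this kind could look roughly like the sketch below. The repository prefixes (`VDJ`, `IPA`), the `:` separator, and both helper functions are purely hypothetical; nothing like this is defined in the spec.

```python
def qualify(repository_prefix, repertoire_id):
    """Combine a per-repository prefix with a locally unique repertoire_id."""
    return f"{repository_prefix}:{repertoire_id}"


def split_qualified(global_id):
    """Recover the (prefix, local repertoire_id) pair from a qualified identifier."""
    prefix, _, local_id = global_id.partition(":")
    return prefix, local_id


# Two repositories that happen to use the same local repertoire_id.
first = qualify("VDJ", "42")
second = qualify("IPA", "42")
print(first, second)  # VDJ:42 IPA:42
```

The local IDs still collide, but the qualified forms do not, as long as each repository's prefix is itself unique.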
Just to be clear, Now did we make an initial mistake with Now IEDB takes the two field approach.
but their data size is much smaller and they are a centralized database. We need to think a little more carefully about our distributed system as well as the data size. If every rearrangement record has those two fields, that seems less than ideal. |
All of the If you want some sort of descriptive, but not unique, identifier I think it'd be better to add fields for that purpose like those that exist in Study, rather than more identifier fields. Eg, |
@javh Is it too late to fix this? I don't like it either, but now really is the time to fix it, as it's only going to cause more hassles as we add objects that reference the rearrangements ( |
For me it's mostly wasted space in the rearrangement table, having |
I think it's too late to rename |
I'm fine with eliminating |
I think so. Any time you collapse reads you break that association anyway, so that is already happening if you merge reads by removing duplicates, building UMI consensus sequences, or aggregating clonotypes. You also need to uniquify |
Similar to @javh's comment: SONAR allows multiple fasta/fastq files as input and therefore always automatically renames every sequence with a simple serial number to avoid possible duplicates. For longitudinal (multiple sample) analysis, it adds a sample-designating prefix to the serial number. I do preserve "source_file" and "source_id" as custom columns in the rearrangements TSV, but those presumably aren't useful in the context of uploading to a repository, anyway. |
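The renaming approach described here can be sketched as follows. This is not SONAR's actual code: the function name, the zero-padded serial format, and the `-` separator are assumptions made for illustration, with the original identifier kept in a `source_id` column as described above.

```python
def rename_sequences(source_ids, sample_prefix=None, width=8):
    """Assign serial-number sequence_ids, optionally prefixed by a sample label."""
    renamed = []
    for serial, source_id in enumerate(source_ids, start=1):
        new_id = f"{serial:0{width}d}"
        if sample_prefix:
            new_id = f"{sample_prefix}-{new_id}"
        # Keep the original identifier so the read can still be traced back.
        renamed.append({"sequence_id": new_id, "source_id": source_id})
    return renamed


# Two hypothetical time points whose input files reuse the same read names.
week0 = rename_sequences(["read_1", "read_2"], sample_prefix="wk0", width=4)
week8 = rename_sequences(["read_1", "read_2"], sample_prefix="wk8", width=4)
all_ids = [r["sequence_id"] for r in week0 + week8]
print(all_ids)  # ['wk0-0001', 'wk0-0002', 'wk8-0001', 'wk8-0002']
```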
I was looking through the GDC documentation to see if they had any discussion about the uniqueness of their id fields. I didn't find any discussion but I did find this statement
As GDC is a centralized repository, versus distributed like ADC, it isn't clear why they felt they needed UUIDs when they could have gone with something simpler. |
In the interest of closing off issues related to AIRR v1.3 and ADC v1.0, where do we think we are at with this issue? We have #347 for the more global discussion of PID/DOI for repository objects, and I think we agree that that can be deferred until post v1.3/v1.0 (#347 (comment)). We have closed off the sequence_id/rearrangement_id discussion in #340. Discussions around _id fields in Cell and Clone are considered in #320 and #317. And I think we all agree that in general _id fields need to be unique in the context in which they are considered (as per @javh comment here #246 (comment)). Even though we haven't gone the full UUID and PID/DOI route yet on any of these, we haven't excluded them... So, my question is, given the above, can we close this issue off? |
I think so. |
OK, I am officially closing... 8-) |
@schristley I was looking at several of the _id fields in the schema, and I note in the descriptions we do not mention uniqueness criteria for many (any?) of them. I think this is a problem, isn't it??? Am I missing something?
If I go to the rearrangement level, we have several _ids (pair_id, clone_id, cell_id, rearrangement_id, repertoire_id, and data_processing_id). We don't advise or specify at what level something like a clone_id is unique... Or even a repertoire_id or data_processing_id. MiAIRR specifies that study IDs should be unique (typically an INSDC study related identifier) with subject_ids and sample_ids unique within studies.
It is not well defined what the relationship between _ids is from this level down (pair_id, clone_id, cell_id, rearrangement_id, repertoire_id, and data_processing_id)
One can probably infer (if you know the AIRR spec well) that repertoire_id should be unique at least within a study, maybe a subject. The reality is that the repertoire_id should be unique at the repository level (as they are the IDs returned by the /repertoire API endpoint), but that isn't actually stated in the spec unless I am missing something...
data_processing_id should be unique within a repertoire_id at least. It feels like data_processing_id should be unique at the repository level as well, so you can easily identify a set of rearrangements that have been processed with the same data_processing without having to do a combined repertoire_id x data_processing_id query, but again, nothing is explicitly stated in the spec.
pair_id, clone_id, and cell_id should probably be unique at least within a repertoire_id/data_processing_id pair. If data_processing_id is unique within the repository, then it is sufficient to say unique within the data_processing_id.
Finally, rearrangement_id should be unique to the repository as well, that is it is the internal identifier for the repository for a single rearrangement entry. This is the only one that states anything about uniqueness at the moment.
Should we review this?
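To make two of the candidate rules concrete, here is a minimal sketch of how they could be checked over repertoire metadata: repertoire_id unique across the repository, and data_processing_id unique within each repertoire. The rules are the ones proposed in this issue, not something the spec states, and the helper functions and sample data are hypothetical.

```python
from collections import Counter


def duplicate_ids(ids):
    """Return the sorted IDs that occur more than once."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)


def check_repository(repertoires):
    """Report repertoire_id clashes across the repository and
    data_processing_id clashes within each repertoire."""
    problems = {}
    clashes = duplicate_ids(r["repertoire_id"] for r in repertoires)
    if clashes:
        problems["repertoire_id"] = clashes
    for rep in repertoires:
        dp_clashes = duplicate_ids(
            dp["data_processing_id"] for dp in rep.get("data_processing", [])
        )
        if dp_clashes:
            problems[f"data_processing_id in {rep['repertoire_id']}"] = dp_clashes
    return problems


# Made-up metadata: repA repeats dp1 internally; repB reusing dp1 is allowed
# under the "unique within a repertoire" rule.
repertoires = [
    {"repertoire_id": "repA", "data_processing": [
        {"data_processing_id": "dp1"}, {"data_processing_id": "dp1"}]},
    {"repertoire_id": "repB", "data_processing": [{"data_processing_id": "dp1"}]},
]

problems = check_repository(repertoires)
print(problems)  # {'data_processing_id in repA': ['dp1']}
```

Switching to repository-wide uniqueness for data_processing_id would only change the scope of the second check, from per repertoire to the whole list.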