Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disentangling assays and rowData #18

Open
lgeistlinger opened this issue May 23, 2018 · 8 comments
Open

Disentangling assays and rowData #18

lgeistlinger opened this issue May 23, 2018 · 8 comments

Comments

@lgeistlinger
Copy link

Following up on previous comments (#17):

RaggedExperiment nicely pretends to be a SummarizedExperiment with regard to assays, rowRanges, and colData.
However, RaggedExperiment does not seem to support a concept such as rowData for annotation to its rowRanges.

Essentially when you do mcols<- or rowData<-, you're adding to "number of assays". Since each column in mcols is an "assay".

I wonder whether this can be modified.
When constructing a RaggedExperiment from a GRangesList, would it be possible to distinguish mcols that become assays and mcols that are annotation?
(Currently this turns even non-numeric mcols into assays)

General motivation for being able to annotate additional row information is as for a SummarizedExperiment - however, in my specific use case I'm working with CNV calls where I only want the copy number to become an assay, while I want to keep additional columns as annotation to each individual CNV call (such as probe identifiers, quality measures, etc).

@LiNk-NY
Copy link
Collaborator

LiNk-NY commented Jun 15, 2018

We could add another structure that will indicate what columns are to be converted to assays and what columns will be shown in rowData.
Any thoughts on this @mtmorgan ?

@mtmorgan
Copy link
Contributor

Can this issue be updated with an illustrative example and desired output? The verbal description is hard to follow and I do not know whether the example in the original issue is still relevant.

@lgeistlinger
Copy link
Author

lgeistlinger commented Jun 17, 2018

Let's consider a typical output from SNP-based CNV callers stored as a data.frame named my.cnv.calls:

> my.cnv.calls
chr  start  end     sample.id  state   num.probes  start.probe   end.probe
chr1 16947  45013   NE001423   3       53          AX-100224358  AX-100363929
chr1 36337  67130   NE001426   1       77          AX-100796878  AX-100422118
chr1 16947  36337   NE001428   3       34          AX-100224358  AX-100796878
chr1 36338  105963  NE001428   0       123         AX-100796878  AX-100828216
chr1 36337  83412   NE001534   3       101         AX-100796878  AX-100765659
chr1 36337  83412   NE001648   3       101         AX-100796878  AX-100765659

We group the calls by sample ID, resulting in a GRangesList.
Each element of the list corresponds to a sample, and contains the genomic coordinates of the CNV calls for this sample (along with the copy number state in the state metadata column).

> grl <- makeGRangesListFromDataFrame(my.cnv.calls,
    split.field="sample.id", keep.extra.columns=TRUE)
> grl
## GRangesList object of length 5:
## $NE001423
## GRanges object with 1 range and 4 metadata columns:
##        seqnames  ranges      strand |     state  num.probes  start.probe  end.probe
##           <Rle>  <IRanges>    <Rle> | <integer>  <integer>   <character>  <character>
##    [1]     chr1  16947-45013      * |         3       53     AX-100224358 AX-100363929 
##
## ...
## <4 more elements>

Let's create some colData for the samples:

age <- sample(20:80, 5)
sex <- sample(c("M", "F"), 5, replace=TRUE)
cdat <- DataFrame(age=age, sex=sex)

Now when creating a RaggedExperiment from the GRangesList, all mcols become assays

> ra <- RaggedExperiment(grl, colData=cdat)
> ra
class: RaggedExperiment 
dim: 6 5 
assays(4): state num.probes start.probe end.probe
rownames: NULL
colnames(5): NE001423 NE001426 NE001428 NE001534 NE001648
colData names(2): age sex 

However, I would like to determine which mcols become assays (and which are annotation
and rather become rowData).

Desired output:

> ra <- RaggedExperiment(grl, colData=cdat, assay.columns="state")
> ra
class: RaggedExperiment 
dim: 6 5 
assays(1): state 
rownames: NULL
colnames(5): NE001423 NE001426 NE001428 NE001534 NE001648
colData names(2): age sex 
rowData names(3): num.probes start.probe end.probe

@mtmorgan
Copy link
Contributor

I don't want to support this. The implication is either that the software is 'smart' enough to fill rowData across ranges that are present in some but not all samples, or that the data is not actually ragged after all and all rows are present in all samples. It seems too arbitrary to know how to solve the first case, and the second case does not represent a ragged experiment.

@lgeistlinger
Copy link
Author

There is no need to fill rowData. Upon construction of a RaggedExperiment common or intersecting ranges across samples are not resolved. As in sparseAssay, the number of rows corresponds to the total number of ranges across all samples with duplicates being allowed. As such, there is a clear 1:1 correspondence between the rows of the assay and the rowData.

@lwaldron
Copy link

@lgeistlinger your desired specifies what the show method would return, but how would the proposed rowData be used in the API? @mtmorgan, I also don't understand why 'smart'-ness would be required.

@lgeistlinger
Copy link
Author

Some instances:

  1. filter all ranges that are not satisfying quality measures annotated in rowData,
    e.g. filter all cnv calls not supported by a sufficient number of probes:
ind <- rowData(ra)$num.probes > 10
ra <- ra[ind,] 
  1. extracting all ranges of a particular type annotated in rowData, e.g. subset the
    cnv calls that are an amplification
ind <- rowData(ra)$type == "amplification"
ra <- ra[ind,]
  1. two-level queries, e.g. get data associated with the start and the end probe
    of the cnv calls:
gt <- my.snp.data[rowData(ra)$start.probe, "genotype"]
# now do a GWAS

That is useful for approaches that resolve CNVs on the probe level for CNV-phenotype
association analysis as e.g. done here: http://zzz.bwh.harvard.edu/plink/cnv.shtml

@lgeistlinger
Copy link
Author

Some more thoughts:

The full-blown RaggedExperiment is sparse and of short-living nature.
Most of the times, either disjoinAssay or qreduceAssay will be used shortly
after creation of the object to summarize assay values in overlapping ranges across
samples.
Subsequent analysis will then mostly focus on the resulting summarized matrices.

Imho, having rowData adds to the usefulness of the representation and, in effect,
extends the object's life span (meaning here the time it's actively used within the analysis).

The two main aspects to this are:

  1. preprocessing steps on the ranges such as filtering, subsetting, etc. based on criteria
    annotated in the rowData.

  2. resolving particular ranges in more detail based on criteria annotated in
    the rowData - often summarization as carried out via simplifyDisjoin and
    simplifyReduce represents a compromise.

Once you have found a number of summarized ranges to be particularly interesting
(eg associated with a phenotype) you might want to go back and resolve individual
ranges within the summarized ranges in more detail to study the impact of the
summarization.

As an example, we had doubt about a particular CNV in this study:

https://www.nature.com/articles/s41598-018-19782-4

Based on SNP data, the region was inferred to be completely deleted (0 copies).
Contradictively, we found a gene contained in this region to be highly expressed.

We thus went back to the individual CNV calls and resolved them on SNP level to
demonstrate that no SNP contributing to the deletion calls locates directly on the
gene, see Figure S3 in here:

https://static-content.springer.com/esm/art%3A10.1038%2Fs41598-018-19782-4/MediaObjects/41598_2018_19782_MOESM1_ESM.pdf

Long story short, often enough you are having characteristics describing your
ranges in your full-blown RaggedExperiment that you would like to investigate
in more detail and the rowData is the typical default for storing these
characteristics given the SummarizedExperiment paradigm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants