Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error with function cnvOncoPrint using a colorectal cancer dataset of 34 WES samples #34

Open
Jasonmbg opened this issue Oct 27, 2021 · 2 comments

Comments

@Jasonmbg
Copy link

Dear @lgeistlinger,

moving from email where I tried to reach you for a specific error appeared when trying to implement CNVRanger for 34 WES samples from colorectal cancer patients:

briefly, based on an international collaborative project between DKFZ and other Greek research institutions regarding personalized medicine in cancer (http://www.accc.gr/aboutACCC_info.html), we have acquired around 34 samples with DNA and RNA which were sequenced and pre-processed in the DKFZ bioinformatics facility. Overall, my major goal is to investigate, if there any specific mutational patterns in any of the 3 separated groups of patients, that might interrelate the presence of specific mutational patterns (KRASonly, BRAFonly and wild type), as the ultimate goal is to investigate the molecular landscape of these 3 defined groups-we also utilize additionally public TCGA data to enhance our sample size-

My major question would be for the actual analysis of the CNV data from our 34 patients-as already copy number alterations were called with cnvkit, and as a further step purity and ploidy estimated; for a start, I tried to analyze all data collectively, based on the segmented log2 rations, as in your vignette, with the following code:

head(DT)
# A tibble: 6 x 5
  file        chromosome    start       end     log2
  <chr>       <chr>         <int>     <int>    <dbl>
1 CRC_ACCC_01 1             12080   3743427  0.191  
2 CRC_ACCC_01 1           3745760  12942340  0.00996
3 CRC_ACCC_01 1          12942900  13802754 -0.278  
4 CRC_ACCC_01 1          13803254  43766223 -0.0250 
5 CRC_ACCC_01 1          43766723  44684459  0.107  
6 CRC_ACCC_01 1          44684588 103345463 -0.0443 

smean <- DT$log2
state <- round(2^smean * 2)
state[state > 4] <- 4
DT$log2 <- state
names(DT)[names(DT) == "log2"] <- "state"
DT <- DT[DT$state != 2,]
head(DT)
# A tibble: 6 x 5
  file        chromosome     start       end state
  <chr>       <chr>          <int>     <int> <dbl>
1 CRC_ACCC_01 1          109778935 109823258     3
2 CRC_ACCC_01 1          152080055 152329130     3
3 CRC_ACCC_01 2              10500  28856096     3
4 CRC_ACCC_01 2           28863705  71742899     3
5 CRC_ACCC_01 2           71743258  74786985     3
6 CRC_ACCC_01 2           74787273  92325671     3

grl <- GenomicRanges::makeGRangesListFromDataFrame(DT, split.field="file", keep.extra.columns=TRUE)

saveRDS(grl, file = "Input.test.grl.CRC.rds")

grl.2 <- populationRanges(grl, density=0.1, est.recur=FALSE)

sel.freq.grl <- subset(grl.2, freq > 1)

library(EnsDb.Hsapiens.v75)
human.hg19.genes <- ensembldb::genes(EnsDb.Hsapiens.v75)

sel.hg19.genes <- subset(human.hg19.genes, gene_biotype == "protein_coding")

olaps <- GenomicRanges::findOverlaps(sel.hg19.genes, sel.freq.grl, ignore.strand=TRUE)

qh <- S4Vectors::queryHits(olaps)

sh <- S4Vectors::subjectHits(olaps)

cgenes <- sel.hg19.genes[qh]

cgenes$type <- sel.freq.grl$type[sh]

However, initially in the following step of trying the visualization through the function cnvOncoPrint:

cnvOncoPrint(sel.freq.grl, cgenes, top.features=15, top.samples = 34)
Error in as(calls, "RaggedExperiment") : 
  no method or default for coercing "GRanges" to "RaggedExperiment"

cnvOncoPrint(grl, cgenes, top.features=15, top.samples = 34)
Following `at` are removed: GAIN1, GAIN2, because no color was defined for them.
Error: Length of `graphics` should be the same as number of labels.

Thus, in your opinion, how this error could be fixed? It is something wrong with any of the input files? For simplicity I have attached also the initial created GRangeslist object;

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] EnsDb.Hsapiens.v75_2.99.0 ensembldb_2.17.4          AnnotationFilter_1.17.1  
 [4] GenomicFeatures_1.45.2    AnnotationDbi_1.55.1      Biobase_2.53.0           
 [7] CNVRanger_1.9.0           RaggedExperiment_1.17.4   GenomicRanges_1.45.0     
[10] GenomeInfoDb_1.29.8       IRanges_2.27.2            S4Vectors_0.31.5         
[13] BiocGenerics_0.39.2       data.table_1.14.2         forcats_0.5.1            
[16] stringr_1.4.0             dplyr_1.0.7               purrr_0.3.4              
[19] readr_2.0.2               tidyr_1.1.4               tibble_3.1.5             
[22] ggplot2_3.3.5             tidyverse_1.3.1

Input.test.grl.CRC.rds.zip

With Kind Regards,

Efstathios

@lgeistlinger
Copy link
Collaborator

Hi Efstathios,

cnvOncoPrint(sel.freq.grl, cgenes, top.features=15, top.samples = 34)

Note that cnvOncoPrint requires the actual CNV calls, it will not work with the summarized CNV regions that you provided here.

cnvOncoPrint(grl, cgenes, top.features=15, top.samples = 34)

This should work now using the new version of the CNVRanger package (version 1.11.2). You can install the new version 1.11.2 of the CNVRanger package in a fresh R session directly from github via

BiocManager::install("waldronlab/CNVRanger")

This works now on my end, when applying cnvOncoPrint on the input Input.test.grl.CRC that you provided.

Hope that helps.

Best,
Ludwig

@Jasonmbg
Copy link
Author

Jasonmbg commented Feb 18, 2022

Dear @lgeistlinger,

initially I would like to thank you for your response !! As I re-install the updated version, I repeated the analysis, but using this time also updated total copy numbers (as in my previous attached object I have included log2 segmented ratios) to improve inference; also for reproducibility, I have also attached the saved rds file of the updated input-using the same cohort:

grl
GRangesList object of length 28:
$ACCC_01_CRC
GRanges object with 54 ranges and 1 metadata column:
       seqnames             ranges strand |     state
          <Rle>          <IRanges>  <Rle> | <integer>
   [1]        2     10500-28856096      * |         3
   [2]        2  28863705-71742899      * |         3
   [3]        2  71743258-74786985      * |         3
   [4]        2  74787273-92325671      * |         3
   [5]        2 95326671-132111157      * |         3
   ...      ...                ...    ... .       ...
  [50]       18     10500-15410398      * |         1
  [51]       18  19348569-78016748      * |         1
  [52]       20     68259-26319069      * |         3
  [53]       20  29420069-60419852      * |         3
  [54]       20  60427797-62959343      * |         3
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths

...
<27 more elements>

grl.2 <- populationRanges(grl, mode="RO", est.recur=TRUE)

grl.2
GRanges object with 191 ranges and 3 metadata columns:
        seqnames              ranges strand |      freq        type    pvalue
           <Rle>           <IRanges>  <Rle> | <numeric> <character> <numeric>
    [1]        1       12080-1640274      * |         3        both  0.640000
    [2]        1    1640337-39340786      * |         3        both  0.927273
    [3]        1 104114727-121484934      * |         3        both  0.715152
    [4]        1 142542801-145366468      * |         3        gain  0.606061
    [5]        1 142542801-249231246      * |         5        gain  0.406061
    ...      ...                 ...    ... .       ...         ...       ...
  [187]        X 153760586-155257641      * |        16        both      0.42
  [188]        Y    2655076-10104053      * |        16        loss      0.00
  [189]        Y    8141456-10104053      * |        16        loss      0.00
  [190]        Y    9149432-10104053      * |        16        loss      0.00
  [191]        Y   13105053-28704244      * |        16        loss      0.03
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths

sel.freq.grl <- subset(grl.2, freq > 3)

sel.freq.grl
GRanges object with 148 ranges and 3 metadata columns:
        seqnames              ranges strand |      freq        type    pvalue
           <Rle>           <IRanges>  <Rle> | <numeric> <character> <numeric>
    [1]        1 142542801-249231246      * |         5        gain  0.406061
    [2]        1 147473475-149761744      * |         5        gain  0.406061
    [3]        2    8822295-71742899      * |         5        both  0.606061
    [4]        2   44565499-90259359      * |         5        both  0.606061
    [5]        2  95326671-117510345      * |         4        gain  0.472727
    ...      ...                 ...    ... .       ...         ...       ...
  [144]        X 153760586-155257641      * |        16        both      0.42
  [145]        Y    2655076-10104053      * |        16        loss      0.00
  [146]        Y    8141456-10104053      * |        16        loss      0.00
  [147]        Y    9149432-10104053      * |        16        loss      0.00
  [148]        Y   13105053-28704244      * |        16        loss      0.03
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths

library(EnsDb.Hsapiens.v75)
human.hg19.genes <- ensembldb::genes(EnsDb.Hsapiens.v75)
sel.hg19.genes <- subset(human.hg19.genes, gene_biotype == "protein_coding")
olaps <- GenomicRanges::findOverlaps(sel.hg19.genes, sel.freq.grl, ignore.strand=TRUE)

qh <- S4Vectors::queryHits(olaps)

sh <- S4Vectors::subjectHits(olaps)

cgenes <- sel.hg19.genes[qh]

cgenes$type <- sel.freq.grl$type[sh]
cgenes$freq <- sel.freq.grl$freq[sh]

 head(cgenes)
GRanges object with 6 ranges and 8 metadata columns:
                  seqnames              ranges strand |         gene_id   gene_name   gene_biotype
                     <Rle>           <IRanges>  <Rle> |     <character> <character>    <character>
  ENSG00000236334        1 143767144-143767881      - | ENSG00000236334     PPIAL4G protein_coding
  ENSG00000215784        1 143896452-143913143      - | ENSG00000215784      FAM72D protein_coding
  ENSG00000162825        1 144146808-144224481      + | ENSG00000162825       NBPF8 protein_coding
  ENSG00000231360        1 144339738-144521058      + | ENSG00000231360  AL592284.1 protein_coding
  ENSG00000255854        1 144363462-144364246      - | ENSG00000255854     PPIAL4B protein_coding
  ENSG00000168614        1 144811744-144830413      + | ENSG00000168614       NBPF9 protein_coding
                  seq_coord_system      symbol                    entrezid        type      freq
                       <character> <character>                      <list> <character> <numeric>
  ENSG00000236334       chromosome     PPIAL4G                      644591        gain         5
  ENSG00000215784       chromosome      FAM72D                      728833        gain         5
  ENSG00000162825       chromosome       NBPF8      101928050,728841,25832        gain         5
  ENSG00000231360       chromosome  AL592284.1            101929383,730256        gain         5
  ENSG00000255854       chromosome     PPIAL4B 100996725,653598,653505,...        gain         5
  ENSG00000168614       chromosome       NBPF9    728912,284565,200030,...        gain         5
  1. Then, when I used the function cnvOncoPrint(grl, cgenes, top.features=15, top.samples = -1); Now it run without an error, but from the attached produced plot, why all the samples are included? The total samples are 28, I used the option -1, but still not all samples are illustrated; is there a rationale for this?

CNVOncoplot.test.tiff.zip

  1. In addition something very important, that is related to the downstream annotation of unique gene symbols/copy number states, to have a final matrix of unique genes x Samples, with copy number states:

ra.grl <- RaggedExperiment::RaggedExperiment(grl) # Start using the initial GRanges object from start with all genomic regions

Then, subset directly to the protein coding genes from above step:

xx2 <- qreduceAssay(ra.grl, cgenes, simplifyReduce = CNVRanger:::.largest)

head(xx2)
                        ACCC_01_CRC ACCC_02_CRC ACCC_03_CRC ACCC_05_CRC ACCC_06_CRC ACCC_08_CRC
1:143767144-143767881:-          NA          NA          NA          NA          NA          NA
1:143896452-143913143:-          NA          NA          NA          NA          NA          NA
1:144146808-144224481:+          NA          NA          NA          NA          NA          NA
1:144339738-144521058:+          NA          NA          NA          NA          NA          NA
1:144363462-144364246:-          NA          NA          NA          NA          NA          NA
1:144811744-144830413:+          NA          NA          NA          NA          NA          NA
                        ACCC_09_CRC ACCC_11_CRC ACCC_12_CRC ACCC_13_CRC ACCC_14_CRC ACCC_15_CRC
1:143767144-143767881:-          NA          NA          NA          NA          NA          NA
1:143896452-143913143:-          NA          NA          NA          NA          NA          NA
1:144146808-144224481:+          NA          NA          NA          NA          NA          NA
1:144339738-144521058:+          NA          NA          NA          NA          NA          NA
1:144363462-144364246:-          NA          NA          NA          NA          NA          NA
1:144811744-144830413:+          NA          NA          NA          NA          NA          NA
                        ACCC_16_CRC ACCC_17_CRC ACCC_18_CRC ACCC_19_CRC ACCC_20_CRC ACCC_21_CRC
1:143767144-143767881:-          NA          NA           4          NA           4          NA
1:143896452-143913143:-          NA          NA           4          NA           4          NA
1:144146808-144224481:+          NA          NA           4          NA           4          NA
1:144339738-144521058:+          NA          NA           4          NA           4          NA
1:144363462-144364246:-          NA          NA           4          NA           4          NA
1:144811744-144830413:+          NA          NA           4          NA           4          NA
                        ACCC_22_CRC ACCC_23_CRC ACCC_25_CRC ACCC_27_CRC ACCC_28_CRC ACCC_29_CRC
1:143767144-143767881:-          NA          NA          NA          NA          NA          NA
1:143896452-143913143:-          NA          NA          NA          NA          NA          NA
1:144146808-144224481:+          NA          NA          NA          NA          NA          NA
1:144339738-144521058:+          NA          NA          NA          NA          NA          NA
1:144363462-144364246:-          NA          NA          NA          NA          NA          NA
1:144811744-144830413:+          NA          NA          NA          NA          NA          NA
                        ACCC_30_CRC ACCC_32_CRC ACCC_33_CRC ACCC_34_CRC
1:143767144-143767881:-          NA          NA           3          NA
1:143896452-143913143:-          NA          NA           3          NA
1:144146808-144224481:+          NA          NA           3          NA
1:144339738-144521058:+          NA          NA           3          NA
1:144363462-144364246:-          NA          NA           3          NA
1:144811744-144830413:+          NA          NA           3          NA

rownames(xx2) <- cgenes$symbol
x.sel <- xx2[!duplicated(rownames(xx2)),]

head(x.sel[1:2,])
        ACCC_01_CRC ACCC_02_CRC ACCC_03_CRC ACCC_05_CRC ACCC_06_CRC ACCC_08_CRC ACCC_09_CRC
PPIAL4G          NA          NA          NA          NA          NA          NA          NA
FAM72D           NA          NA          NA          NA          NA          NA          NA
        ACCC_11_CRC ACCC_12_CRC ACCC_13_CRC ACCC_14_CRC ACCC_15_CRC ACCC_16_CRC ACCC_17_CRC
PPIAL4G          NA          NA          NA          NA          NA          NA          NA
FAM72D           NA          NA          NA          NA          NA          NA          NA
        ACCC_18_CRC ACCC_19_CRC ACCC_20_CRC ACCC_21_CRC ACCC_22_CRC ACCC_23_CRC ACCC_25_CRC
PPIAL4G           4          NA           4          NA          NA          NA          NA
FAM72D            4          NA           4          NA          NA          NA          NA
        ACCC_27_CRC ACCC_28_CRC ACCC_29_CRC ACCC_30_CRC ACCC_32_CRC ACCC_33_CRC ACCC_34_CRC
PPIAL4G          NA          NA          NA          NA          NA           3          NA
FAM72D           NA          NA          NA          NA          NA           3          NA

However, if you see from the above x.sel matrix, PPIAL4G gene is altered in 3 patients, whereas its frequency in the cgenes object from above, is 5; how you explained this discrepancy? Is any step from above wrong? Or there is another explanation for this difference? For example, the same applies for the second gene with symbol FAM72D; which in the final patient matrix is altered in 2 patients, whereas again showed a frequency of 5 samples in cgenes; Could this be attributed to the function populationRanges, where it summarizes individual calls to more "discrete CNV regions" that overlap between samples? And when I assign the frequency from the sel.grl.freq object, it is erroneous? And frequency should only be assessed at this stage? Or it should not be 5 in the above genes, but as in the final x.sel matrix?

Just to pinpoint, in my input txt file, the only "deviations" from the tutorial were: instead chr1, chr2, the chromosome column value has 1, 2, etc.; In addition, I have include copy number states above 4, like 5, 6 as some samples have triploid phenotypes; Do I need anything from above 4 to convert it to 4, or it is not harmful?

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] grid      stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] EnsDb.Hsapiens.v75_2.99.0 ensembldb_2.18.3          AnnotationFilter_1.18.0  
 [4] GenomicFeatures_1.46.4    AnnotationDbi_1.56.2      Biobase_2.54.0           
 [7] Gviz_1.38.3               CNVRanger_1.11.2          RaggedExperiment_1.18.0  
[10] GenomicRanges_1.46.1      GenomeInfoDb_1.30.1       IRanges_2.28.0           
[13] S4Vectors_0.32.3          BiocGenerics_0.40.0       data.table_1.14.2        
[16] forcats_0.5.1             stringr_1.4.0             dplyr_1.0.8              
[19] purrr_0.3.4               readr_2.1.2               tidyr_1.2.0              
[22] tibble_3.1.6              ggplot2_3.3.5             tidyverse_1.3.1          
[25] pak_0.2.1

With Kind Regards,

Efstathios

Input.UpdatedCopyNumbers.grl.CRC.18022022.rds.zip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants