Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

couldn't find matching transcriptome, returning non-ranged SummarizedExperiment (alignment mode) #72

Closed
errcricket opened this issue Dec 5, 2022 · 4 comments

Comments

@errcricket
Copy link

Greetings,

I am following this vignette: https://bioconductor.org/packages/release/bioc/vignettes/tximeta/inst/doc/tximeta.html#Pre-computed_checksums.

I am trying to use tximeta on data from salmon's alignment-based mode and I am getting a couldn't find matching transcriptome, returning non-ranged SummarizedExperiment error.

I believe I am using the most recent versions of Bioconductor and Tximeta. I am using tximeta 1.16.0 and BiocVersion 3.16.

> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.6 (Ootpa)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblaso-r0.3.15.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocGenerics_0.44.0 tximeta_1.16.0     

loaded via a namespace (and not attached):
  [1] ProtGenerics_1.30.0           bitops_1.0-7                  matrixStats_0.63.0            bit64_4.0.5                  
  [5] filelock_1.0.2                progress_1.2.2                httr_1.4.4                    GenomeInfoDb_1.34.3          
  [9] tools_4.2.1                   utf8_1.2.2                    R6_2.5.1                      DBI_1.1.3                    
 [13] lazyeval_0.2.2                withr_2.5.0                   tidyselect_1.2.0              prettyunits_1.1.1            
 [17] bit_4.0.5                     curl_4.3.3                    compiler_4.2.1                cli_3.4.1                    
 [21] Biobase_2.58.0                xml2_1.3.3                    DelayedArray_0.24.0           rtracklayer_1.58.0           
 [25] readr_2.1.3                   rappdirs_0.3.3                stringr_1.4.1                 digest_0.6.30                
 [29] Rsamtools_2.14.0              XVector_0.38.0                pkgconfig_2.0.3               htmltools_0.5.3              
 [33] MatrixGenerics_1.10.0         dbplyr_2.2.1                  fastmap_1.1.0                 ensembldb_2.20.2             
 [37] rlang_1.0.6                   rstudioapi_0.14               RSQLite_2.2.18                shiny_1.7.3                  
 [41] BiocIO_1.8.0                  generics_0.1.3                jsonlite_1.8.3                vroom_1.6.0                  
 [45] BiocParallel_1.32.1           dplyr_1.0.10                  RCurl_1.98-1.9                magrittr_2.0.3               
 [49] GenomeInfoDbData_1.2.9        Matrix_1.5-3                  Rcpp_1.0.9                    S4Vectors_0.36.0             
 [53] fansi_1.0.3                   lifecycle_1.0.3               stringi_1.7.8                 yaml_2.3.6                   
 [57] SummarizedExperiment_1.28.0   zlibbioc_1.44.0               BiocFileCache_2.4.0           AnnotationHub_3.4.0          
 [61] grid_4.2.1                    blob_1.2.3                    parallel_4.2.1                promises_1.2.0.1             
 [65] crayon_1.5.2                  lattice_0.20-45               Biostrings_2.66.0             GenomicFeatures_1.50.2       
 [69] hms_1.1.2                     KEGGREST_1.38.0               pillar_1.8.1                  GenomicRanges_1.50.1         
 [73] rjson_0.2.21                  codetools_0.2-18              biomaRt_2.52.0                stats4_4.2.1                 
 [77] XML_3.99-0.12                 glue_1.6.2                    BiocVersion_3.16.0            BiocManager_1.30.19          
 [81] tzdb_0.3.0                    png_0.1-7                     vctrs_0.5.1                   httpuv_1.6.6                 
 [85] purrr_0.3.5                   assertthat_0.2.1              cachem_1.0.6                  mime_0.12                    
 [89] xtable_1.8-4                  restfulr_0.0.15               AnnotationFilter_1.22.0       later_1.3.0                  
 [93] tibble_3.1.8                  GenomicAlignments_1.34.0      AnnotationDbi_1.60.0          memoise_2.0.1                
 [97] IRanges_2.32.0                tximport_1.26.0               ellipsis_0.3.2                interactiveDisplayBase_1.36.0

How I generated my data from paired fastq files...

# define variables
REFERENCE_ENSEMBLE_FASTA="reference_data/mouse/GRCm39/fasta/Mus_musculus.GRCm39.dna.primary_assembly.fa" 
REFERENCE_ENSEMBLE_TRANSCRIPT="reference_data/mouse/GRCm39/fasta/Mus_musculus.GRCm39.cdna.all.fa" 
REFERENCE_ENSEMBLE_CHR_GTF="reference_data/mouse/GRCm39/annotations/Mus_musculus.GRCm39.108.chr.gtf"
GFFREAD_TRANSCRIPT_FILE_CHR="reference_data/mouse/GRCm39/fasta/Mus_musculus.GRCm39.108.chr_GFFREAD.fa"
SALMON_TRANSCRIPTS_INDEX="reference_data/mouse/GRCm39/salmon_transcripts_index/"

# create transcript file
gffread -w $GFFREAD_TRANSCRIPT_FILE_CHR -g $REFERENCE_ENSEMBLE_FASTA $REFERENCE_ENSEMBLE_CHR_GTF

# create star index folder
STAR --runThreadN 20 --runMode genomeGenerate --genomeDir $STAR_INDEX_DIR_CHR --genomeFastaFiles $REFERENCE_ENSEMBLE_FASTA --sjdbGTFfile $REFERENCE_ENSEMBLE_CHR_GTF --sjdbOverhang 75

# align each fastq files (paired) to reference, make bam files
STAR --runThreadN 18 --genomeDir $STAR_INDEX_DIR_CHR --readFilesIn $fq1 $fq2 --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM GeneCounts --readFilesCommand zcat --outFileNamePrefix $aligned_sample_dir_chr --outReadsUnmapped Fastx

#create Salmon index
salmon index -t $REFERENCE_ENSEMBLE_FASTA -i $SALMON_TRANSCRIPTS_INDEX --decoys decoys.txt -k 31 -p 12

# quantify aligned bam files with salmon
salmon quant -t $GFFREAD_TRANSCRIPT_FILE_CHR -l A -p 10 -a $aligned_sample_dir_chr"Aligned.toTranscriptome.out.bam" -o $salmon_quant
suppressPackageStartupMessages(library('BiocVersion', lib.loc='/home/reichenbee/franco_lab/tools/R421'))

#reading in sample information
csvfile <- read.csv('experimental_groups.txt', header=T, sep='\t')

coldata <- csvfile
coldata$files <- file.path('salmon_quant_CHR', coldata$sample, 'quant.sf') #add file names/location
colnames(coldata)[1] <- 'names'
coldata <- coldata[14, ] # take one sample
coldata

#   names  group genotype                quant                         files
# 14   B69 cancer     XXXX salmon_quant_CHR/B69 salmon_quant_CHR/B69/quant.sf

se <- tximeta(coldata)

# importing quantifications
# reading in files with read_tsv
# 1 
# couldn't find matching transcriptome, returning non-ranged SummarizedExperiment

Based on this older post: #38, I went into one of the cmd_info.json files and placed "index" and path to the salmon index directory inside.

{
    "salmon_version": "1.9.0",
    "targets": "reference_data/mouse/GRCm39/fasta/Mus_musculus.GRCm39.108.chr_GFFREAD.fa",
    "libType": "A",
    "threads": "10",
    "alignments": "star_aligned_sample_files_CHR/B69/Aligned.toTranscriptome.out.bam",
    "output": "salmon_quant_CHR/B69/",
    "auxDir": "aux_info",
    "index": "salmon_transcripts_index/"
}

I re-ran it and received the same error message.

I did try adding hash strings as suggested, but I was not really sure how to organize them inside the aux_info/meta_info.json file and got errors like this:

Error in parse_con(txt, bigint_as_char) : 
  parse error: invalid object key (must be a string)
          bc4deb376dcf88a2e45878911e", } 
                     (right here) ------^

Aside from using an incorrect reference file, I am out of error origin ideas. Any information that can resolve this error is greatly appreciated.

@mikelove
Copy link
Collaborator

mikelove commented Dec 5, 2022

So if I'm reading correctly, the Ensembl-based index is not being used during quantification. I'm not sure it's a good idea to hack the metadata in this way. It looks like you want tximeta to see one index while it's not being used for quantification. This kind of breaks the idea of intrinsically connecting quantification output to the provenance of the metadata used for quantification. Potentially the downstream functions of tximeta won't be guaranteed to work or give correct output.

Maybe I'm misreading the code though.

I think an easier way to achieve your desired result would be skipMeta=TRUE which gives you an SE, and then just manually add the rowRanges() <- with the GRanges of transcripts?

@errcricket
Copy link
Author

errcricket commented Dec 5, 2022

Thank you for the speedy response!

I doubt you are misreading my code. I have never used Tximeta before and I am using my own data while trying to work through the vignette (https://www.bioconductor.org/packages/devel/bioc/vignettes/tximeta/inst/doc/tximeta.html).

You are correct that the Ensembl-based index is not being used during quantification (as in mentioned in the alignment-based mode of Salmon).

I reverted back to the original json files and adding skipMeta=TRUE allowed the command to run. Thank you.

I am not sure how to handle your rowRanges() comment. Is there a way to automate this for alignment-based data so that I can continue to work through the vignette/tutorial?

@mikelove
Copy link
Collaborator

mikelove commented Dec 5, 2022

So, re: rowRanges(), this is entirely optional. This is if you want to have GRanges associated with your rows, but you don't need to have this for 99% of Bioconductor workflows, e.g. DE or gene set enrichment.

We automate this when we recognize the index used for quantification, but it's just a "nice to have", you can actually just skip this and work with the SE data.

@errcricket
Copy link
Author

Thank you Mike.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants