Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?
The issue existed in both bulk and single-cell mode
Describe the bug
When using Salmon to quantify non-redundant (NR) genes in metagenomic datasets, the generated output is missing a summary for nucleotide IDs that correspond to multiple sequences.
To Reproduce
Steps and data to reproduce the behavior:
Merging quantifications with Salmon:
salmon quantmerge \
--quants temp/salmon/L1EHI0900465--Q_S1_N6.quant \
-o result/salmon/gene_L1EHI0900465--Q_S1_N6.TPM
Searching for a specific gene ID in the quantification file:
grep "k141_1346622_1" temp/salmon/L1EHI0900465--Q_S1_N6.quant/quant.sf
Multiple lines are found for this gene ID
Searching for the same gene ID in the resulting TPM file:
grep "k141_1346622_1" result/salmon/gene_L1EHI0900465--Q_S1_N6.TPM
#No results are found, which is unexpected
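One thing worth ruling out before concluding that the IDs are duplicated: `grep` matches the pattern anywhere in a line, so `k141_1346622_1` also matches names such as `k141_1346622_10`. A minimal sketch for counting exact `Name` values instead, assuming the standard tab-separated quant.sf layout shown in this issue; the helper name `count_names` and the example rows are made up for illustration:

```python
import csv
import io
from collections import Counter


def count_names(handle):
    """Count how often each exact Name value occurs in a quant.sf-style table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["Name"] for row in reader)


# Tiny made-up table standing in for quant.sf (values are illustrative only).
example = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "geneA_1\t534\t216.520\t1.0\t3.0\n"
    "geneA_1\t384\t99.234\t2.0\t5.0\n"
    "geneA_10\t333\t73.044\t0.0\t0.0\n"
)

counts = count_names(example)
# Keep only names that appear on more than one line.
duplicated = {name: n for name, n in counts.items() if n > 1}
print(duplicated)
```

For a real file, `count_names(open(path, newline=""))` would replace the in-memory example.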
Specifically, please provide at least the following information:
Which version of salmon was used? salmon 1.4.0
How was salmon installed (compiled, downloaded executable, through bioconda)? conda install salmon -y
Which reference (e.g. transcriptome) was used? metagenome data
Which read files were used? L1EHI0900465--Q_S1_N6.quant/
Which program options were used?
salmon quantmerge \
--quants temp/salmon/L1EHI0900465--Q_S1_N6.quant \
-o result/salmon/gene_L1EHI0900465--Q_S1_N6.TPM
Expected behavior
I hope to keep all the gene IDs and, for those that appear on more than one line, take the average of the values for each gene ID.
Updated Expected behavior:
I aim to retain all gene IDs, and for those represented by multiple lines, I intend to calculate the sum of values for each unique gene ID.
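The summing described here could be done as a post-processing step on quant.sf, outside of salmon. The following is only a sketch under that assumption: it is not a `salmon quantmerge` feature, the helper name `sum_by_name` is hypothetical, and the example rows are made up. Whether summing TPM across entries that share an ID is biologically meaningful is a separate question this sketch does not answer.

```python
import csv
import io
from collections import defaultdict


def sum_by_name(handle, columns=("TPM", "NumReads")):
    """Sum the chosen numeric columns over all rows sharing the same Name."""
    totals = defaultdict(lambda: dict.fromkeys(columns, 0.0))
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        for col in columns:
            totals[row["Name"]][col] += float(row[col])
    return dict(totals)


# Made-up rows in the quant.sf column layout (not real results).
example = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "geneA_1\t534\t216.520\t1.5\t3.0\n"
    "geneA_1\t384\t99.234\t2.5\t5.0\n"
    "geneB_1\t333\t73.044\t0.5\t1.0\n"
)

summed = sum_by_name(example)
for name, values in summed.items():
    print(name, values["TPM"], values["NumReads"], sep="\t")
```

Replacing the sum with a mean (the earlier request) would only require tracking a per-name row count and dividing at the end.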
I came across a few posts regarding this issue, but have not found a good solution for salmon quantmerge yet
jiazhou0116 changed the title from "salmon quantmerge skipped the nucleotide IDs that have multiple sequences" to "salmon quantmerge skipped the nucleotide IDs that have multiple sequences - Metagenome dataset" on Jan 31, 2024
In 2018, issue #214 suggested --keepDuplicates for dealing with transcript duplicates. The FAQ (https://combine-lab.github.io/salmon/faq/) also mentions: "If you really want to go through with quantification of sequence duplicates. You can pass --keepDuplicates to the salmon indexing command. This will tell salmon not to discard these duplicates, and they will appear in the output quantifications." From my understanding, however, this applies to sequence-identical duplicates. In our case, the sequences and their full annotations are different, but the shortened gene ID before "#" can be identical for multiple sequences.
e.g.,
After the salmon quant step, the gene ID is shortened, but all entries are kept, even when the same gene ID has different lengths, etc.
Name Length EffectiveLength TPM NumReads
k97_3_1 534 216.520 0.000000 0.000
k97_5_1 384 99.234 0.000000 0.000
k97_6_1 333 73.044 0.000000 0.000
k97_9_1 387 101.041 0.000000 0.000
However, at the salmon quantmerge step, the gene IDs with multiple sequences are removed.
Name NP1.clean.quant
k141_743617_3 0
k141_742060_5 0
k141_910930_3 0.015907
k141_1078715_3 0
k141_527785_4 0
This causes the whole dataset to lose most of its gene information, since genes with multiple sequences may play important biological roles. So I think I need to keep all the genes by relabeling those that have multiple sequences, numbering them in order. I am not sure whether this is something I can do through salmon quantmerge.
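The relabeling idea could look roughly like the sketch below, which appends a running number to every ID that occurs more than once and leaves unique IDs untouched. This is an illustration only: the function name `relabel_duplicates` and the `.` separator are hypothetical choices, the duplicated `k97_3_1` in the example is invented, and it has not been verified here that `salmon quantmerge` keeps entries relabeled this way. Applying the same renaming to the FASTA headers before indexing may be the cleaner route.

```python
from collections import Counter


def relabel_duplicates(names, sep="."):
    """Append a 1-based ordinal to each name that occurs more than once."""
    totals = Counter(names)   # how often each name occurs overall
    seen = Counter()          # how many copies of each name we have emitted
    relabeled = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            relabeled.append(f"{name}{sep}{seen[name]}")
        else:
            relabeled.append(name)
    return relabeled


# Hypothetical input: the first ID is duplicated on purpose.
result = relabel_duplicates(["k97_3_1", "k97_3_1", "k97_5_1"])
print(result)
```

The separator should be one that cannot already occur in the IDs, otherwise a relabeled name could collide with an existing one.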