Timsrust sage error #15
Dear @tomthun, this seems to be diaPASEF data rather than ddaPASEF. You are correct that this is indeed not yet supported. We might look into this in the future, but please note that this is not a trivial task, as the data format is vastly different.
@sander-willems-bruker Thank you very much for looking into this. Support for diaPASEF would be much appreciated! Edit: Do you have a rough estimate of when it could be available?
@tomthun I am not 100% sure what you mean by that. Reading diaPASEF is supported in timsrust (albeit with some limitations and at a reasonably low level). If, on the other hand, you are asking for diaPASEF support in sage, I am unsure if there are any plans in the near future to support DIA data in general in sage. Would you mind elaborating?
I actually thought sage would already support DIA (only roughly, but with chimeric deconvolution). I am currently a bit confused about what is actually supported and what is not. To me it seems that DIA is generally supported. Maybe @lazear could elaborate. I would be very happy if DIA got full support if possible! :)
There is nothing preventing you from theoretically converting your diaPASEF files to mzML/MGF and searching them with Sage - Sage has no concept of PASEF internally though, so I don't know how the results will look. You can also go and search Thermo/Agilent/etc. data in DIA/WWA mode... Bruker is just a bit special 😉 Currently, only reading ddaPASEF .d files is supported, since they can be easily converted to the same internal representation of a spectrum as data from the other vendors.
Thanks for clarifying @tomthun! You are right @lazear, when I mentioned supporting DIA I meant more the dia-umpire or peptide-centric search side of things (sorry for disregarding your contribution to open search MS 😉 ).

Going into why Bruker is special and why converting ddaPASEF is trivial but converting diaPASEF is not (correct me if I am wrong ... this is more my intuition of the problem than a hard answer): ddaPASEF, by virtue of being targeted in the ion mobility space, can be converted to a regular "spectrum" by just collapsing the mobility dimension, thus generating a "profile" scan that can be centroided. On the other hand, if the ion mobility is not targeted, collapsing the ion mobility dimension would lead to a lot of information loss, which makes this conversion less practical (and in my experience noisy enough to not be usable in practice). I would love to hear your thoughts on "centroiding approaches" viable for diaPASEF!
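The "collapsing the mobility dimension" idea can be sketched in a few lines of Rust. This is a hypothetical illustration, not the timsrust code: `tof_indices` and `intensities` stand for the concatenated peaks of all mobility scans in one frame, and peaks are merged purely by identical TOF index.

```rust
use std::collections::BTreeMap;

/// Collapse the ion mobility dimension of one frame by summing intensities
/// that share the same TOF index (i.e. project everything onto the m/z axis).
fn collapse_mobility(tof_indices: &[u32], intensities: &[u32]) -> Vec<(u32, u64)> {
    let mut summed: BTreeMap<u32, u64> = BTreeMap::new();
    for (&tof, &intensity) in tof_indices.iter().zip(intensities.iter()) {
        *summed.entry(tof).or_insert(0) += intensity as u64;
    }
    summed.into_iter().collect()
}

fn main() {
    // Two mobility scans hit TOF index 100: their intensities are merged.
    let collapsed = collapse_mobility(&[100, 250, 100], &[10, 5, 20]);
    println!("{:?}", collapsed); // [(100, 30), (250, 5)]
}
```

For targeted ddaPASEF windows this projection is cheap and mostly harmless; for untargeted diaPASEF windows it is exactly where the information loss discussed above happens.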
It might come as some surprise, but there is actually a package for this: https://github.com/MannLabs/timspeak for clustering and https://github.com/MannLabs/alphasynchro for pseudo-MSMS generation. The packages work both for synchroPASEF as well as diaPASEF. The workflow was never published as part of a paper but is based on very rigorous work from @swillems.
Could you collapse an entire PASEF window into a 2D centroided information table? Basically just create one giant mass spectrum that ultimately adds up hundreds of individual mass scans?

I have no idea what the data structure looks like or whether this would make sense, but from a technological point of view: I think the information of the ion mobility axis can definitely be compressed / information can be lost, as ion mobility just inherently has a very low separation power for different ions compared to the m/z axis... So I think it would be totally okay to collapse all ions of a PASEF window into just one single MS scan with a centroided m/z info and a centroided ion mobility info. This should also drastically reduce file size, correct?

I am probably misunderstanding something here, so please feel free to ignore me ;) Or is this already happening with the clustering in timspeak? Does timspeak actually support writing output to mzML format?
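A minimal sketch of what one row of such a 2D centroided table could look like, assuming the peaks of a single ion cluster have already been grouped (which is the hard part this thread is about). `weighted_centroid` is an illustrative name, not an existing API:

```rust
/// Reduce peaks assumed to belong to one ion cluster to a single
/// (m/z, ion mobility, summed intensity) centroid via intensity weighting.
fn weighted_centroid(mz: &[f64], mobility: &[f64], intensity: &[f64]) -> (f64, f64, f64) {
    let total: f64 = intensity.iter().sum();
    let mz_c: f64 = mz.iter().zip(intensity).map(|(m, i)| m * i).sum::<f64>() / total;
    let im_c: f64 = mobility.iter().zip(intensity).map(|(m, i)| m * i).sum::<f64>() / total;
    (mz_c, im_c, total)
}

fn main() {
    // A cluster of two peaks; the more intense one pulls the centroid its way.
    let (mz, im, total) = weighted_centroid(&[500.0, 500.2], &[0.9, 1.1], &[1.0, 3.0]);
    println!("m/z {:.3}, 1/K0 {:.3}, intensity {}", mz, im, total);
}
```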
A 5 minute DIA run is 1.4 GB in size as a .d folder. Converted with standard MSConvert settings it is 25 GB... I could of course next time activate gzip on top, but for a 30 min gradient it would still be a huge file. There is the timsconvert package https://github.com/gtluu/timsconvert which somehow generates smaller mzML files, but when I push those through sage, it cannot find any precursors with q_val < 0.01. It does find around 25,000 precursors in the 5 min gradient, but all of them have a q_val > 0.11... I guess they also use a custom annotation of ion mobility, which is probably not supported by sage? I know this is no priority of yours, but I would still love to hear your thoughts! I can also share the data if you are interested, of course! Best, Klemens
@KlemensFroehlich I believe we are saying the same thing, same as @GeorgWa. But the point I was trying to make (and apparently didn't convey correctly) was that the exact way in which this centroiding needs to happen is not a trivial decision (there are 19 hyperparameters in the timspeak implementation). I would love to know if @sander-willems-bruker has any guidelines on what method + parameters would be "good enough" for most purposes, or more specifically why the centroiding has been implemented only for ddaPASEF MS2 scans. Best!
Yes @KlemensFroehlich, what you get is in principle a table, which is represented as a list of spectra with defined mobility and RT. The workflow itself is very robust but, as you note @jspaezp, it's not a full end-to-end search engine. Ideally, there would be a minimal required parameter set and all other parameters would be optimized in a feedback loop with the search. Based on my experience, centroiding is quite hard to get right and there are no universal parameters. It usually works best if done as a well-controlled optimization task.
Hi everyone, thanks for clearing things up for me! @GeorgWa, if you aim for an iterative approach to centroiding based on the search results, this means it is not your intention to "just" centroid (and provide an mzML after doing so). It would rather be a long-term goal to provide an end-to-end solution from .d folder to search/quant results?

Going on a rant here, please again ignore me if I ask very stupid questions. Just looking at a background signal here of a highly abundant precursor, which roughly spans from 0.66 to 0.76: most relevant peptide ions are located between 0.7 and 1.3 1/K0. So please forgive my naiveté when I ask: has anyone tried simple binning of the ion mobility dimension? Maybe with overlap of the binning boundaries? This would provide a potentially super quick approach, would reduce file size tremendously and still preserve the ion mobility info. This seems to be implemented in MSConvert. Is that not sufficient here, or what are your thoughts on this? Best Klemens
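Overlapping mobility binning as suggested could be sketched like this. Everything here is hypothetical: the bin origin `lo`, bin `width`, and `overlap` are free parameters (with `width > overlap`), and each peak is simply assigned to every bin whose range contains its mobility.

```rust
/// Return the indices of all overlapping mobility bins that contain `mobility`.
/// Bins start at `lo` and advance by `width - overlap` (requires width > overlap).
fn overlapping_bins(mobility: f64, lo: f64, width: f64, overlap: f64) -> Vec<usize> {
    assert!(width > overlap, "bins must advance by a positive step");
    let step = width - overlap;
    let mut bins = Vec::new();
    let mut b = 0usize;
    loop {
        let start = lo + b as f64 * step;
        if start > mobility {
            break; // every later bin also starts past this peak
        }
        if mobility < start + width {
            bins.push(b);
        }
        b += 1;
    }
    bins
}

fn main() {
    // 0.1-wide bins with 0.05 overlap starting at 1/K0 = 0.6: a peak at 0.72
    // falls into the bins starting at 0.65 and 0.70.
    println!("{:?}", overlapping_bins(0.72, 0.6, 0.1, 0.05)); // [1, 2]
}
```

The overlap means boundary peaks appear in two bins, trading some redundancy for not splitting features at bin edges.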
Hi Klemens, yes, one would fine-tune the hyperparameters based on confident identifications. I'm speaking from a purely theoretical perspective though; we are not planning this at the moment. To my knowledge timspeak only supports MGF. I share your intuition on the resolution of the ion mobility dimension. I think the main benefit manifests on the fragment level, where the strong correlation between precursor m/z and mobility is not an issue. Let's do two things: I will ping Sander and ask him if there is a way to perform centroiding only on the ion mobility level, not on RT and without connecting clusters. I'm also happy to discuss the general matter in more detail on Zoom if you like. @jspaezp or @tomthun or whoever is interested are of course invited to join. Best,
Hi all. It's been a while since this topic was active, but I finally managed to make some headway on this issue. In the upcoming timsrust 0.3.0, DIA windows can be read as if they were spectra. Annotating with sage directly seems to work out of the box with this! Results are not really impressive compared to peptide-centric analyses, but that might also be because I use a very slim (10-20 aa, no missed cleavages, ...) sage/fasta config for quick testing. The good news is, it takes less than a minute to find ~15K peptides in a 22 min HeLa DIA gradient directly from the raw .d (1.7 GB), so very useful for some quick QC!
PR for sage now available: lazear/sage#140
Took me a while to track how it's handled (since so much changed from 0.2 -> 0.3)... so for anyone else who might want to know more, the implementation is here: timsrust/src/io/readers/spectrum_reader/tdf/dia.rs Lines 50 to 78 in 21efdf9
aaaand... timsrust/src/utils/vec_utils.rs Lines 6 to 35 in 21efdf9
So if I understand correctly ... all also ... would you mind providing the sage config you used?
Fully correct @jspaezp! To the best of my knowledge, this is how it is "traditionally" also done for Sciex and Thermo instruments that do not have an IMS dimension. As far as this topic goes, the illustration from @KlemensFroehlich is as close as it gets, although there is no IM centroiding but everything is just straight-up projected onto the TOF (m/z) axis.
The config I use for testing DIA (functionality-wise, not to get the highest numbers out): Side note, since @KlemensFroehlich actually mentioned the use case specifically: short gradients tend to yield too complex DIA "spectra". I can get some results on a 5 min gradient, but longer gradients definitely yield less complex spectra which are far easier to annotate (at least with the search settings I use).
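For orientation, a hypothetical minimal sage config along the lines described above ("10-20 aa, no missed cleavages"), assuming the standard sage JSON schema. This is not the author's actual `current_config.json`; the fasta path and tolerances are placeholders:

```json
{
  "database": {
    "fasta": "human.fasta",
    "enzyme": {
      "cleave_at": "KR",
      "restrict": "P",
      "missed_cleavages": 0,
      "min_len": 10,
      "max_len": 20
    }
  },
  "precursor_tol": { "ppm": [-15, 15] },
  "fragment_tol": { "ppm": [-10, 10] },
  "deisotope": true,
  "chimera": true,
  "report_psms": 1,
  "mzml_paths": ["raw.d"]
}
```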
Hi all 👋 - I'm just curious whether we've explored "centroiding" (I use that term loosely here) the collapsed spectrum along the TOF indices. I could imagine that the same ion measured across multiple spectra in the ion mobility domain could have slightly different TOF indices. If we combine these into a single peak in the collapsed spectrum, then it could potentially boost signal-to-noise. There are a few ways we could do it, but since we're already dealing with a discrete representation of m/z, perhaps something like a simplified version of the peak-picking algorithm used by PTM-Shepherd for aggregating mass shifts in open-modification searches would work well:
Feel free to ignore this comment if it's out-of-scope 😃 |
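The aggregation described above could be sketched as a greedy merge seeded from the most intense peak, loosely in the spirit of (but not identical to) the PTM-Shepherd approach it references. The tolerance `tol`, expressed in TOF index units, is an assumed parameter:

```rust
/// Greedily merge peaks whose TOF indices lie within `tol` of an already
/// accepted peak, seeding from the most intense peak first.
fn merge_by_tof(mut peaks: Vec<(u32, u64)>, tol: u32) -> Vec<(u32, u64)> {
    peaks.sort_by(|a, b| b.1.cmp(&a.1)); // most intense first
    let mut merged: Vec<(u32, u64)> = Vec::new();
    for (tof, intensity) in peaks {
        match merged.iter_mut().find(|(t, _)| t.abs_diff(tof) <= tol) {
            Some(peak) => peak.1 += intensity, // absorb into the nearby seed
            None => merged.push((tof, intensity)),
        }
    }
    merged.sort_by_key(|&(tof, _)| tof);
    merged
}

fn main() {
    // TOF indices 100 and 101 are within tolerance and fuse into one peak.
    let merged = merge_by_tof(vec![(100, 50), (101, 10), (300, 5)], 2);
    println!("{:?}", merged); // [(100, 60), (300, 5)]
}
```

Seeding from the most intense peak keeps strong signals as anchors, so the slightly shifted copies of the same ion are absorbed rather than spawning separate peaks.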
Hi all, over the last couple of months I have thought about this at length, and I actually think I have an idea for how to improve this spectrum aggregation vastly (says the uninformed non-informatician who does not know about nitty-gritty details such as pesky programming or the reality thereof :P ).

Okay, so the "problem" I see with my own suggestion and the current implementation is noise. I propose to only use a fraction of the scans with the same quadrupole isolation margins, with a "rolling average" scheme like so: the idea being that we capture practically all features nicely while still reducing the number of MS2 scans tremendously.

Now, I am saying aggregated scans because I stumbled over this cool publication a while back: would it be possible to implement this or a similar algorithm here? If I understand correctly, the slight changes in m/z should be taken care of and outliers should be rejected (which would be random noise that is not repeatedly observed between adjacent MS2 spectra). Will's idea might also work here? Best Klemens
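The overlapping-chunk ("rolling average") scheme could be sketched as follows. This is a hypothetical illustration: each pseudo-spectrum aggregates `chunk` adjacent mobility scans, and the window advances by `step` scans (e.g. step = chunk / 2 for 50% overlap, matching the "scan 2 overlaps with 1 and 3" picture):

```rust
/// Start/end scan indices of overlapping chunks: `chunk` mobility scans per
/// pseudo-spectrum, advancing by `step` scans (step < chunk gives overlap).
fn rolling_chunks(n_scans: usize, chunk: usize, step: usize) -> Vec<(usize, usize)> {
    assert!(step > 0, "the window must advance");
    let mut ranges = Vec::new();
    let mut start = 0;
    while start + chunk <= n_scans {
        ranges.push((start, start + chunk));
        start += step;
    }
    ranges
}

fn main() {
    // 8 mobility scans, chunks of 4 with 50% overlap -> 3 pseudo-spectra
    // instead of 8 single-scan spectra or 1 fully collapsed spectrum.
    println!("{:?}", rolling_chunks(8, 4, 2)); // [(0, 4), (2, 6), (4, 8)]
}
```

Each range would then be collapsed into one MS2 spectrum, so features are still seen in several adjacent pseudo-spectra while random single-scan noise is diluted.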
Dear @wfondrie, this is actually being done to some extent in the TOF/m/z domain. @jspaezp's message (#15 (comment)) only shows how to create the initial raw spectrum in the DIA case. After that creation, it is always smoothed and centroided, regardless of whether it is DIA or DDA:
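For intuition, smoothing followed by centroiding of a collapsed intensity profile might look like the following simplified sketch; this is not the actual timsrust implementation, just the general shape of such a step:

```rust
/// Moving-average smooth of an intensity profile with a window of
/// 2 * half_width + 1 points (truncated at the edges).
fn smooth(intensities: &[f64], half_width: usize) -> Vec<f64> {
    (0..intensities.len())
        .map(|i| {
            let lo = i.saturating_sub(half_width);
            let hi = (i + half_width + 1).min(intensities.len());
            intensities[lo..hi].iter().sum::<f64>() / (hi - lo) as f64
        })
        .collect()
}

/// Pick centroids as the indices of local maxima in the smoothed profile.
fn centroid(smoothed: &[f64]) -> Vec<usize> {
    (1..smoothed.len().saturating_sub(1))
        .filter(|&i| smoothed[i] > smoothed[i - 1] && smoothed[i] >= smoothed[i + 1])
        .collect()
}

fn main() {
    let profile = [0.0, 1.0, 5.0, 1.0, 0.0, 2.0, 0.0];
    let peaks = centroid(&smooth(&profile, 0)); // half_width 0 = no smoothing
    println!("{:?}", peaks); // [2, 5]
}
```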
I have had comparable ideas as well. It actually is not that difficult to implement, and we could play around with this reasonably well. My main concern would be transparency and user-friendliness of the code. It will require at least a parameter to determine the aggregation width (although I agree there might be some trivial options, this still needs to be transparent for the end user), and perhaps even more (e.g. for overlapping scans such as scan 2, which has 50% overlap with 1 and 3 in your illustration). That said, I do want to emphasize that timsrust is only intended to be a quick library for TIMSTOF data reading. If we want to enable more elaborate processing steps, I would be in favor of doing this in a separate crate altogether. I would not even be opposed to moving all spectrum processing as it is currently implemented to such a crate completely, since in my opinion this is already not raw data anymore.
TimsTOF .d directories are now natively supported by sage: lazear/sage#117 (comment), using timsrust to load them.
I just ran some quick tests on some test data and got the following:
```
(base) PS D:\Data\tools\SAGE> sage .\current_config.json
[2024-02-09T11:03:06Z INFO sage] generated 111120583 fragments, 5992882 peptides in 7285ms
[2024-02-09T11:03:06Z INFO sage] processing files 0 .. 1
thread 'main' panicked at C:\Users\runneradmin\.cargo\registry\src\index.crates.io-6f17d22bba15001f\timsrust-0.2.0\src\file_readers\common\sql_reader.rs:30:62:
called `Result::unwrap()` on an `Err` value: SqliteFailure(Error { code: Unknown, extended_code: 1 }, Some("incomplete input"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```
I use the current_config.json for sage.
The developer of sage suggests that this is an issue in Bruker's timsrust library, hence I wanted to ask here if there is a solution.
Edit: The raw data can be found here. Sorry, I had the wrong settings for the file, but it should now work.