Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

USI reported by quantms for DIANN experiments. #350

Open
3 tasks
ypriverol opened this issue Jan 25, 2024 · 27 comments
Open
3 tasks

USI reported by quantms for DIANN experiments. #350

ypriverol opened this issue Jan 25, 2024 · 27 comments
Assignees
Labels
bug Something isn't working dia analysis documentation Improvements or additions to documentation

Comments

@ypriverol
Copy link
Member

ypriverol commented Jan 25, 2024

Description of the bug

@WangHong007 @zprobot @daichengxin:

I have been recently testing multiple USIs generated by quantms for DIA experiments. A little background about the problem, USIs are a way to reference directly to the scan and spectrum that was used to identify the spectrum, in someway is the fundamental evidence on DDA identification; here is an example.

All DDA search engines keep track of the scan that was used to identify the spectrum. However, in DIA experiments other features are also relevant and DIANN do not trace in the output files of the scan number that was used to identify the peptide. In quantms, we have implemented a logic to "find" for every peptide the scan number used to identify the peptide. @zprobot provided all the USIs for reanalysis PXD019909; however, when we were doing a visualization of multiple USIs they look like random USI meaning the spectrum looks like do not correspond to the given peptide, see example.

I propose the following follow up:

  • @WangHong007 can you explain in a comment on this thread in details how the scan are retrieved. Probably @vdemichev can double-check that our logic is correct.
  • @zprobot can you provide for other reanalyses the corresponding USIs and the Posterior error probabilities. The idea is that we can check for the best IDs (lower PEP) in their USIs. It is important to have the USIs and PEP for other projects (reanalyses) to see if the problems are related with the specific project.
  • Double check that for this project PXD019909 the USIs are well annotated @WangHong007 @zprobot. If they are well annotated, and we check USIs for good IDs (lower PEP) we can try to discuss with @vdemichev what is the problem on this project.

Please let me know if you need more discussion.

Additional examples that looks wrong:

Command used and terminal output

No response

Relevant files

No response

System information

No response

@ypriverol ypriverol added the bug Something isn't working label Jan 25, 2024
@ypriverol ypriverol added documentation Improvements or additions to documentation dia analysis labels Jan 25, 2024
@vdemichev
Copy link

In non-tims data the apex RT reported by DIA-NN should uniquely identify the scan.
Why do those examples look wrong? They have some fragments annotated.

@ypriverol
Copy link
Member Author

Thanks, @vdemichev, for responding. We were having a look to some of them and saw not to many fragments, for example https://www.ebi.ac.uk/pride/archive/usi?usi=mzspec%3APXD019909%3A20180914_QE8_nLC0_BDA_SA_DIA_Keratinocytes_NN002%3Ascan%3A7150%3AGKQEEEKPGEEK%2F2&resultType=FULL. If you use in the viewer mass type mono and 40 ppm you will see only 4 b-ions and 2 y-ions, the Proline for example is not even identified. I'm not used to see DIA spectra in viz tools, my question is:

  • Is this kind of number of fragment annotations normally? It could be a good ID only with 6 ions.
  • Is this because apart from the fragment ions, other info is used as precursor mz and rt?

USIs and viz of spectra in DDA is quite common, see this https://www.ebi.ac.uk/pride/archive/usi?usi=mzspec:PXD000561:Adult_Frontalcortex_bRP_Elite_85_f09:scan:17555:VLHPLEGAVVIIFK/2 we are exploring and checking if this may sense for DIA to write some guidelines about DIA USIs. Your feedback is more than welcome. BTW, we are the only one's generating USIs for DIA & DIANN experiments.

@jpfeuffer
Copy link
Collaborator

Hi all,

I am currently looking into similar visualizations from quantms DIA-NN output.
I think DIA-NN uses much more than just a single MS2 spectrum for identification.
What you ideally want to display is a view as in Skyline or OpenSWATH, where you visualize the co-elution of fragment ions for a peptide along RT (potentially with competing groups from other peptides). This means for every found fragment you would want start and end RT. I am in the process of checking if this information is available and how to extract it.

DDA-like PSM visualization at the apex can be a supplementary view but might be very convoluted.
Looking forward to more input from @vdemichev, though!

@WangHong007
Copy link

Here the logic how we match diann report to ms_info.tsv:

for n, group in report.groupby(["Run"]):
if isinstance(n, tuple) and len(n) == 1:
# This is here only to support versions of pandas where the groupby
# key is a tuple.
# related: https://github.com/pandas-dev/pandas/pull/51817
n = n[0]
file = __find_info(folder, n)
target = pd.read_csv(file, sep="\t")
group.sort_values(by="RT.Start", inplace=True)
target = target[["Retention_Time", "SpectrumID", "Exp_Mass_To_Charge"]]
target.columns = ["RT.Start", "opt_global_spectrum_reference", "exp_mass_to_charge"]
# Standardize spectrum identifier format for bruker data
if type(target.loc[0, "opt_global_spectrum_reference"]) != str:
target.loc[:, "opt_global_spectrum_reference"] = "scan=" + target.loc[
:, "opt_global_spectrum_reference"
].astype(str)
# TODO seconds returned from precursor.getRT()
target.loc[:, "RT.Start"] = target.apply(lambda x: x["RT.Start"] / 60, axis=1)
out_mztab_PSH = pd.concat([out_mztab_PSH, pd.merge_asof(group, target, on="RT.Start", direction="nearest")])

Basically, we choose 'RT.Start' (seconds as unit) as reteintion time in the report and changed seconds to minutes, at last we merge report and ms_info on column 'RT.start' to make sure matching nearest reteintion times.

@vdemichev
Copy link

I would suggest to use RT instead of RT.Start

@ypriverol
Copy link
Member Author

Hi all,

I am currently looking into similar visualizations from quantms DIA-NN output. I think DIA-NN uses much more than just a single MS2 spectrum for identification. What you ideally want to display is a view as in Skyline or OpenSWATH, where you visualize the co-elution of fragment ions for a peptide along RT (potentially with competing groups from other peptides). This means for every found fragment you would want start and end RT. I am in the process of checking if this information is available and how to extract it.

DDA-like PSM visualization at the apex can be a supplementary view but might be very convoluted. Looking forward to more input from @vdemichev, though!

@jpfeuffer can you provide some ideas and visualization about examples it. I don't have access to Skyline?

@jpfeuffer
Copy link
Collaborator

jpfeuffer commented Jan 25, 2024

Me neither but this is from their tutorial:
image

The top right is I think similar to what you are currently trying to do (probably at the RT apex [= RT] not RT.start or RT.end).
You can also compare with the actual expected library intensities in a mirror plot or something.

The left charts are XICs from precursor isotopes and fragment ions. Those will require a range of MS2 spectra to extract from.

I guess for the use-case of a USI, your approach may be enough if you switch to RT instead of RT.start.

@WangHong007
Copy link

Hi @jpfeuffer @vdemichev, in the discussion above, you mentioned using apex RT to match the RT column in the diann report, and I have an additional question: When we use pyopenms to process mass spectrometry files (such as mzMLs), the getRT() function takes the scan start time in the mzml file as the retention time and converts it to seconds. The retention time here is start RT not apex RT, right? How do we get apex RT of peaks?

@jpfeuffer
Copy link
Collaborator

jpfeuffer commented Jan 25, 2024

I think there might be a confusion between "scan start time" of a spectrum and "elution start time" of a peptide or peptide fragment.

I think you have to compare "RT" (= apex of peptide elution) from DIA-NN with "getRT()" (="scan [start] time") from pyopenms.
Correct me if I am wrong.

@WangHong007
Copy link

Hi @jpfeuffer, as we discussed before, we might need to pick the closest MS2 spectrum that has the precursor mass in its precursor range. Are you suggested to filter MS2 ions for every ion in DIA-NN report and then match the RT?

I used spectrum.getPrecursors()[0].getUnchargedMass() to get precursor mass, but how can I get the mass window of precursors (getMin and getMax are avaliable in spectrums but not in precursors)?

@jpfeuffer
Copy link
Collaborator

You can use p.getMZ() - p.getIsolationWindowLowerOffset()
Or "+ upper offset" accordingly.

I would use that as a test and see if it is really necessary. I would think that the closest spectrum to the RT apex that DIANN reports always has the correct isolation window.

@WangHong007
Copy link

WangHong007 commented Feb 22, 2024

@jpfeuffer @ypriverol After tesing in PXD026600, there are 49 of 26874 ions out of thier precursor mz range when using RT, and 53 of 26874 ions using RT.start.

@jpfeuffer
Copy link
Collaborator

Interesting. Maybe some precursor correction going on?
Are those ones that don't match close to the end of the isolation range? Or far off?

@WangHong007
Copy link

Their distance from their precursor mz window is distributed between 0 and 8.

@vdemichev
Copy link

I would take RT from DIA-NN report, and match it to the scan RT in mzML, as suggested by @jpfeuffer. This can be done regardless of anything in MS1 data and regardless of m/z tolerances - I would ignore those and only use for extracting the apex spectrum - but in this case I would use relatively wide tolerances, as raw mzML is not mass calibrated.

@WangHong007
Copy link

Hi @vdemichev, thanks for your comment and suggestion above all!

When matching the RT of MS2 in the mass spectrum with the RT of ions in the DIA-NN report, what error threshold do you suggest should be set to match? In the past, we looked for the closest one, but there was no error threshold.

@vdemichev
Copy link

What is expected is an ideal match (RT in minutes). What should also be the case is that the corresponding isolation window contians the precursor mass.

@WangHong007
Copy link

Hi @jpfeuffer @vdemichev, here we use the following method to obtain the PSM:
https://github.com/WangHong007/quantms/blob/74c3cd0de07f89e108ac84c6e02159ecd222b501/bin/diann_convert.py#L867-L888

In general, for each ion in report, we find the ions of MS2 that have precursor isolation window which contains the ion's precursor mz, and perform RT matching. Such traversal is extremely time consuming and would be appreciated if you have a more suitable method.

@ypriverol We have added --generate_psm as an option, do we need to keep RT matching the old way, or do we just leave PSM out of mztab?

@vdemichev
Copy link

@WangHong007 what if you just create an RT index for each run (RT bin -> file position mapping), and then query the spectrum corresponding to the closest RT?

@ypriverol
Copy link
Member Author

Hi @jpfeuffer @vdemichev, here we use the following method to obtain the PSM: https://github.com/WangHong007/quantms/blob/74c3cd0de07f89e108ac84c6e02159ecd222b501/bin/diann_convert.py#L867-L888

In general, for each ion in report, we find the ions of MS2 that have precursor isolation window which contains the ion's precursor mz, and perform RT matching. Such traversal is extremely time consuming and would be appreciated if you have a more suitable method.

@ypriverol We have added --generate_psm as an option, do we need to keep RT matching the old way, or do we just leave PSM out of mztab?

@WangHong007 can you generate one example DIA output if USIs. for all of us to test and visualize how they look.

@WangHong007
Copy link

@ypriverol
Copy link
Member Author

@WangHong007 I have searched in PRIDE Archive USI two of the USIs from the file:

Both of these spectra and peptide combination do not have a single ion annotated. Can you lookup how those USI were created, basically look for the scan in the mzML, the RT and the DIANN output. Check How the USI was created.

@WangHong007
Copy link

Hi all! Here are the several important indicators of these two USIs:

USI PEP q value PSM precursor isolation window presursor mz calculated mz RT in report RT in spectrum charge in report charge in spectrum
mzspec:PXD026600:RD139_Narrow_UPS1_0_1fmol_inj1:scan:102207:AAGYELGKDITLAMDCAASEFYK/3 0.00239137 3.87E-09 ms_run[1]:controllerType=0 controllerNumber=1 scan=102207 [838, 846] 842 842.0627 83.8397 85.88199 3 2
mzspec:PXD026600:RD139_Narrow_UPS1_0_1fmol_inj1:scan:46015:AENGEEPIKEHLLLTR/3 0.0287009 0.003265 ms_run[1]:controllerType=0 controllerNumber=1 scan=46015 [614, 622] 618 616.9987 36.8451 41.5181 3 2

@daichengxin
Copy link
Collaborator

Hi all, there are still very few ion matches in some PSMs when taking RT from DIA-NN report, and match it to closest the scan RT in mzML. 20ppm and 40 ppm fragment tolerence are used respectively. regardless of anything in MS1 data. The PXD026600 ions annotation are follow and sorted by Q.Value and PEP. Neutral Loss are considered.

PXD026600_USI_ions_matched_20PPM.zip
PXD026600_USI_ions_matched_40PPM.zip

More ions are matched in PSMs when increasing to 40 ppm. But there are still very few ion matches. A few examples:

  1. mzspec:PXD026600:RD139_Overlap_UPS1_2_5fmol_inj1:scan:15273:VDPPKKPLK/2 two ions matched
  2. mzspec:PXD026600:RD139_Overlap_UPS1_2_5fmol_inj1:scan:91676:LLNTGSLGGILTFR/2 three ions matched
  3. mzspec:PXD026600:RD139_Overlap_UPS1_2_5fmol_inj1:scan:114950:TTTTDDPLQVLQQVLDR/2 two ions matched

So Is this kind of number of fragment annotations normally? if the spectra MS2 we matched is the one that it suppose to be? @ypriverol @vdemichev @jpfeuffer

@jpfeuffer
Copy link
Collaborator

jpfeuffer commented Mar 18, 2024

Can you share an example project as small as possible, with the mzmls, diann_report, and whatever goes into your scripts (e.g. your mzml_statistics/summary files)?

@vdemichev
Copy link

One possibility: ions can be missing from specific scans. DIA-NN looks at the entire XICs around the apex, not just at the apex spectrum, and hence the latter does not necessarily include all ions, although this is most often the case. Also of course any issues with mass calibration can play a role here. You can always verify what exactly DIA-NN is matching to identify the peptide using the --vis command.

@ypriverol
Copy link
Member Author

@vdemichev:

What you are saying is that you can probably have more than one MS2 for a given Peptide, where some ions are in one MS2 and others in another one?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working dia analysis documentation Improvements or additions to documentation
Projects
None yet
Development

No branches or pull requests

5 participants