Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/bruker data #275

Merged
merged 52 commits into from
Sep 29, 2023
Merged

Feature/bruker data #275

merged 52 commits into from
Sep 29, 2023

Conversation

jspaezp
Copy link

@jspaezp jspaezp commented Aug 6, 2023

TODO

  • Add support for non-archived .d files
  • Docs
  • inline docs
  • Make input naming ocnsistent (right now a lot of places are called mzml and shoudl be ms_file)

PR checklist

  • This comment contains a description of changes (with reason).
    • This PR adds support for .d 'files' and passing them directly to DIA-NN for DIA search. As well as related changes to support intermediate files not being exclusively mzMLs
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/quantms branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>). Note locally this does not run anything.
  • Usage Documentation in docs/usage.md is updated. Depends on Important! Template update for nf-core/tools v2.9 nf-core/quantms#103
  • Output Documentation in docs/output.md is updated. Depends on Important! Template update for nf-core/tools v2.9 nf-core/quantms#103
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).
    • Should a 'Contribtors' section be added and list myself in it?

@sonatype-lift
Copy link

sonatype-lift bot commented Aug 6, 2023

Sonatype Lift is retiring

Sonatype Lift will be retiring on Sep 12, 2023, with its analysis stopping on Aug 12, 2023. We understand that this news may come as a disappointment, and Sonatype is committed to helping you transition off it seamlessly. If you’d like to retain your data, please export your issues from the web console.
We are extremely grateful and thank you for your support over the years.

📖 Read about the impacts and timeline

bin/diann_convert.py Outdated Show resolved Hide resolved
bin/diann_convert.py Outdated Show resolved Hide resolved
@ypriverol
Copy link
Member

@jspaezp this PR from nf-core templates nf-core#103 would be good to merged into your branch. While it creates some unnecessary work, it is better to integrate in an early stage than later to avoid the need to change more files when we go for release.

@jspaezp
Copy link
Author

jspaezp commented Sep 10, 2023

  • Most of the datasets in ProteomeXchange has the extension .d.zip which means that we should support also that format apart from .d.gz

that sounds good! the only issue here is that the current base docker image I am using for it does not have zip built-in. solutions: A: build and host a new image with zip, B add to the documentation the note saying that if zip is desired, the container needs to be over-written (pass the burden of zip to the user).

This would be pretty much the dockerfile for reference.

FROM continuumio/miniconda3:23.5.2-0-alpine
RUN apk add --update zip

EDIT: https://github.com/jspaezp/miniconda-alpine-zip/pkgs/container/miniconda-alpine-zip I created the image here and will be use it for now, we can figure out any alternatives later

This is great, thanks a lot!

subworkflows/local/file_preparation.nf Outdated Show resolved Hide resolved
bin/diann_convert.py Show resolved Hide resolved
bin/diann_convert.py Outdated Show resolved Hide resolved
jspaezp and others added 3 commits September 10, 2023 01:50
* changed files to paths in final diann analysis
* updated decompression to support zip
* added exception when the decompressed file already matches the required pattern
@jspaezp
Copy link
Author

jspaezp commented Sep 19, 2023

I am happy to let you know that I think this is ready to have a final review/merge.

Since the last update I:

  1. Added support for .zip files, I still do not like them and they do not offer any significant compression advantage, therefore PRIDE should suggest people to use an archival-uncompressed method.
  2. Added correct handling of file paths when decompressing the files.
  3. Made sure pmultiqc worked (thank you for the heavy lifting on that end!)

I am testing it using the following data:

Input sdrf:

source name	characteristics[organism]	characteristics[organism part]	characteristics[cell type]	characteristics[disease]	characteristics[cell line]	characteristics[biological replicate]	assay name	comment[technical replicate]	comment[data file]	comment[fraction identifier]	comment[label]	comment[cleavage agent details]	comment[instrument]	comment[proteomics data acquisition method]	comment[modification parameters]	comment[modification parameters]	comment[precursor mass tolerance]	comment[fragment mass tolerance]	comment[file uri]	factor value[phenotype]
Sample 1	homo sapiens	cancer	cells	not applicable	not applicable	1	Run 1	1	3817_TIMS2_2col-80m_37_1_Slot1-46_1_4768.d.zip	1	AC=MS:1002038;NT=label free sample	AC=MS:1001251;NT=Trypsin	AC=MS:1003231;NT=TimsTOF SCP	NT=Data-Independent Acquisition;AC=NCIT:C161786	NT=Oxidation; MT=Variable; TA=M; AC=Unimod:35	NT=Carbamidomethyl; MT=Fixed; TA=C; AC=Unimod:4	15 ppm	15 ppm	https://ftp.pride.ebi.ac.uk/pride/data/archive/2023/05/PXD037164/3817_TIMS2_2col-80m_37_1_Slot1-46_1_4768.d.zip	A
Sample 2	homo sapiens	cancer	cells	not applicable	not applicable	2	Run 2	1	3817_TIMS2_2col-80m_38_2_Slot1-47_1_4816.d.zip	1	AC=MS:1002038;NT=label free sample	AC=MS:1001251;NT=Trypsin	AC=MS:1003231;NT=TimsTOF SCP	NT=Data-Independent Acquisition;AC=NCIT:C161786	NT=Oxidation; MT=Variable; TA=M; AC=Unimod:35	NT=Carbamidomethyl; MT=Fixed; TA=C; AC=Unimod:4	15 ppm	15 ppm	https://ftp.pride.ebi.ac.uk/pride/data/archive/2023/05/PXD037164/3817_TIMS2_2col-80m_38_2_Slot1-47_1_4816.d.zip	A
Sample 3	homo sapiens	cancer	cells	not applicable	not applicable	1	Run 1	1	3817_TIMS2_2col-80m_13_1_Slot1-22_1_4772.d.zip	1	AC=MS:1002038;NT=label free sample	AC=MS:1001251;NT=Trypsin	AC=MS:1003231;NT=TimsTOF SCP	NT=Data-Independent Acquisition;AC=NCIT:C161786	NT=Oxidation; MT=Variable; TA=M; AC=Unimod:35	NT=Carbamidomethyl; MT=Fixed; TA=C; AC=Unimod:4	15 ppm	15 ppm	https://ftp.pride.ebi.ac.uk/pride/data/archive/2023/05/PXD037164/3817_TIMS2_2col-80m_13_1_Slot1-22_1_4772.d.zip	B
Sample 4	homo sapiens	cancer	cells	not applicable	not applicable	2	Run 2	1	3817_TIMS2_2col-80m_14_1_Slot1-23_1_4690.d.zip	1	AC=MS:1002038;NT=label free sample	AC=MS:1001251;NT=Trypsin	AC=MS:1003231;NT=TimsTOF SCP	NT=Data-Independent Acquisition;AC=NCIT:C161786	NT=Oxidation; MT=Variable; TA=M; AC=Unimod:35	NT=Carbamidomethyl; MT=Fixed; TA=C; AC=Unimod:4	15 ppm	15 ppm	https://ftp.pride.ebi.ac.uk/pride/data/archive/2023/05/PXD037164/3817_TIMS2_2col-80m_14_1_Slot1-23_1_4690.d.zip

Nextflow config

// Pipeline Parameters
params {
    // Input options
    input = 'THAT_SDRF.SDRF.TSV'
    database = 'SOMEHUMANFASTA.fasta'

    skip_post_msstats = true
    add_decoys = true
    acquisition_method = 'dia'

    // DIA-NN related
    mass_acc_automatic = false
    diann_normalize = false
    diann_speclib = ''

}

@jspaezp jspaezp marked this pull request as ready for review September 19, 2023 17:01
@jspaezp jspaezp changed the title [DO NOT MERGE, WORK IN PROGRESS] Feature/bruker data Feature/bruker data Sep 19, 2023
@ypriverol ypriverol requested review from daichengxin and removed request for benpullman and WangHong007 September 23, 2023 06:48
modules/local/diannsummary/main.nf Outdated Show resolved Hide resolved
modules/local/dotd_to_mqc/main.nf Outdated Show resolved Hide resolved
modules/local/pmultiqc/main.nf Show resolved Hide resolved
modules/local/tdf2mzml/main.nf Outdated Show resolved Hide resolved
@@ -156,6 +156,9 @@ params {
add_triqler_output = false
quantify_decoys = false

// Bruker data
convert_dotd = false
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah nice, so with this we could still force conversion to mzml if ever needed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed that is how it would work! the current implementation uses tdf2mzml to generate the mzML from .d files and that gets passed to DIA-NN. I have not really tested that it gives good results (In my experience it doesnt ... ) but it would be a possibility.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting! Yes it is perfectly fine how it is, as a last resort/fallback.

Out of curiosity: According to your experience, what goes wrong when using converted mzmls? Is it just runtime/storage or also quality of the results?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well ... I have not tried in a while, but If my memory is not failing me it was all of the former to different degrees depending on how the mzml was generaged.

  1. If the mzml was generated with arguments/sotware/modes that collapsed the mobility dimension ("centroided over it"), results tend to be poor, I believe this happens becuase the noise that can be easily identified as noise by resolving on the mobility dimension gets over-represented, which leads to "virtually poor" scan quality.
  2. In the case of software that does not collapse the mobility dimension, the first thing is that files end up being absoluteley massive. Since each "scan" along the mobility dimension ends up being hundreds of scans in the mzml, the resulting file is the equivalent of having an instrument scanning at +1_000 Hz. Which makes any software reading it very slow and disk usage absolutely horrendous.

I have not tried in a while and we have been exploring different ways to have a good intermediate file format ... the search remains :P

pyproject.toml Show resolved Hide resolved
logger = getLogger(__name__)

SECOND_RESOLUTION = 5
MQC_YML = """
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed?
What is the difference to the multiqc yml file in the repo?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds as part of the qc information on the chromatogram and total ions that is extracted directly from the .d files in a distributed way.

This specific yml "file" is not used unless multiqc is run in a standalone way (and it lets me test it in a much faster way than having it in a sharded way as part of the pipeline)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I guess that's fine. Might just be awkward if the two versions get out-of-sync at some point. Not sure how likely that is.

I didn't check, but is this documented? Maybe add a comment about its "standalone usage" to that variable or the function that writes out its contents.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am adding a couple of comments and more information on the module docstring for the standalone usage (outside of the quantms pipeline).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on f1b0bf7

@ypriverol ypriverol self-requested a review September 24, 2023 05:37
Copy link
Member

@ypriverol ypriverol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jspaezp check some of my comments, solutions for which containers must be used to be able to deploy using conda/docker/singularity.

Let me know what do you think.

modules/local/decompress_dotd/main.nf Outdated Show resolved Hide resolved
modules/local/dotd_to_mqc/main.nf Outdated Show resolved Hide resolved
@jspaezp
Copy link
Author

jspaezp commented Sep 28, 2023

Hello @ypriverol, @jpfeuffer, @benpullman and @WangHong007 Thank you very much for the review and suggestions!

I believe I have incorporated all the suggestions and notes. I am happy to take into account any other suggestions you might have and to work together in the future!

Kindest wishes,
Sebastian

@ypriverol ypriverol self-requested a review September 29, 2023 05:54
@ypriverol ypriverol merged commit f48b65c into bigbio:dev Sep 29, 2023
15 of 17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Question] mz calculation when converting DIANN outputs
5 participants