Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Single cell analysis performance with worse results #287

Closed
daichengxin opened this issue Sep 19, 2023 · 69 comments
Closed

Single cell analysis performance with worse results #287

daichengxin opened this issue Sep 19, 2023 · 69 comments
Assignees
Labels
enhancement New feature or request help wanted Extra attention is needed high-priority question Further information is requested

Comments

@daichengxin
Copy link
Collaborator

daichengxin commented Sep 19, 2023

Description of the bug

Analyzing single cell dataset PXD016291. The results folder can be found in here. The MBR is enabled. The results looks quite different from original paper. The less proteins are quantified in single cell samples than original results. And more proteins are misquantified in blank sample than original results. The original paper used MaxQuant with as follows parameters:

All raw files were processed using MaxQuant (version 1.6.3.3) for feature detection, database searching, and protein/peptide quantification. MS/MS spectra were searched against the UniProtKB/Swiss-Prot human database (downloaded on October 26, 2018, containing 20 397 reviewed sequences). N-Terminal protein acetylation and methionine oxidation were selected as variable modifications. Carbamidomethylation of cysteine residues was set as a fixed modification. The peptide mass tolerances of the first search and main search (recalibrated) were <30 and 5 ppm, respectively. The minimum peptide length was six amino acids, and the maximum peptide mass was 4600 Da. Only two missed cleavages were allowed for each peptide. The second peptide search was activated to identify coeluting and cofragmented peptides from one MS/MS spectrum. Both peptides and proteins were filtered with a maximum false discovery rate (FDR) of 0.01. The MBR feature with a matching window of 0.7 min and an alignment window of 20 min, was activated. Label-free quantitation (LFQ) calculations were performed separately in each parameter group containing similar cell loadings. Both unique and razor peptides were selected for protein quantification. Other unmentioned parameters were the MaxQuant default settings. Potential contaminants and reverse sequences were filtered out.

QQ截图20230919173108

Command used and terminal output

$ nextflow run /hps/nobackup/juan/pride/reanalysis/quantms/main.nf -c /hps/nobackup/juan/pride/reanalysis/quantms/nextflow.config -profile ebislurm --input PXD016921-MBR.sdrf.tsv --search_engines comet,msgf --root_folder /hps/nobackup/juan/pride/reanalysis/single-cell/PXD016921/ --local_input_type raw --outdir PXD016921-MBR --database /hps/nobackup/juan/pride/reanalysis/multiomics-configs/databases/Homo-sapiens-uniprot-reviewed-contaminants-decoy-202210.fasta --protein_level_fdr_cutoff 0.01 --psm_level_fdr_cutoff 0.01 --quantify_decoys true --transfer_ids mean --targeted_only false --skip_post_msstats false --enable_pmultiqc true -resume

Relevant files

Original MaxQuant results
quantms results

System information

No response

@daichengxin daichengxin added the bug Something isn't working label Sep 19, 2023
@ypriverol ypriverol added enhancement New feature or request help wanted Extra attention is needed question Further information is requested and removed bug Something isn't working labels Sep 19, 2023
@ypriverol
Copy link
Member

@daichengxin can you add some description about How the data was analyzed with MaxQuant and the original paper.

@jpfeuffer
Copy link
Collaborator

Thank you for the analysis. Very interesting. Can we maybe run the analysis with looser FDR cutoffs to see if it's more an identification problem or a quantification problem?

@ypriverol
Copy link
Member

@jpfeuffer a couple of ideas:

In single cell, looks for me that MBR would be doing most of the job for the single cell runs. If you see we ID most of the peptides in the 20 and 100 samples (complex samples), more than MQ. Those samples are used as "channels" to look for poor signals from the single cell files. My point is the following:

The MBR feature with a matching window of 0.7 min and an alignment window of 20 min, was activated.

Do you think this 20 min window that they use for MBR in MQ can be the difference with us?

Do you think we need to do something in the experimental design.? For the MBR we are annotating every sample as a biological replicate to enable MBRs.

@jpfeuffer
Copy link
Collaborator

jpfeuffer commented Sep 19, 2023

Yes, if we find out that it's not the identification performance we will check quantification.
But if you have only 100 proteins that pass FDR, you cannot quantify more than that and it is useless to check quantification.

@ypriverol
Copy link
Member

@daichengxin can we check in the output of MQ of the original project how many peptides from the single cell samples are quantify/identified using MBR, I think MQ use a notation like No MS/MS or something like that.

@daichengxin
Copy link
Collaborator Author

daichengxin commented Sep 19, 2023

More than half of the unique peptides originated with the MBR.
image

@ypriverol
Copy link
Member

This is what I thought, see @jpfeuffer The majority of peptides are not coming from IDs but from the MBR, then we should see how we fail to do proper MBR in our side, my previous two comments.

@jpfeuffer
Copy link
Collaborator

Still.. if they don't pass FDR, there is nothing to quantify. So we have to wait for the ID analysis

@daichengxin
Copy link
Collaborator Author

Almost all of the peptides quantified only in MQ are from MBR. However, for quantms, these peptides also are only quantified in other files rather than from same file. it suggest we fail to do proper MBR, right?

image
image
image
image

@jpfeuffer
Copy link
Collaborator

jpfeuffer commented Sep 21, 2023

Can you please add the command you used for quantms?

@daichengxin
Copy link
Collaborator Author

nextflow run /hps/nobackup/juan/pride/reanalysis/quantms/main.nf -c /hps/nobackup/juan/pride/reanalysis/quantms/nextflow.config -profile ebislurm --input PXD016921-MBR.sdrf.tsv --search_engines comet,msgf --root_folder /hps/nobackup/juan/pride/reanalysis/single-cell/PXD016921/ --local_input_type raw --outdir PXD016921-MBR --database /hps/nobackup/juan/pride/reanalysis/multiomics-configs/databases/Homo-sapiens-uniprot-reviewed-contaminants-decoy-202210.fasta --protein_level_fdr_cutoff 0.01 --psm_level_fdr_cutoff 0.01 --quantify_decoys true --transfer_ids mean --targeted_only false --skip_post_msstats false --enable_pmultiqc true -resume @jpfeuffer

@jpfeuffer
Copy link
Collaborator

I don't know. I can't see any signs of transfer in the Proteomicslfq logs.
Do you have the call to proteomicslfq?

@ypriverol
Copy link
Member

The command with the relapsed FDR is:

nextflow run /hps/nobackup/juan/pride/reanalysis/quantms/main.nf -c /hps/nobackup/juan/pride/reanalysis/quantms/nextflow.config -profile ebislurm --input PXD016921-MBR.sdrf.tsv --search_engines comet,msgf --root_folder /hps/nobackup/juan/pride/reanalysis/single-cell/PXD016921/ --local_input_type raw --outdir PXD016921-MBR --database /hps/nobackup/juan/pride/reanalysis/multiomics-configs/databases/Homo-sapiens-uniprot-reviewed-contaminants-decoy-202210.fasta --protein_level_fdr_cutoff 0.15 --psm_level_fdr_cutoff 0.15 --quantify_decoys true --transfer_ids mean --targeted_only false --skip_post_msstats false --enable_pmultiqc true -resume

Here the results: http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD016921-MBR/

@jpfeuffer
Copy link
Collaborator

I found the plfq command. Looks correct.
One thing is, that we only transfer_ids if they occur in more than 50% of the samples. I don't know how MQ does it.

@daichengxin
Copy link
Collaborator Author

@jpfeuffer @ypriverol ran the data with a more relaxed FDR 0.15 at Protein and PSM level. However, the number of quantified peptides did not increase significantly. Similarily, almost all of the peptides quantified only in MQ are from MBR. For quantms, these peptides also are only quantified in other files rather than from same file.
image

@jpfeuffer
Copy link
Collaborator

jpfeuffer commented Sep 21, 2023

I think if we loosen the percentage of aligned samples required to trigger requantification, we should think about an FDR approach for transferred quantification first.
E.g. shifted decoy EICs

@jpfeuffer
Copy link
Collaborator

You can use the new container tomorrow to play around with a parameter that controls this proportion.
You need to expose it quickly in the pipeline.

@ypriverol
Copy link
Member

Can you let me know what is the parameter that we will need to use.

@jpfeuffer
Copy link
Collaborator

id_transfer_threshold

@ypriverol
Copy link
Member

What is the default value?

@timosachsenberg
Copy link

timosachsenberg commented Oct 6, 2023

like

cut -d',' -f 2,11 msstats.csv |  sed 's/([^)]*)//g' | sort | uniq | grep Singlecell4 | wc -l

?
Do you have a command?

These are my values with filtering out modified ones:

Blank 97
Singlecell1 3827
Singlecell2 3172
Singlecell3 3061
Singlecell4 4533
HeLa 8848
100HeLa 11357

@jpfeuffer
Copy link
Collaborator

jpfeuffer commented Oct 7, 2023

I think he means unique as in "unique for a protein". But IIRC we tried to only export unique-to-group peptides to MSstats so Timo's approach might be correct.

@jpfeuffer
Copy link
Collaborator

You need to check what MQ exports and how it defines uniqueness if you want to compare by #identifications. You can check by picking some of the group-unique peptides and see how MQ defines them.
Ideally one would just compare the number of features with an ID though. I have no idea if you can do that with MQ.

@daichengxin
Copy link
Collaborator Author

Here is my script. And i try to run @timosachsenberg command, but got different values from @timosachsenberg.

def sub_mod(peptide):
    peptide = peptide.replace(".", "")
    peptide = re.sub(r"\(.*?\)", "", peptide)
    return peptide

quantms = pd.read_csv("./PXD016921MBRnewSVM/PXD016921-MBR.sdrf_openms_design_msstats_in.csv", sep=',', header=0)
quantms = quantms[-quantms['ProteinName'].str.contains("DECOY_|CONTAMINANT_|REV_|BOVIN")]
quantms["sequence"] = quantms.apply(lambda x: sub_mod(x["PeptideSequence"]), axis=1)
msstats_in_data = quantms.groupby('sequence').filter(lambda x: len(set(x["ProteinName"])) == 1)
print(quantms.groupby("Reference")["sequence"].nunique())
print(quantms.groupby("Reference")["ProteinName"].nunique())

image

@jpfeuffer
Copy link
Collaborator

jpfeuffer commented Oct 7, 2023

How is your MQ script?
And can we put the MQ data next to the quantms results for both UPS and this one?
Edit: Ah I see the MQ files in the beginning of this thread.

@daichengxin
Copy link
Collaborator Author

MQ = pd.read_csv("./MQsearchresults_Single_cell/txt/peptides.txt", sep="\t")
# MQ = MQ[(MQ["Reverse"] != "+")& (MQ["Potential contaminant"] != "+") & (MQ["Unique (Groups)"] == "yes")]
MQ = MQ[(MQ["Reverse"] != "+")& (MQ["Potential contaminant"] != "+")]
print(MQ[-MQ["Identification type Single cell 1"].isna()]["Sequence"].nunique())
print(MQ[-MQ["Identification type Single cell 2"].isna()]["Sequence"].nunique())
print(MQ[-MQ["Identification type Single cell 3"].isna()]["Sequence"].nunique())
print(MQ[-MQ["Identification type Single cell 4"].isna()]["Sequence"].nunique())

MQ = pd.read_csv("./MQsearchresults_Single_cell/txt/proteinGroups.txt", sep="\t")
MQ = MQ[(MQ["Reverse"] != "+")& (MQ["Potential contaminant"] != "+")]
# MQ = MQ[MQ["Unique peptides"] > 0]
print(MQ[-MQ["Identification type Single cell 1"].isna()]["Protein IDs"].nunique())
print(MQ[-MQ["Identification type Single cell 2"].isna()]["Protein IDs"].nunique())
print(MQ[-MQ["Identification type Single cell 3"].isna()]["Protein IDs"].nunique())
print(MQ[-MQ["Identification type Single cell 4"].isna()]["Protein IDs"].nunique())
print(MQ[-MQ["Identification type Blank"].isna()]["Protein IDs"].nunique())
print(MQ[-MQ["Identification type 20 HeLa cells"].isna()]["Protein IDs"].nunique())
print(MQ[-MQ["Identification type 100 HeLa cells"].isna()]["Protein IDs"].nunique())


@daichengxin
Copy link
Collaborator Author

@jpfeuffer
Copy link
Collaborator

I think if you wanted more IDs, one should have used 0.1 and something that is less than 0.75, e.g. 0.5

@jpfeuffer
Copy link
Collaborator

jpfeuffer commented Oct 7, 2023

MQ = pd.read_csv("./MQsearchresults_Single_cell/txt/peptides.txt", sep="\t")
# MQ = MQ[(MQ["Reverse"] != "+")& (MQ["Potential contaminant"] != "+") & (MQ["Unique (Groups)"] == "yes")]
MQ = MQ[(MQ["Reverse"] != "+")& (MQ["Potential contaminant"] != "+")]
print(MQ[-MQ["Identification type Single cell 1"].isna()]["Sequence"].nunique())
print(MQ[-MQ["Identification type Single cell 2"].isna()]["Sequence"].nunique())
print(MQ[-MQ["Identification type Single cell 3"].isna()]["Sequence"].nunique())
print(MQ[-MQ["Identification type Single cell 4"].isna()]["Sequence"].nunique())

MQ = pd.read_csv("./MQsearchresults_Single_cell/txt/proteinGroups.txt", sep="\t")
MQ = MQ[(MQ["Reverse"] != "+")& (MQ["Potential contaminant"] != "+")]
# MQ = MQ[MQ["Unique peptides"] > 0]
print(MQ[-MQ["Identification type Single cell 1"].isna()]["Protein IDs"].nunique())
print(MQ[-MQ["Identification type Single cell 2"].isna()]["Protein IDs"].nunique())
print(MQ[-MQ["Identification type Single cell 3"].isna()]["Protein IDs"].nunique())
print(MQ[-MQ["Identification type Single cell 4"].isna()]["Protein IDs"].nunique())
print(MQ[-MQ["Identification type Blank"].isna()]["Protein IDs"].nunique())
print(MQ[-MQ["Identification type 20 HeLa cells"].isna()]["Protein IDs"].nunique())
print(MQ[-MQ["Identification type 100 HeLa cells"].isna()]["Protein IDs"].nunique())


Why did you comment the filter for Unique (groups)?
This sounded like the right way to compare to our (unfiltered) results.

@daichengxin
Copy link
Collaborator Author

filtered results (unique for protein groups) @jpfeuffer

image

@jpfeuffer
Copy link
Collaborator

jpfeuffer commented Oct 8, 2023

Ok can we compare it with different quantms cutoffs now? Especially something lower than 0.9 for unidentified.

And without .filter(lambda x: len(set(x["ProteinName"])) == 1) for the quantms results. This should be the same as "Group (unique)" then.

@jpfeuffer
Copy link
Collaborator

jpfeuffer commented Oct 8, 2023

I don't think we need to compare protein groups for now. They will (almost) always be less if the peptides are less. I'd rather see multiple cutoffs in the table.

@timosachsenberg
Copy link

Here is my script. And i try to run @timosachsenberg command, but got different values from @timosachsenberg.

This is interesting. I used the old id files that Yasset provided in the original analysis and ran them through the ProteomicsLFQ tool. I got numbers much closer to MQ. I am currently traveling but would like to dig into the exact parameters used a bit more.

@daichengxin
Copy link
Collaborator Author

image

@jpfeuffer
Copy link
Collaborator

Thank you very much for the quick table. Surprisingly very few changes. This must mean that all unidentified Features have scores either below 0.6 or above 0.9.
Maybe an FDR approach is really really needed

@jpfeuffer
Copy link
Collaborator

@timosachsenberg exact parameters are always visible in the pipeline_info folder

@timosachsenberg
Copy link

That's the command I used. FDR settings seem the be the same. Do you see anything I missed?

/ceph/ibmi/abi/projects/sachsenb/OpenMS/openms-build/bin/ProteomicsLFQ \
-in Blank.mzML Singlecell1.mzML Singlecell2.mzML Singlecell3.mzML Singlecell4.mzML 20HeLacells.mzML 100HeLacells.mzML \
-ids Blank_consensus_fdr_filter.idXML Singlecell1_consensus_fdr_filter.idXML Singlecell2_consensus_fdr_filter.idXML Singlecell3_consensus_fdr_filter.idXML Singlecell4_consensus_fdr_filter.idXML 20HeLacells_consensus_fdr_filter.idXML 100HeLacells_consensus_fdr_filter.idXML \
    -design PXD016921-MBR.sdrf_openms_design.tsv \
    -out_cxml notransfer_1.0_1000_star.consensusXML \
    -out_msstats notransfer_1.0_1000_star.csv \
    -out notransfer_1.0_1000_star.mzTab \
    -fasta Homo-sapiens-uniprot-reviewed-contaminants-decoy-202210.fasta \
    -threads 40 \
    -Seeding:intThreshold 1000.0 \
    -protein_inference aggregation \
    -quantification_method feature_intensity \
    -feature_with_id_min_score 0.25 \
    -feature_without_id_min_score 0.75 \
    -targeted_only false \
    -mass_recalibration false \
    -protein_quantification unique_peptides \
    -alignment_order star \
    -PeptideQuantification:quantify_decoys \
    -psmFDR 0.01 \
    -proteinFDR 0.01 \
    -picked_proteinFDR true \
    2>&1 | tee notransfer_1.0_1000_star.txt

@jpfeuffer
Copy link
Collaborator

jpfeuffer commented Oct 9, 2023

No, looks the same as the 0.25_0.75 directory.
@timosachsenberg can you maybe download the Msstats file and run your command on your computer?

https://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD016921-MBR-newSVM-0.25-0.75/proteomicslfq/

@jpfeuffer
Copy link
Collaborator

Maybe the idxmls changed meanwhile? Changes in Sage or percolator adapter? No idea.

@timosachsenberg
Copy link

No, looks the same as the 0.25_0.75 directory. @timosachsenberg can you maybe download the Msstats file and run your command on your computer?

https://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD016921-MBR-newSVM-0.25-0.75/proteomicslfq/

I can confirm that these outputs match the ones Dai reported. Will now check if -Seeding:intThreshold 1000.0 makes such a difference.

@ypriverol
Copy link
Member

It looks like we are using the default parameter for -Seeding:intThreshold. Our command now for proteomicsLFQ is:

    ProteomicsLFQ \\
        -threads ${task.cpus} \\
        -in ${mzml_sorted.join(' ')} \\
        -ids ${id_sorted.join(' ')} \\
        -design ${expdes} \\
        -fasta ${fasta} \\
        -protein_inference ${params.protein_inference_method} \\
        -quantification_method ${params.quantification_method} \\
        -targeted_only ${params.targeted_only} \\
        ${feature_with_id_min_score} \\
        ${feature_without_id_min_score} \\
        -mass_recalibration ${params.mass_recalibration} \\
        -protein_quantification ${params.protein_quant} \\
        -alignment_order ${params.alignment_order} \\
        ${decoys_present} \\
        -psmFDR ${params.psm_level_fdr_cutoff} \\
        -proteinFDR ${params.protein_level_fdr_cutoff} \\
        -picked_proteinFDR ${params.picked_fdr} \\
        -out_cxml ${expdes.baseName}_openms.consensusXML \\
        -out ${expdes.baseName}_openms.mzTab \\
        ${msstats_present} \\
        ${triqler_present} \\
        $args \\
        2>&1 | tee proteomicslfq.log

@timosachsenberg
Copy link

timosachsenberg commented Oct 9, 2023

Yep, main reason seems to be the intensity threshold.
I ran it with 0.1 and 0.9 and get:

intensity > 1000
8
2964
2068
3114
3343
7671
10669

intensity > 10000
7
1563
1408
2358
3334
7619
10503

Dai would it be possible to generate the results for different thresholds using the new parameter?

@jpfeuffer
Copy link
Collaborator

Numbers look really good. I wonder if MQ has a min. feature ity. cutoff.

I guess we need to see the effect on UPS

@daichengxin
Copy link
Collaborator Author

1000 intensity threshold:

image

@ypriverol
Copy link
Member

Looks like the rigth parameters are close to 0.20 - 0.80. @jpfeuffer @timosachsenberg

@ypriverol
Copy link
Member

I will close the current issue in favor of #303 . All discussions about MBR LFQ must be move to that issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed high-priority question Further information is requested
Projects
None yet
Development

No branches or pull requests

4 participants