Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(proteome-discoverer): Add validation for input columns #87

Merged
merged 2 commits into from
Feb 15, 2024

Conversation

tonywu1999
Copy link
Contributor

@tonywu1999 tonywu1999 commented Jan 24, 2024

Motivation and Context

We encountered this issue where the PD to MSstatsTMT converter is having problems with returning a useful validation message. In .mergeAnnotation, it spits out an error message saying to inspect the annotation file, but in reality, it's the input PD file that is missing columns.

Changes

  • Added column name validation via a separate helper function .validatePDTMTInputColumns to the PD to MSstatsTMT clean function (raw file cleaning occurs before annotation file merging)
    • Checks that "ProteinAccessions", "XProteins", "AnnotatedSequence", and "SpectrumFile" columns exist from PD input.
    • Refactored channels column validation to be in the helper function.
  • Also addressed this issue where PD 3.1+ changed column names from "# Proteins" and "# Protein Groups" to "Number of Proteins" and "Number of Protein Groups" respectively. Modified code to handle using both old and new column names.

Testing

  • Added unit tests for clean PD for TMT function and verified happy path is still successful
  • Added unit tests where columns are missing and verified error occurs with the expected message
  • Added unit tests for where column names are changed and verified happy path is successful.

Checklist Before Requesting a Review

  • I have read this repository's contributing guidelines
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

@mstaniak
Copy link
Contributor

Good start, but let's use this occasion to deal with column changes between PD versions first.
@devonjkohler do we have examples of data from current and older versions of PD? I'm pretty sure for example column that denotes number of proteins in a protein group changed its name and users had problems with it. If not, I'll try to find it in PD docs.

@tonywu1999
Copy link
Contributor Author

tonywu1999 commented Feb 2, 2024

@mstaniak

I took a look at the PD docs but couldn't find documentation discussing the new schema (where numProteins is X.Proteins or X.Protein.Groups), but got documentation regarding the old schema (e.g. docs page 317 where numProteins is "# Protein Groups" or "# Proteins"). Devon pointed me to the MSstats Google Drive, where I found some datasets with the old format as well.

I think next step is to modify the code such that it can determine whether to use the "# Protein Groups" or "# Proteins" columns vs "X.Protein.Groups" or "X.Proteins", likely based on the input dataset provided. I'll add unit tests as well verifying the converter works as expected with a dataset from an older version of PD.

@tonywu1999
Copy link
Contributor Author

tonywu1999 commented Feb 13, 2024

As discussed offline, this issue illustrates that PD 3.1+ changed column names from "# Proteins" and "# Protein Groups" to "Number of Proteins" and "Number of Protein Groups" respectively. This change aligns with what we see in the PD documentation.

This issue also contains PD sample data where the column names also changed to "Number of Proteins" and "Number of Protein Groups".

@tonywu1999 tonywu1999 merged commit 13127c9 into master Feb 15, 2024
1 check passed
@tonywu1999 tonywu1999 deleted the feature-annotation-validation branch March 7, 2024 17:58
mstaniak pushed a commit that referenced this pull request Apr 29, 2024
* refactor(proteome-discoverer): Add validation for input columns

* Allow PD to MSstatsTMT converter to handle both old and new column names for # Proteins and # Protein Groups
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants