This project provides a workflow to prepare your own training datasets from raw files and convert them into tensors. Those tensors are the input for the machine learning architectures, which can be built using the TensorFlow or PyTorch frameworks.
Here we predict the full MS/MS spectrum as a set of (Mi, Ii) pairs, where Mi is the mass of a peak and Ii is its intensity. For the BiGRU model, the MS/MS spectrum is represented as the b/y ion series described in the original Prosit paper. A minimal sketch of both representations is shown after the list below.
Two machine learning architectures are demonstrated.
- Transformer: predicts the full MS/MS spectrum using a Transformer architecture
- BiGRU: predicts b/y ion intensities using a BiGRU architecture
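To make the two output representations concrete, here is a minimal sketch; the bin width, m/z range, and array shapes are illustrative assumptions rather than CoSpred's actual settings.

```python
import numpy as np

# Hypothetical peak list: (mass, intensity) pairs as described above.
peaks = [(175.119, 0.35), (401.251, 1.00), (504.260, 0.62)]

# Full-spectrum representation: intensities accumulated into fixed m/z bins.
# Bin width and m/z range are illustrative, not CoSpred's actual settings.
BIN_WIDTH = 0.1
MAX_MZ = 2000.0
full_spectrum = np.zeros(int(MAX_MZ / BIN_WIDTH))
for mz, intensity in peaks:
    idx = int(mz / BIN_WIDTH)
    if idx < full_spectrum.size:
        full_spectrum[idx] += intensity

# b/y ion representation (Prosit-style): a fixed-length vector indexed by
# ion type, fragment position, and charge; a toy 2 x 29 x 3 layout here.
by_ions = np.zeros((2, 29, 3))   # (ion type b/y, fragment position, charge 1-3)
by_ions[1, 3, 0] = 1.00          # e.g. y4 fragment, charge 1+
```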
To best test and experience the software, we recommend using the Docker environment first. All software dependencies are pre-installed in the Docker image, while model weights and example data are provided in the capsule folder.
- Docker Community Edition (CE)
- nvidia-container-runtime for code that leverages the GPU
- NVIDIA Container Toolkit to use GPU in docker container.
- Updated GCC compiler providing GLIBCXX version >= 3.4.29
- Optional: Conda/Mamba installed
- Git clone the repo, download the pre-trained model `pretrained_models.zip` and the example data `example.zip` from FigShare, create a data folder `CoSpred/data`, and store the two zip files there.
docker pull xuel12pfizer/cospred:v0.3
The provided dockerfile will allow you to reproduce the workflow and results published by the author on your local machine by following the instructions below.
If there's any software requiring a license that needs to be run during the build stage, you'll need to make your license available.
In your terminal, navigate to the `CoSpred` folder with the example data in `data` that you extracted before, and execute the following command:
docker build . --tag cospred_docker   # builds the image from the provided Dockerfile
This step will recreate the environment (i.e., the Docker image) locally, fetching and installing any required dependencies in the process. If any external resources have become unavailable for any reason, the environment will fail to build. Note that in this example the Docker image is named `cospred_docker`, which will be referred to in the following sections.
Install all the packages given in `conf/requirements_cospred_*.yml` in a virtual environment.
conda env create -n cospred_cuda12_gpu_py39 -f conf/requirements_cospred_cuda12_gpu_py39.yml
The following is for the purpose of experiencing the end-to-end workflow. It assumes the example data `example.zip` and the pre-trained model `pretrained_models.zip` downloaded from FigShare have been kept in the `data` folder. Execute the following commands to run the Docker container using the image named `cospred_docker` that was built in the previous section, adjusting parameters as needed (e.g., for a machine without a GPU, remove the option flag `--gpus all`; for systems that restrict shared memory for the GPU, adjust the `--shm-size` flag; or change anything else specific to your Docker environment).
docker run --platform linux/amd64 --rm --gpus all \
--volume "$PWD/data":/data \
--volume "$PWD/results":/results \
cospred_docker bash "scripts/run_training.sh"
docker run --platform linux/amd64 --rm --gpus all \
--volume "$PWD/data":/data \
--volume "$PWD/results":/results \
cospred_docker bash "scripts/run_prediction.sh"
- When finished, the final results will be stored in the `results` folder.
For advanced usage, the following step-by-step guides provide fine-grained control over the modular execution of CoSpred.
- OPTION 1: Docker with interactive mode
docker run --platform linux/amd64 --rm -it --gpus all \
--volume "$PWD/data":/data \
--volume "$PWD/results":/results \
--shm-size=32g \
cospred_docker
- OPTION 2: Virtual environment
conda activate cospred_cuda12_gpu_py39
Once the environment setup is done, navigate to the `CoSpred` working directory and move on to the following steps.
- Main configuration regarding file locations and preprocessing parameters can be found and modified in the `params` folder.
- For all the following modules, log files can be found under the `prediction` folder.
- For BiGRU-based models, training parameters and model setup can be found in JSON files under the `model_spectra` folder. Edit `model_construct.py` to modify the architecture and generate a compatible `model.json` file. Two examples, `model_byion.json` and `model_fullspectrum.json`, are provided.
python model_construct.py
- For Transformer-based model setup, users can directly modify `cospred_model/model/transformerEncoder.py`, as well as the `train_transformer` module in `training_cospred.py`.
- Create a PSM file with identified peptides from software such as Proteome Discoverer, corresponding to each raw file from the experiment.
- Convert raw files into mzML and MGF
MSConvert can be run using the GUI version of the software on a Windows computer, or via Docker on a Linux machine. We recommend running MSConvert through the Windows GUI. At the end, `*.mzML` and `*.mgf` files should be generated. Keep these two types of files in the folders `data/example/mzml` and `data/example/mgf`, respectively, together with `example_PSMs.txt` and `example_InputFiles.txt` obtained from Proteome Discoverer in the folder `data/example`. Note that all file and folder names can be defined in `params/constants_location.py`.
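As an optional sanity check before preprocessing, a snippet like the one below (not part of the repository; the paths mirror the defaults described above and may differ if you edited `params/constants_location.py`) confirms that the expected inputs are in place.

```python
from pathlib import Path

# Expected layout from the steps above; adjust if constants_location.py was edited.
expected = [
    Path("data/example/mzml"),                    # *.mzML files from MSConvert
    Path("data/example/mgf"),                     # *.mgf files from MSConvert
    Path("data/example/example_PSMs.txt"),        # PSM export from Proteome Discoverer
    Path("data/example/example_InputFiles.txt"),  # input-file table from Proteome Discoverer
]
for path in expected:
    status = "OK" if path.exists() else "MISSING"
    print(f"{status:8s} {path}")
```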
- OPTION 1: The MGF file doesn't contain sequence information
- Split the dataset into train and test sets (about 15 minutes for 300k spectra). By default, 20% of the spectra will be randomly selected for the test set, which can be modified in the script; a simplified sketch of the split logic follows the command below. `example_train.mgf` and `example_test.mgf` will be generated in this step. `rawfile2hdf_byion.py` (preparing the dataset with b/y ion annotation) and `rawfile2hdf_cospred.py` (preparing the dataset for full-spectrum representation) are the scripts for this purpose. (2019 MacBook Pro, about 2 minutes for the example dataset.)
python rawfile2hdf_cospred.py -w split
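Conceptually, the split step behaves like the simplified sketch below (an illustration only, not the actual `rawfile2hdf_cospred.py` code; the input path is assumed to match the example layout): spectra are shuffled and 20% are written to the test MGF.

```python
import random
from pyteomics import mgf

# Simplified illustration of an 80/20 split; the real script does more bookkeeping.
TEST_FRACTION = 0.2
spectra = list(mgf.read("data/example/mgf/example.mgf"))  # input path is an assumption
random.seed(42)
random.shuffle(spectra)

n_test = int(len(spectra) * TEST_FRACTION)
mgf.write(spectra[n_test:], output="example_train.mgf")
mgf.write(spectra[:n_test], output="example_test.mgf")
```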
- OPTION 1.1: Pair database search results with MGF spectra and annotate b and y ions for each MS/MS spectrum.
Pyteomics is used to parse annotations of y and b ions and their fragment charges from the mzML and MGF files, and to write annotated MGF files for downstream spectrum prediction/viewing applications. Note that to parse the input file correctly, you will likely need to adjust the regex routine (in the `reformatMGF` function within `io_cospred.py`) according to the specific MGF format you are using; see the sketch after the commands below. (2019 MacBook Pro, about 1 hour for the example dataset.)
python rawfile2hdf_byion.py -w train
python rawfile2hdf_byion.py -w test
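As an illustration of the kind of adjustment needed, the sketch below parses scan numbers out of MSConvert-style TITLE lines; the regex shown is an assumption for that style and is not the pattern shipped in `reformatMGF`, so adapt it to your own MGF titles.

```python
import re
from pyteomics import mgf

# Illustrative pattern for MSConvert-style titles containing "... scan=1234 ...".
# Adjust this regex to whatever your TITLE lines actually look like.
TITLE_RE = re.compile(r"scan=(\d+)")

with mgf.read("data/example/mgf/example.mgf") as reader:  # path is an assumption
    for spectrum in reader:
        title = spectrum["params"]["title"]
        match = TITLE_RE.search(title)
        scan = int(match.group(1)) if match else None
        # The scan number is what lets a spectrum be paired with its PSM row.
        print(title, "->", scan)
        break
```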
- OPTION 1.2: Pair database search results with MGF spectra and reformat to full MS/MS spectra using bins. (2019 MacBook Pro, about 2 minutes for the example dataset.)
python rawfile2hdf_cospred.py -w train
python rawfile2hdf_cospred.py -w test
- OPTION 2: For an MGF file with assigned peptide sequences (e.g. `example.mgf`), reformat to full MS/MS spectra using bins. Note: you may need to reformat the MGF so that the peptide sequence representation is compatible with downstream steps.
python mgf2hdf_cospred.py -w reformat # Reformat the MGF for CoSpred workflow.
python mgf2hdf_cospred.py -w split_usi # Split the dataset into train and test set.
python mgf2hdf_cospred.py -w train # Convert training set into full spectrum bins
python mgf2hdf_cospred.py -w test # Convert testing set into full spectrum bins
At the end, a few files will be generated. `train.hdf5` and `test.hdf5` are the input files for the following ML modules.
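To confirm what preprocessing produced, a short inspection snippet such as the one below can list the datasets inside the HDF5 files; it makes no assumption about specific dataset names, since these depend on which workflow (b/y ion or full spectrum) was run.

```python
import h5py

# Print every group/dataset name and, where available, its shape.
with h5py.File("train.hdf5", "r") as f:
    f.visititems(lambda name, obj: print(name, getattr(obj, "shape", "")))
```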
`training_cospred.py` is the script for customized training. Workflows can be selected via arguments, including: 1) `-t`: fine-tune / continue training an existing model; 2) `-f`: opt in to the full MS/MS spectrum model instead of b/y ions; 3) `-c`: chunk the input dataset (to prevent memory overflow on large datasets); 4) `-b`: opt in to the BiGRU model instead of the Transformer.
python training_cospred.py -b # Training B/Y ion spectrum prediction using BiGRU architecture.
python training_cospred.py # Training B/Y ion spectrum prediction using Transformer architecture.
python training_cospred.py -bf # Training full spectrum prediction using BiGRU architecture.
python training_cospred.py -f # Training full spectrum prediction using Transformer architecture.
python training_cospred.py -bft # Fine-tuning full spectrum prediction using BiGRU architecture with pre-trained weights.
During training, model weight files will be auto-generated after each epoch under the `model_spectra` folder. File names follow the patterns below (a helper for picking the lowest-loss checkpoint is sketched after the list):
- For B/Y ion, BiGRU model: `prosit_byion_[YYYYMMDD]_[HHMMSS]_epoch[integer]_loss[numeric].hdf5`
- For full spectrum, BiGRU model: `prosit_full_[YYYYMMDD]_[HHMMSS]_epoch[integer]_loss[numeric].hdf5`
- For B/Y ion, Transformer model: `transformer_byion_[YYYYMMDD]_[HHMMSS]_epoch[integer]_loss[numeric].pt`
- For full spectrum, Transformer model: `transformer_full_[YYYYMMDD]_[HHMMSS]_epoch[integer]_loss[numeric].pt`
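Because the epoch number and loss are encoded in each checkpoint filename, picking the best epoch can be automated. The helper below is an optional convenience sketch, not part of CoSpred, and assumes the naming patterns listed above.

```python
import re
from pathlib import Path

# Parse "...epoch<N>_loss<value>.hdf5/.pt" filenames and keep the lowest-loss checkpoint.
PATTERN = re.compile(r"epoch(\d+)_loss([0-9.]+)\.(?:hdf5|pt)$")

def best_checkpoint(model_dir="model_spectra"):
    candidates = []
    for path in Path(model_dir).iterdir():
        match = PATTERN.search(path.name)
        if match:
            candidates.append((float(match.group(2)), path))  # (loss, file)
    return min(candidates)[1] if candidates else None

print(best_checkpoint())
```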
To fine-tune the foundation model or re-train the model, the following scripts and parameters need to be modified. In this demo, we use the non-Unimod chemical modification "Desthiobiotin" (DTBIA) as an example.
- Ensure that the related information for (DTBIA) is properly added to your `constants.py` file, as follows.
# add to alphabet
ALPHABET = {
    "C(DTBIA)": 26,  # Alphabet
}

# define the mass
MODIFICATION = {
    'DTBIA': 296.185,
}

# add to amino acid
AMINO_ACID["C(DTBIA)"] = AMINO_ACID["C"] + MODIFICATION["DTBIA"]

# define the chemical composition
MODIFICATION_COMPOSITION = {
    'C(DTBIA)': {'H': 24, 'C': 14, 'O': 3, 'N': 4},  # Chemical composition
}

# annotate the novel modification with ProForma, so that the pyteomics library can parse it
VARMOD_PROFORMA = {
    'C(DTBIA)': 'C[+296.185]',
}
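After editing `constants.py`, a quick consistency check such as the sketch below helps catch typos in the new entries before retraining. This is not part of CoSpred; the `from params import constants` import path is an assumption and may need to match your local layout.

```python
# Sanity-check the new modification entries; the import path is an assumption.
from params import constants

key = "C(DTBIA)"
assert key in constants.ALPHABET, "missing from ALPHABET"
assert key in constants.AMINO_ACID, "missing from AMINO_ACID"
assert key in constants.MODIFICATION_COMPOSITION, "missing chemical composition"
assert key in constants.VARMOD_PROFORMA, "missing ProForma annotation"

# The modified residue mass should equal the unmodified Cys mass plus the DTBIA delta.
expected = constants.AMINO_ACID["C"] + constants.MODIFICATION["DTBIA"]
assert abs(constants.AMINO_ACID[key] - expected) < 1e-6
print("C(DTBIA) entries look consistent.")
```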
Keep the best model under the `model_spectra` folder as the trained model for the inference phase. Some pre-trained models can be downloaded from FigShare.
With trained models, predict spectra given the peptide sequences in `peptidelist_test.csv`. All inference results, including metrics, plots, and the spectral library, will be stored under the `prediction` folder. Predicted spectra will be stored in `speclib_prediction.msp`.
Workflows can be selected via arguments, including: 1) `-f`: opt in to the full MS/MS spectrum model instead of b/y ions; 2) `-c`: chunk the input dataset (to prevent memory overflow on large datasets); 3) `-b`: opt in to the BiGRU model instead of the Transformer. Some examples are shown below.
python prediction.py -b # Predict B/Y ion spectra using the BiGRU architecture.
python prediction.py # Predict B/Y ion spectra using the Transformer architecture.
python prediction.py -bf # Predict full spectra using the BiGRU architecture.
python prediction.py -f # Predict full spectra using the Transformer architecture.
python prediction.py -fc # Predict full spectra using the Transformer architecture, with chunking to accommodate a large peptide list.
Optionally, performance evaluation can be executed with the `-e` argument, as long as ground truth, either a) `test.hdf5` or b) `example_PSMs.txt` together with `test_usi.mgf`, is provided so that reference spectra for the peptides can be extracted from the database search results and the raw mass-spec data. Examples are shown below.
python prediction.py -be # Predict and evaluate B/Y ion spectra using the BiGRU architecture.
python prediction.py -e # Predict and evaluate B/Y ion spectra using the Transformer architecture.
python prediction.py -bfe # Predict and evaluate full spectra using the BiGRU architecture.
python prediction.py -fe # Predict and evaluate full spectra using the Transformer architecture.
python prediction.py -fce # Predict and evaluate full spectra using the Transformer architecture, with chunking to accommodate a large peptide list.
The prediction outputs will be generated under `prediction`, including the predicted spectral libraries `speclib_prediction.msp` and `speclib_prediction.mgf`, plots and metrics under `prediction_library`, and some other intermediate files for record keeping or diagnostic purposes.
Predicted spectra and mirror plots for visual evaluation can be generated separately by `spectra_plot.py`. By default, the required inputs are `peptidelist_predict.csv` (peptide list), `test_reformatted.mgf` (reference spectra), and `speclib_prediction.mgf` (predicted spectra). File names and locations can be defined in `params/constants`. Plots will be stored in the `prediction/plot` folder.
python spectra_plot.py
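For a quick custom view outside of `spectra_plot.py`, the mirror-plot idea can also be reproduced with a minimal matplotlib sketch like the one below; the file paths and the pairing of the first spectrum from each file are simplifying assumptions, and `spectra_plot.py` remains the supported route.

```python
import matplotlib.pyplot as plt
from pyteomics import mgf

# Read one reference and one predicted spectrum; pairing the first entry of each
# file is a simplification for illustration only.
reference = next(iter(mgf.read("test_reformatted.mgf")))
predicted = next(iter(mgf.read("speclib_prediction.mgf")))

# Normalize intensities so the two halves of the mirror are comparable.
ref_int = reference["intensity array"] / reference["intensity array"].max()
pred_int = predicted["intensity array"] / predicted["intensity array"].max()

fig, ax = plt.subplots(figsize=(8, 4))
ax.vlines(reference["m/z array"], 0, ref_int, color="tab:blue", label="reference")
ax.vlines(predicted["m/z array"], 0, -pred_int, color="tab:red", label="predicted")
ax.axhline(0, color="black", linewidth=0.8)
ax.set_xlabel("m/z")
ax.set_ylabel("relative intensity (reference up, predicted down)")
ax.legend()
plt.tight_layout()
plt.savefig("mirror_plot_example.png")
```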