Put simply, this tool takes raw proteomic input and outputs a FASTA file of the organisms most likely to be present in that input. The pipeline uses neural networks to identify peptide sequences from the raw proteomic input; these peptides are then aligned against the UniRef100 protein database using a diamond search. The alignment gives a view of which organisms are most likely present in the proteomic samples, and the pipeline writes a FASTA file covering the most likely organisms identified.
The pipeline uses Python 3.10 and TensorFlow 2.11.0. The full list of requirements can be found in `Kaiko_volume/setup_libraries.txt`.
Before first use, a few files are needed.
1. Run the file `Kaiko_denovo/model/get_data.sh` to download the trained Kaiko denovo model.
2. Download the following files to the `Kaiko_volume/Kaiko_stationary_files` folder:
    - UniRef100 FASTA (large file, 80 GB+)
    - UniRef100 XML (large file, 100 GB+)
    - NCBI Taxonomy dump (less than 1 GB)
    - Diamond search, choosing the version appropriate for your system. If using Docker, get the Linux version.
3. Extract the diamond file downloaded in step 2 into its own folder within `Kaiko_volume/Kaiko_stationary_files`, e.g. `Kaiko_volume/Kaiko_stationary_files/diamond`.
4. Within a command prompt, navigate to the diamond folder created in the previous step and run `diamond makedb --in ../uniref100.fasta.gz --db ../uniref100`. The process can take a while. Note: if using Linux or Mac, replace `diamond` with `./diamond`.
5. Extract the contents of the NCBI Taxonomy dump into its own folder within `Kaiko_volume/Kaiko_stationary_files`, e.g. `Kaiko_volume/Kaiko_stationary_files/ncbi_taxa`.
6. Within a command prompt, navigate to the `Kaiko_volume/Kaiko_stationary_files` folder and run `python ExtractUniRefMembers.py`. This will create the file `uniref100_member_taxa_tbl.csv` within `Kaiko_volume/Kaiko_stationary_files`. Copy this file into the taxa folder created in step 5, e.g. `Kaiko_volume/Kaiko_stationary_files/ncbi_taxa`. This step can also take some time.
When setup is complete, `Kaiko_volume/Kaiko_stationary_files` should have two new files, `uniref100.dmnd` and `uniref100.fasta`. It should also contain two folders, `Kaiko_volume/Kaiko_stationary_files/diamond` and `Kaiko_volume/Kaiko_stationary_files/ncbi_taxa`, if using the default names. The diamond folder should contain the diamond executable, while the taxa folder should contain the contents of the NCBI Taxonomy dump (the .dmp files) and the file `uniref100_member_taxa_tbl.csv`. If the names of these two folders differ from the defaults used in this README, the `config.yaml` file must be edited to point to the new folders; see the repo `config.yaml` for an example.
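For illustration, a config entry pointing at renamed folders might look like the sketch below. The key names `diamond_folder` and `ncbi_taxa_folder` are hypothetical placeholders; mirror the actual keys used in the repo `config.yaml`.

```yaml
# Hypothetical sketch: key names are placeholders, not the repo's actual schema.
diamond_folder: Kaiko_volume/Kaiko_stationary_files/my_diamond  # folder holding the diamond executable
ncbi_taxa_folder: Kaiko_volume/Kaiko_stationary_files/my_taxa   # folder holding the .dmp files and uniref100_member_taxa_tbl.csv
```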
Currently, only .mgf files are supported as input. To run the pipeline, follow these steps.
1. Place the input into a separate folder WITHIN the `Kaiko_volume/Kaiko_input_files/` directory. This folder should have a descriptive name.
2. Edit the `config.yaml` file within the `Kaiko_volume` directory to include the location of the input folder. An example can be found in the current `config.yaml`, and a sketch is shown after this list.
3. Run the command `python Kaiko_pipeline_main.py` within the main directory of this repo. The `kaiko_defaults.yaml` file will fill in any necessary parameters not present in `config.yaml`.
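As a sketch, the input entry in `config.yaml` might look like the following. The key name `mgf_input_folder` and the folder name `my_sample_set` are hypothetical, so copy the exact keys from the repo `config.yaml`.

```yaml
# Hypothetical sketch: key and folder names are placeholders.
# Points the pipeline at an input folder placed under Kaiko_volume/Kaiko_input_files/.
mgf_input_folder: Kaiko_volume/Kaiko_input_files/my_sample_set
```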
The `Kaiko_volume/Kaiko_intermediate/` folder will be populated with a few intermediate files. These are named using the `mgf_input` folder name. The final FASTA output can be found within the `Kaiko_volume/Kaiko_output/` folder, again named using the folder name of the input.
- If you would like to profile the pipeline using cProfile, add the `profile = True` flag to the config file, as sketched below. To use memory-profiler, run `mprof run --include-children Kaiko_pipeline_main.py` from within the main repo directory.
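Assuming the profile flag uses the same YAML syntax as the rest of the config, the entry might look like:

```yaml
# Hypothetical sketch: enables cProfile profiling of the pipeline run.
profile: True
```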
To use the pipeline within Docker, follow steps 1-2 in Usage, then jump here:
3. (Docker) Run the command `docker build -f Dockerfile_tensorflow2.12.0-py310 -t tensorflow2.12.0-py310 .` to build the tensorflow image.
4. (Docker) Run the command `docker build . -t kaiko-py310` to build the Kaiko docker image using the tensorflow image from step 3.
5. (Docker) Run the command `docker run --name Kaiko_container-py310 -v path_Kaiko_volume:/Kaiko_pipeline/Kaiko_volume kaiko-py310 python Kaiko_pipeline_main.py`, where `path_Kaiko_volume` is the absolute path to the `Kaiko_volume` folder. This allows Docker to store the outputs in `Kaiko_volume`. For example, such a command may look like `docker run --name Kaiko_container-py310 -v C:/Users/memmys/Documents/GitHub/Kaiko_pipeline/Kaiko_volume/:/Kaiko_pipeline/Kaiko_volume kaiko-py310 python Kaiko_pipeline_main.py`.
6. (Docker) Make sure to update the config file to point to the Linux version of diamond, as sketched after this list. See the setup instructions above for more details.
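For example, if the Linux build of diamond was extracted to its own folder, the config entry might look like the sketch below. The key name `diamond_folder` and the folder name `diamond_linux` are hypothetical placeholders.

```yaml
# Hypothetical sketch: point the config at the Linux diamond build for Docker runs.
diamond_folder: Kaiko_volume/Kaiko_stationary_files/diamond_linux
```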
The `Kaiko_volume/Kaiko_intermediate/` folder will be populated with a few intermediate files. These are named using the `mgf_input` folder name. The final FASTA output can be found within the `Kaiko_volume/Kaiko_output/` folder, again named using the folder name of the input.
After installing the files, we should ensure the denovo network is producing the expected output given the model. To do this, navigate to the main repo folder in a command prompt and run `python kaiko_unit_test.py`. This runs the denovo model on a predetermined dataset and compares the output line by line to stored output.