Put simply, this tool takes raw proteomic input and outputs a FASTA file of the organisms most likely to be present in that input. The pipeline uses neural networks to identify peptide sequences from the raw proteomic input; these peptides are then aligned against the UniRef100 protein database using a diamond search. The alignment gives a view of which organisms are most likely present in the proteomic samples, and the pipeline writes a FASTA file covering the most likely organisms identified.
The pipeline uses Python 3.10 and TensorFlow 2.11.0. The full list of requirements can be found in `Kaiko_volume/setup_libraries.txt`.
Before first use, a few files are needed.
1. Run the file `Kaiko_denovo/model/get_data.sh` to download the trained Kaiko denovo model.
2. Download the following files to the `Kaiko_volume/Kaiko_stationary_files` folder:
    - UniRef100 FASTA (large file, 80 GB+)
    - UniRef100 XML (large file, 100 GB+)
    - NCBI Taxonomy dump (less than 1 GB)
    - Diamond search, choosing the version appropriate for your system. If using Docker, get the Linux version.
3. Extract the diamond file downloaded in step 2 into its own folder within `Kaiko_volume/Kaiko_stationary_files`, e.g. `Kaiko_volume/Kaiko_stationary_files/diamond`.
4. Within a command prompt, navigate to the diamond folder created in the previous step and run `diamond makedb --in ../uniref100.fasta.gz --db ../uniref100`. The process can take a while. Note: if using Linux or Mac, replace `diamond` with `./diamond`.
5. Extract the contents of the NCBI Taxonomy dump into its own folder within `Kaiko_volume/Kaiko_stationary_files`, e.g. `Kaiko_volume/Kaiko_stationary_files/ncbi_taxa`.
6. Within a command prompt, navigate to the `Kaiko_volume/Kaiko_stationary_files` folder and run `python ExtractUniRefMembers.py`. This will create the file `uniref100_member_taxa_tbl.csv` within `Kaiko_volume/Kaiko_stationary_files`. Copy this file into the taxa folder created in step 5, e.g. `Kaiko_volume/Kaiko_stationary_files/ncbi_taxa`. This step can also take some time.
When setup is complete, `Kaiko_volume/Kaiko_stationary_files` should have two new files, `uniref100.dmnd` and `uniref100.fasta`. It should also contain two folders, `Kaiko_volume/Kaiko_stationary_files/diamond` and `Kaiko_volume/Kaiko_stationary_files/ncbi_taxa`, if using the default names. The diamond folder should contain the diamond executable, while the taxa folder should contain the contents of the NCBI Taxonomy dump (the .dmp files) and the file `uniref100_member_taxa_tbl.csv`. If the names of these two folders differ from the defaults used in this README, the `config.yaml` file must be edited to point to the new folders; see the repo `config.yaml` for an example.
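For illustration, a config entry pointing at renamed folders might look like the sketch below. The key names `diamond_folder` and `ncbi_taxa_folder` are hypothetical placeholders; mirror the actual keys used in the repo `config.yaml`.

```yaml
# Hypothetical sketch: key names are placeholders, not the repo's actual schema.
diamond_folder: Kaiko_volume/Kaiko_stationary_files/my_diamond  # folder holding the diamond executable
ncbi_taxa_folder: Kaiko_volume/Kaiko_stationary_files/my_taxa   # folder holding the .dmp files and uniref100_member_taxa_tbl.csv
```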
Currently, only .mgf files are supported as input. To run the pipeline, follow these steps.
1. Place the input into a separate folder WITHIN the `Kaiko_volume/Kaiko_input_files/` directory. This folder should have a descriptive name.
2. Edit the `config.yaml` file within the `Kaiko_volume` directory to include the location of the input folder. An example can be found in the current `config.yaml`, and a sketch is shown after this list.
3. Run the command `python Kaiko_pipeline_main.py` within the main directory of this repo. The `kaiko_defaults.yaml` file will fill in any necessary parameters not present in `config.yaml`.
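As a sketch, the input entry in `config.yaml` might look like the following. The key name `mgf_input_folder` and the folder name `my_sample_set` are hypothetical, so copy the exact keys from the repo `config.yaml`.

```yaml
# Hypothetical sketch: key and folder names are placeholders.
# Points the pipeline at an input folder placed under Kaiko_volume/Kaiko_input_files/.
mgf_input_folder: Kaiko_volume/Kaiko_input_files/my_sample_set
```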
The `Kaiko_volume/Kaiko_intermediate/` folder will be populated with a few intermediate files. These are named using the `mgf_input` folder name. The final FASTA output can be found within the `Kaiko_volume/Kaiko_output/` folder, again named using the folder name of the input.
- If you would like to profile the pipeline using cProfile, add the `profile = True` flag to the config file, as sketched below. To use memory-profiler, run `mprof run --include-children Kaiko_pipeline_main.py` from within the main repo directory.
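Assuming the profile flag uses the same YAML syntax as the rest of the config, the entry might look like:

```yaml
# Hypothetical sketch: enables cProfile profiling of the pipeline run.
profile: True
```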
To use the pipeline within Docker, follow steps 1-2 in Usage, then jump here:
3. (Docker) Run the command `docker build -f Dockerfile_tensorflow2.12.0-py310 -t tensorflow2.12.0-py310 .` to build the tensorflow image.
4. (Docker) Run the command `docker build . -t kaiko-py310` to build the Kaiko docker image using the tensorflow image from step 3.
5. (Docker) Run the command `docker run --name Kaiko_container-py310 -v path_Kaiko_volume:/Kaiko_pipeline/Kaiko_volume kaiko-py310 python Kaiko_pipeline_main.py`, where `path_Kaiko_volume` is the absolute path to the `Kaiko_volume` folder. This allows Docker to store the outputs in `Kaiko_volume`. For example, such a command may look like `docker run --name Kaiko_container-py310 -v C:/Users/memmys/Documents/GitHub/Kaiko_pipeline/Kaiko_volume/:/Kaiko_pipeline/Kaiko_volume kaiko-py310 python Kaiko_pipeline_main.py`.
6. (Docker) Make sure to update the config file to point to the Linux version of diamond, as sketched after this list. See the setup instructions above for more details.
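For example, if the Linux build of diamond was extracted to its own folder, the config entry might look like the sketch below. The key name `diamond_folder` and the folder name `diamond_linux` are hypothetical placeholders.

```yaml
# Hypothetical sketch: point the config at the Linux diamond build for Docker runs.
diamond_folder: Kaiko_volume/Kaiko_stationary_files/diamond_linux
```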
The `Kaiko_volume/Kaiko_intermediate/` folder will be populated with a few intermediate files. These are named using the `mgf_input` folder name. The final FASTA output can be found within the `Kaiko_volume/Kaiko_output/` folder, again named using the folder name of the input.
After installing the files, we should ensure the denovo network is producing the expected output given the model. To do this, navigate to the main repo folder in a command prompt and run `python kaiko_unit_test.py`. This runs the denovo model on a predetermined dataset and compares the output line by line to stored output.