-
Notifications
You must be signed in to change notification settings - Fork 0
Gene Family Classifier allows for fast and accurate classification of amplicon (or open reading frame) nucleotide sequences.
License
rdpstaff/gfclassify
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
GFClassify was developed to classify sequences into categories, published in CITATION Copyright 2012 Ribosomal Database Project This project is distributed under the terms of the GNU GPLv3 --- The program uses Interpolated Context Models trained with a set of categorized sequences, to determine the categories of an input set of sequences. This program requires: Glimmer 3.0 an installation of BioPython ------------------- Directories ------------------- src/ - Contains the source for GFClassify (score.cc and score.h) and a make file to build the tool. models/ - Models included with GFClassify, currently amoA, pmoA, nifH 1-5 scripts/ - Helper scripts for testing models and processing GFClassify result files (biopython required for some scripts) examples/ - Contains example sequence files ------------------ BUILDING ------------------ Prep: ------- To build gf_classify, first download Glimmer 3.0 from their website (http://www.cbcb.umd.edu/software/glimmer/) and edit the file src/Makefile. Alter the glimmer_dir to point to the directory where Glimmer was extracted. Type 'make' inside the src/ directory. Should compile on any platform with gcc. Information on BioPython Compiling: ------------ (1) Unpack the files with tar ... (2) Change the Makefile to show where your installation will be. Edit 'Makefile' in the src/ directory Change the glimmer_dir variable to the path where the Glimmer installation is e.g. /home/user/glimmer3.02/ If you're not sure what the path is, at the command line, go into the directory where you have installed Glimmer. Then type 'pwd' The output of that command is the path of your cd-hit directory. (3) In the src/ directory and type 'make' to compile gf_classify. The binary will be in the src directory (src/gf_classify). cd src && make No install option is included, you can run gf_classify from the src directory or copy the executable to a directory on your path (ex: /usr/local/bin), or modify your path to include the src directory. NOTE: Building gf_classify will trigger a build of Glimmer, if for some reason Glimmer compilation fails the gf_classify build will fail too. ----------------------- Example/Testing ----------------------- 1. cd to the root of the gf_classify directory 2. Run GFClassify src/gf_classify -c -t 10 examples/fg_amoa_pmoa_bg.fna models/amoa.icm models/pmoa.icm models/bg.icm > gf_classify.txt This command runs gf_classify telling it to check the reverse complement of all query sequences as well as forward orientation and reject any classifications with log-likelihood <= 10 nats. By default gf_classify writes results to stdout, results can be redirected to file with the > operator on the command line. 3. Summarizing results: cut -f 7 gf_classify.txt | sort | uniq -c The output should be (with the header value 'label' omitted) 984 amoa 10675 bg 979 pmoa 1 rejected 4. Sorting sequences by label scripts/sort_results.py examples/fg_amoa_pmoa_bg.fna gf_classify.txt This will create one file per label in the current direcotry containing all sequences with that label. -------------------- Using GFClassify -------------------- Using gf_classify: ------------------- gf_classify requires one or more Interpolated Context Models (ICMs) to categorize a set of input sequences. ICMs can be created using Glimmer and a training set for each category of interest. See Glimmer documentation for more information. When run with no arguments (or -h) gf_classify will output a brief usage message. USAGE: gf_classify [options] <seq_file> <model1> [model2 ...] The command line program takes a sequence file (in fasta format) and one or more ICMs on the command line. If <seq_file> is - gf_classify will read sequences from stdin instead of from file. [options] -t log likelihood threshold cutoff. Log likelihook must be greater than or equal to the threshold, otherwise the input sequence is rejected. (more explation of what this is and how to calculate it) -c consider both the forward and reverse complement of the sequence -- Output is written to standard out. The first line (starting with a #) contains the column headers. The first three columns in every output will be the query sequence id, sequence description and predicted orientation (+/-). Then there is one column for each input ICM listing the nats score of the query when tested against that model. If more than one ICM is used the last column will contain the label with the highest probability. All probabilities are reported in nats (log base e). The included sort_results.py script can be used to sort query sequences in to fasta files based on predicted gene label. Building ICMs: --------------- ICMs for GFClassify are built using the build-icm program in the Glimmer package. The included grid_search.py script can be used to find optimal values for window width, depth, and periodicy given two or more sets of training sequences. The script takes a list of values for each parameter to test the error rate with n-fold cross validation (specified by the -k argument). We recommend --window 8,9,10,11,12,13,14,15 --depth 7,8,9 --period 1. For more detailed help see the Glimmer 3 documentation. http://www.cbcb.umd.edu/software/glimmer/glim302notes.pdf
About
Gene Family Classifier allows for fast and accurate classification of amplicon (or open reading frame) nucleotide sequences.
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published