Command Reference

Complete reference for all bioinfo-tools commands and options.

Global Options

These options are available for the main bioinfo-tools command:

bioinfo-tools [--help] [--version] <command> [<args>]

Options

  • -h, --help - Show help message and exit
  • --version - Show program version number and exit

Available Commands


extract-cds

Filter CDS (coding sequence) features in GenBank files, keeping only those whose gene name appears in the provided list.

Usage

bioinfo-tools extract-cds -i <input_folder> -g <genes_file> -o <output_folder>

Required Arguments

  • --input-folder, -i DIR - Input folder containing GenBank files (.gbk, .gb, .genbank)
  • --genes-list, -g FILE - File containing gene names (one per line)
  • --output-folder, -o DIR - Output folder for filtered GenBank files

Examples

# Basic usage
bioinfo-tools extract-cds -i genbank_files/ -g genes.txt -o filtered_output/

# With long options
bioinfo-tools extract-cds --input-folder genbank_files/ \
                          --genes-list genes.txt \
                          --output-folder filtered_output/

Input Format

genes.txt:

dnaA
rpoB
recA
gyrA
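Gene lists exported from spreadsheets often pick up blank lines, stray whitespace, Windows line endings, or duplicate entries. This reference does not say how bioinfo-tools treats those, so it may be safer to normalise the file first. A minimal sketch, assuming a hypothetical helper named clean_gene_list (it is not part of bioinfo-tools) and a made-up input file genes_raw.txt:

```shell
# clean_gene_list FILE: print one gene name per line with carriage returns,
# surrounding whitespace, blank lines, and duplicates removed.
# Hypothetical helper; not part of bioinfo-tools.
clean_gene_list() {
    tr -d '\r' < "$1" \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        | awk 'NF && !seen[$0]++'
}

# Example: build a messy list, then write the cleaned copy to genes.txt.
printf 'dnaA\r\n  rpoB \n\nrecA\ndnaA\ngyrA\n' > genes_raw.txt
clean_gene_list genes_raw.txt > genes.txt
```

The first occurrence of each name is kept, so the original order of the list is preserved.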

Output

Filtered GenBank files containing only the specified genes and source features.


extract-proteins

Extract CDS translations (amino acid sequences) from GenBank files, organized by gene name.

Usage

bioinfo-tools extract-proteins -i <input_folder> -g <genes_file> -o <output_folder>

Required Arguments

  • --input-folder, -i DIR - Input folder containing GenBank files
  • --genes-list, -g FILE - File containing gene names (one per line)
  • --output-folder, -o DIR - Output folder for amino acid FASTA files

Examples

# Basic usage
bioinfo-tools extract-proteins -i genbank_files/ -g genes.txt -o proteins_output/

# With long options
bioinfo-tools extract-proteins --input-folder genbank_files/ \
                               --genes-list genes.txt \
                               --output-folder proteins_output/

Output Structure

proteins_output/
├── dnaA/
│   ├── genome1.fasta
│   └── genome2.fasta
├── rpoB/
│   ├── genome1.fasta
│   └── genome2.fasta
└── recA/
    └── genome2.fasta

Each FASTA file contains the amino acid sequence for that gene from that genome.
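Downstream tools such as aligners usually expect one multi-record FASTA file per gene rather than one file per genome. The sketch below concatenates each gene folder into a single file. The helper name merge_gene_fastas, the folder names, and the demo sequences and headers are all made up for illustration; the header format that bioinfo-tools actually writes is not documented here.

```shell
# merge_gene_fastas IN_DIR OUT_DIR: for every gene subfolder of IN_DIR, write
# OUT_DIR/<gene>.fasta containing all of that gene's per-genome records.
# Hypothetical helper; not part of bioinfo-tools.
merge_gene_fastas() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for gene_dir in "$in_dir"/*/; do
        [ -d "$gene_dir" ] || continue
        gene=$(basename "$gene_dir")
        # Skip gene folders that hold no FASTA files.
        ls "$gene_dir"*.fasta >/dev/null 2>&1 || continue
        cat "$gene_dir"*.fasta > "$out_dir/$gene.fasta"
    done
}

# Demo tree with placeholder records (not real bioinfo-tools output).
mkdir -p proteins_demo/dnaA proteins_demo/recA
printf '>genome1\nMSLS\n' > proteins_demo/dnaA/genome1.fasta
printf '>genome2\nMSLT\n' > proteins_demo/dnaA/genome2.fasta
printf '>genome2\nMAID\n' > proteins_demo/recA/genome2.fasta
merge_gene_fastas proteins_demo merged_demo
```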


extract-genes

Extract CDS nucleotide sequences from GenBank files, organized by gene name.

Usage

bioinfo-tools extract-genes -i <input_folder> -g <genes_file> -o <output_folder>

Required Arguments

  • --input-folder, -i DIR - Input folder containing GenBank files
  • --genes-list, -g FILE - File containing gene names (one per line)
  • --output-folder, -o DIR - Output folder for nucleotide FASTA files

Examples

# Basic usage
bioinfo-tools extract-genes -i genbank_files/ -g genes.txt -o genes_output/

# With long options
bioinfo-tools extract-genes --input-folder genbank_files/ \
                            --genes-list genes.txt \
                            --output-folder genes_output/

Output Structure

genes_output/
├── dnaA/
│   ├── genome1.fasta
│   └── genome2.fasta
├── rpoB/
│   ├── genome1.fasta
│   └── genome2.fasta
└── recA/
    └── genome2.fasta

Each FASTA file contains the nucleotide sequence for that gene from that genome.
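In the tree above, recA has a file for genome2 only, which suggests that a gene missing from a genome simply produces no file. A quick way to spot such gaps is to count the FASTA files in each gene folder. A minimal sketch, assuming a hypothetical helper named count_fasta_per_gene (not part of bioinfo-tools) and a demo tree that mirrors the layout above:

```shell
# count_fasta_per_gene DIR: print "<gene><TAB><number of .fasta files>" for
# each gene subfolder of DIR. Hypothetical helper; not part of bioinfo-tools.
count_fasta_per_gene() {
    for gene_dir in "$1"/*/; do
        [ -d "$gene_dir" ] || continue
        count=0
        for fasta in "$gene_dir"*.fasta; do
            if [ -f "$fasta" ]; then count=$((count + 1)); fi
        done
        printf '%s\t%d\n' "$(basename "$gene_dir")" "$count"
    done
}

# Demo tree with empty placeholder files.
mkdir -p genes_demo/dnaA genes_demo/rpoB genes_demo/recA
touch genes_demo/dnaA/genome1.fasta genes_demo/dnaA/genome2.fasta \
      genes_demo/rpoB/genome1.fasta genes_demo/rpoB/genome2.fasta \
      genes_demo/recA/genome2.fasta
count_fasta_per_gene genes_demo
```

For the demo tree this reports 2 files for dnaA and rpoB and 1 for recA, flagging recA as absent from one genome.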


blast

Run BLAST searches for every combination of query file and database file. Databases are formatted automatically before searching.

Usage

bioinfo-tools blast -q <query_folder> -d <db_folder> -t <db_type> -b <blast_type> -e <evalue>

Required Arguments

  • --query-folder, -q DIR - Folder containing query FASTA files
  • --db-folder, -d DIR - Folder containing database FASTA files
  • --db-type, -t {nucl,prot} - Database type:
    • nucl - Nucleotide database
    • prot - Protein database
  • --blast-type, -b TYPE - BLAST program to use:
    • blastn - Nucleotide vs nucleotide
    • blastp - Protein vs protein
    • blastx - Translated nucleotide vs protein
    • tblastn - Protein vs translated nucleotide
    • tblastx - Translated nucleotide vs translated nucleotide
  • --evalue, -e FLOAT - E-value threshold (e.g., 1e-5, 0.001)

Optional Arguments

  • --output-folder, -o DIR - Output folder for BLAST results (default: blast_outputs)
  • --outfmt INT - BLAST output format (0-11, default: 6 - tabular)

Examples

# Basic nucleotide BLAST
bioinfo-tools blast -q queries/ -d databases/ -t nucl -b blastn -e 1e-5

# Protein BLAST with custom output
bioinfo-tools blast -q protein_queries/ -d protein_dbs/ -t prot -b blastp -e 0.001 -o my_results/

# With long options
bioinfo-tools blast --query-folder queries/ \
                    --db-folder databases/ \
                    --db-type nucl \
                    --blast-type blastn \
                    --evalue 1e-5 \
                    --output-folder results/ \
                    --outfmt 6

Output Format

By default, results are written in BLAST tabular format (outfmt 6), one hit per line:

query_id  subject_id  %identity  alignment_length  mismatches  gap_opens  ...
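In standard BLAST+, outfmt 6 has twelve tab-separated columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore. Assuming bioinfo-tools writes that default layout unchanged, hits can be filtered with awk. The helper name filter_hits and the two demo rows below are made up for illustration.

```shell
# filter_hits FILE MIN_IDENTITY MIN_LENGTH: print rows of a tabular (outfmt 6)
# result whose %identity (column 3) and alignment length (column 4) both meet
# the given thresholds. Hypothetical helper; not part of bioinfo-tools.
filter_hits() {
    awk -F '\t' -v min_id="$2" -v min_len="$3" \
        '$3 + 0 >= min_id + 0 && $4 + 0 >= min_len + 0' "$1"
}

# Made-up rows in the default 12-column layout, for illustration only.
printf 'q1\ts1\t98.50\t200\t3\t0\t1\t200\t5\t204\t1e-100\t360\n' > hits_demo.txt
printf 'q1\ts2\t75.00\t80\t20\t0\t1\t80\t10\t89\t1e-10\t60\n' >> hits_demo.txt

# Keep hits with at least 90% identity over at least 100 aligned positions.
filter_hits hits_demo.txt 90 100
```

Only the first demo row passes these thresholds. If a different --outfmt is used, the column positions no longer apply.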

Output Files

Each query-database combination produces a separate result file:

blast_outputs/
├── query1.fasta_database1.fasta_result.txt
├── query1.fasta_database2.fasta_result.txt
├── query2.fasta_database1.fasta_result.txt
└── query2.fasta_database2.fasta_result.txt
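When collecting results, it can help to recover the query and database names from each file name. The sketch below splits on the first ".fasta_" and therefore assumes that both input files end in .fasta, as in the listing above; other extensions would need a different rule. The helper name split_result_name is hypothetical.

```shell
# split_result_name PATH: print "<query file><TAB><database file>" for a result
# file named <query>.fasta_<database>.fasta_result.txt.
# Hypothetical helper; not part of bioinfo-tools.
split_result_name() {
    name=$(basename "$1")
    stem=${name%_result.txt}          # drop the fixed suffix
    query=${stem%%.fasta_*}.fasta     # text up to the first ".fasta_"
    db=${stem#*.fasta_}               # text after the first ".fasta_"
    printf '%s\t%s\n' "$query" "$db"
}

split_result_name blast_outputs/query1.fasta_database2.fasta_result.txt
```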

Supported Output Formats

  • 0 - Pairwise
  • 1 - Query-anchored showing identities
  • 2 - Query-anchored no identities
  • 3 - Flat query-anchored, show identities
  • 4 - Flat query-anchored, no identities
  • 5 - XML
  • 6 - Tabular (default)
  • 7 - Tabular with comment lines
  • 8 - Text ASN.1
  • 9 - Binary ASN.1
  • 10 - Comma-separated values
  • 11 - BLAST archive format

Common Patterns

Processing Multiple Datasets

# Extract genes from several datasets with the same gene list
for dataset in dataset1 dataset2 dataset3; do
    bioinfo-tools extract-genes -i "${dataset}/genbank/" \
                                -g genes.txt \
                                -o "${dataset}/genes_output/"
done

Pipeline Example

# Complete workflow
# 1. Filter GenBank files
bioinfo-tools extract-cds -i raw_genbank/ -g important_genes.txt -o filtered_gbk/

# 2. Extract protein sequences
bioinfo-tools extract-proteins -i filtered_gbk/ -g important_genes.txt -o proteins/

# 3. Extract nucleotide sequences
bioinfo-tools extract-genes -i filtered_gbk/ -g important_genes.txt -o genes/

# 4. Run BLAST analysis
bioinfo-tools blast -q proteins/dnaA/ -d reference_dbs/ -t prot -b blastp -e 1e-10
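Step 4 searches with the dnaA proteins only. To cover every extracted gene, the same command can be repeated per gene folder. The sketch below is one way to do that, with a hypothetical helper blast_all_genes and a hypothetical DRY_RUN variable that prints the commands instead of running them. Each gene gets its own output folder because result file names are built from the query and database file names, which repeat across genes (genome1.fasta, genome2.fasta, and so on).

```shell
# blast_all_genes PROTEIN_DIR DB_DIR OUT_DIR: run step 4 once per gene folder.
# If DRY_RUN is set to a non-empty value, the commands are printed rather than
# executed. Hypothetical helper; not part of bioinfo-tools.
blast_all_genes() {
    for gene_dir in "$1"/*/; do
        [ -d "$gene_dir" ] || continue
        gene=$(basename "$gene_dir")
        ${DRY_RUN:+echo} bioinfo-tools blast -q "$gene_dir" -d "$2" \
            -t prot -b blastp -e 1e-10 -o "$3/$gene/"
    done
}

# Dry run against a demo tree, so bioinfo-tools itself is not needed here.
mkdir -p pipeline_demo/dnaA pipeline_demo/recA
DRY_RUN=1
blast_all_genes pipeline_demo reference_dbs/ blast_by_gene
```

Unsetting DRY_RUN (or leaving it empty) makes the helper run the real commands.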

Getting Help

# General help
bioinfo-tools --help

# Command-specific help
bioinfo-tools extract-cds --help
bioinfo-tools extract-proteins --help
bioinfo-tools extract-genes --help
bioinfo-tools blast --help

Exit Codes

All commands return standard exit codes:

  • 0 - Success
  • 1 - Error (check log messages)
  • 130 - Interrupted by user (Ctrl+C)
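In scripts, these codes make it possible to stop a pipeline as soon as one step fails. A minimal sketch, assuming a hypothetical wrapper named run_step (not part of bioinfo-tools); the demo uses true and sh -c 'exit 130' as stand-ins for real commands:

```shell
# run_step CMD [ARGS...]: run a command, report its exit status using the codes
# listed above, and return that status to the caller.
# Hypothetical wrapper; not part of bioinfo-tools.
run_step() {
    status=0
    "$@" || status=$?
    case "$status" in
        0)   echo "ok: $1" ;;
        130) echo "interrupted: $1" >&2 ;;
        *)   echo "failed (exit $status): $1" >&2 ;;
    esac
    return "$status"
}

# Intended use in a pipeline:
#   run_step bioinfo-tools extract-cds -i raw_genbank/ -g genes.txt -o filtered_gbk/ || exit 1

# Stand-in commands for demonstration.
run_step true
run_step sh -c 'exit 130' || echo "stopped with status $?"
```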

Logging

All commands log timestamped progress messages in the form <timestamp> - <LEVEL> - <message>:

2024-01-01 12:00:00 - INFO - Gene Extractor started
2024-01-01 12:00:00 - INFO - Input folder: genbank_files/
2024-01-01 12:00:00 - INFO - Loaded 5 genes from genes.txt
2024-01-01 12:00:00 - INFO - Found 10 file(s) in genbank_files/
...
2024-01-01 12:00:05 - INFO - Gene Extractor finished successfully
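Because every line follows the same " - "-separated layout, a saved log can be summarised by level. The sketch below assumes the output was captured to a file (for example with `> run.log 2>&1`; which stream bioinfo-tools logs to is not stated here) and uses a hypothetical helper named log_level_counts. Only INFO appears in the sample above; other levels such as WARNING or ERROR are an assumption and would show up as additional rows if present.

```shell
# log_level_counts FILE: print "<LEVEL><TAB><count>" for each level in a log
# whose lines look like "<date> <time> - <LEVEL> - <message>".
# Lines that do not match that layout are ignored.
# Hypothetical helper; not part of bioinfo-tools.
log_level_counts() {
    awk -F ' - ' 'NF >= 3 { n[$2]++ }
                  END { for (level in n) printf "%s\t%d\n", level, n[level] }' "$1" \
        | sort
}

# Demo log copied from the sample above.
cat > run_demo.log <<'EOF'
2024-01-01 12:00:00 - INFO - Gene Extractor started
2024-01-01 12:00:00 - INFO - Input folder: genbank_files/
2024-01-01 12:00:00 - INFO - Loaded 5 genes from genes.txt
2024-01-01 12:00:00 - INFO - Found 10 file(s) in genbank_files/
...
2024-01-01 12:00:05 - INFO - Gene Extractor finished successfully
EOF
log_level_counts run_demo.log
```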

Environment Variables

bioinfo-tools currently reads no environment variables. All settings are passed as command-line arguments.

See Also