Skip to content

raramayo/Fasta_Seq_Prepare_Bash

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 

Repository files navigation

DOI

Fasta_Seq_Prepare_Bash

alt text

Motivation

The main motivation for generating this script was to be able to pre-process both transcriptome and/or proteome fasta files before subjecting them to computational analysis.

By default the '-d' flag (Dust Gremlings) of the script will remove characters that, if present in the sequence, have the potential to interfere and or abort the processing of the file by certain software packages. This is particularly true for InterProScan, which is extremely sensitive to the presence of the character '*", which can usually be found in ENSEMBL protein sequences. These characters, when found, are replaced by the IUPAC 'X' (unknown).

In addition, the '-r', and '-s' flags, control the sorting of the transcriptome and/or proteome fasta sequences from large to small (-r flag), or from small to large (-s flag).

By default, the '-l' flag of the script removes transcripts that are '<=' than 150 nucleotides, and proteins that are '<=' than 50 amino-acids residues, unless other values are assigned.

If desired, the '-u flag' of the script can also remove sequences that are '>=' to a given size provided.

If activated, the '-f flag' of the script will remove sequences belonging to the following biotypes:

  IG_D_gene,
  IG_J_gene,
  TR_D_gene, and
  TR_J_gene

as defined by the fasta headers of ENSEMBL transcriptome and/or protein sequences. For this flag to work properly, these biotypes must have beeb defined and present in the sequences headers. Without this information, no removal can be performed.

A fasta file can also be split into files containing a given number of sequences (e.g., 50000). This behavior (suppressed by default) is controlled by the '-i' flag. This flag must be used with caution. For example, if you select '10' for a fasta file containing 100,000 sequences, you are requesting the script to generate 10,000 fasta files!. This option is only valid for fasta files containing >= 2 sequences.

A transcriptome and/or proteome file can also be clustered using CD-HIT-EST or CD-HIT, respectively. The clustering parameters can be modified by editing the script directly. By default, the clustering parameters are very stringent (i.e., 100% Identity and 100% Length). This option is controlled by the '-c flag'. This behavior is suppressed by default.

If splitting and clustering of a given fasta file is simultaneously requested, the file will be first clustered and then the resulting clustered file will be split.

The '-x' flag, controls the number of cores used in the clustering and sorting processes. To work in the clustering process CD-HIT multi-threading with OpenMP must be enabled.

Finally, one can use the '-z' flag to control where the TMPDIR will be written.

Documentation

###########################################################################
ARAMAYO_LAB

Copyright (C) 2024 Rodolfo Aramayo

This program is free software: you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the Free
Software Foundation, either version 3 of the License, or (at your option)
any later version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.

You should have received a copy of the GNU General Public License along
with this program. If not, see <https://www.gnu.org/licenses/>.

SCRIPT_NAME:                    Fasta_Seq_Prepare_v1.0.5.sh
SCRIPT_VERSION:                 1.0.5

USAGE: Fasta_Seq_Prepare_v1.0.5.sh
 -p Proteins_Fasta_File.fa      # REQUIRED (Proteins File)
                                  (if '-t' Not Provided)
 -t Transcripts_Fasta_File.fa   # REQUIRED (Transcripts File)
                                  (if '-p' Not Provided)
 -l Sequences Lower Size        # OPTIONAL
                                  (default=50 (proteins) | 150 (transcripts)
 -u Sequences Upper Size        # OPTIONAL (default = No Limit)
 -f Filter Biotypes             # OPTIONAL (default = No)
 -d Dust Gremlings              # OPTIONAL (default = Yes)
                                  (Yes: Converts * to X - proteins)
                                  (Yes: Converts * to n - nucleotides)
 -s Sort From Shorter to Larger # OPTIONAL (default = No)
                                  (Yes: Sequences will be sorted
                                  from shorter to larger)
 -r Sort From Larger to Shorter # OPTIONAL (default = No)
                                  (Yes: Sequences will be sorted
                                  from larger to shorter)
 -c Cluster Sequences           # OPTIONAL (default = No)
                                  (Yes: Requires cd-hit Installed)
 -w Fasta Width                 # OPTIONAL (default = 80)
 -i Split fasta file            # OPTIONAL (default = No = 1)
 -x Number of Cores             # OPTIONAL (default = 2)
 -z TMPDIR Location             # OPTIONAL (default=0='TMPDIR Run')

TYPICAL COMMANDS:
 Fasta_Seq_Prepare_v1.0.5.sh -p Proteins_Fasta_File.fa
 Fasta_Seq_Prepare_v1.0.5.sh -p Proteins_Fasta_File.fa -c yes
 Fasta_Seq_Prepare_v1.0.5.sh -t Transcripts_Fasta_File.fa
 Fasta_Seq_Prepare_v1.0.5.sh -t Transcripts_Fasta_File.fa -c yes

INPUT01:          -p FLAG       REQUIRED input ONLY if the '-t' flag
                                  associated file is not provided
INPUT01_FORMAT:                 Proteome Fasta File
INPUT01_DEFAULT:                No default

INPUT02:          -t FLAG       REQUIRED input ONLY if the '-p' flag
                                  associated file is not provided
INPUT02_FORMAT:                 Transcriptome Fasta File
INPUT02_DEFAULT:                No default

INPUT03:          -l FLAG       OPTIONAL input. Lower Size Filtering Value
INPUT03_FORMAT:                 Numeric
INPUT03_DEFAULT:                50 (proteins)
                                150 (transcripts)
                                1 = No Limit (Do Not Filter)
INPUT03_NOTES:
 If provided, this number will be used to reject sequences whose length are
equal to, or shorter than, the number provided.

  Note that the default value is a function of the class of file provided
(i.e., proteome or transcriptome). Also note that any numerical value will be accepted.

INPUT04:          -u FLAG       OPTIONAL input. Sequence Upper Size Filtering Value
INPUT04_FORMAT:                 Numeric
INPUT04_DEFAULT:                0 = No Limit (Do Not Filter)
INPUT04_NOTES:
 If provided, this number will be used to reject sequences whose length are
equal to, or larger than, the number provided.

 Note that any numerical value will be accepted..

INPUT05:          -f FLAG       OPTIONAL input. Filter Biotype
INPUT05_FORMAT:                 yes | no
INPUT05_DEFAULT:                no (Do Not Filter)
INPUT05_NOTES:
 If Activated, sequences belonging to the following biotypes will be
filtered out: IG_D_gene, IG_J_gene, TR_D_gene, and TR_J_gene.

INPUT06:          -d FLAG       OPTIONAL input. Dust Gremlings
INPUT06_FORMAT:                 yes | no
INPUT06_DEFAULT:                yes (Dust Gremlings)
INPUT06_NOTES:
 Dusting will convert:
  '*' to 'X' (for proteins).
  '*' to 'n' (for transcripts).

INPUT07:          -s FLAG       OPTIONAL input. Sort file
INPUT07_FORMAT:                 yes | no
INPUT07_DEFAULT:                no (Do Not Sort)
INPUT07_NOTES:
 If provided, sequences will be sorted from shorter to larger.

INPUT08:          -r FLAG       OPTIONAL input. Sort file
INPUT08_FORMAT:                 yes | no
INPUT08_DEFAULT:                no (Do Not Sort)
INPUT08_NOTES:
 If provided, sequences will be sorted from larger to shorter.

INPUT09:          -c FLAG       OPTIONAL input. Cluster Sequences
INPUT09_FORMAT:                 yes | no
INPUT09_DEFAULT:                no (Do Not Cluster)
INPUT09_NOTES:
 If provided, sequences will be clustered by cd-hit according to stringent
parameters (i.e., 100% Identity and 100% Length).

INPUT10:          -w FLAG       OPTIONAL input
INPUT10_FORMAT:                 Numeric
INPUT10_DEFAULT:                80
INPUT10_NOTES:
 The number sets the width of the sequence of the output fasta file.

INPUT11:          -i FLAG       OPTIONAL input
INPUT11_FORMAT:                 Numeric
INPUT11_DEFAULT:                1 (Do Not Split)
INPUT11_NOTES:
 Determines the number of fasta sequences requested to be present on
each resulting splitted file.

 This option is only valid for fasta files containing >= 2 sequences.

 Be careful with this flag. If you select '10' for a fasta file containing
100,000 sequences, you are requesting the script to generate 10,000 fasta files!

INPUT12:          -x FLAG       OPTIONAL input
INPUT12_FORMAT:                 Numeric
INPUT12_DEFAULT:                2
INPUT12_NOTES:
 The maximum number of cores requested should be equal to N-1; where N is
the total number of cores available in the computer performing the analysis.

INPUT13:          -z FLAG       OPTIONAL input
INPUT13_FORMAT:                 Numeric: '0' == TMPDIR Run | '1' == Local Run
INPUT13_DEFAULT:                '0' == TMPDIR Run
INPUT13_NOTES:
 '0' Processes the data in the $TMPDIR directory of the computer used or of
the node assigned by the SuperComputer scheduler.

 Processing the data in the $TMPDIR directory of the node assigned by the
SuperComputer scheduler reduces the possibility of file error generation
due to network traffic,

 '1' Processes the data in the same directory where the script is being run.

DEPENDENCIES:
 GNU AWK:       Required (https://www.gnu.org/software/gawk/)
 GNU COREUTILS: Required (https://www.gnu.org/software/coreutils/)
 CD-HIT:        Required if clustering is invoked
                 (https://github.com/weizhongli/cdhit)
                 (PMID: 16731699, and PMID: 23060610)

Author:                            Rodolfo Aramayo
WORK_EMAIL:                        [email protected]
PERSONAL_EMAIL:                    [email protected]

Repository: https://github.com/raramayo/Fasta_Seq_Prepare_Bash
Issues:     https://github.com/raramayo/Fasta_Seq_Prepare_Bash/issues
###########################################################################

Development/Testing Environment:

Distributor ID:       Apple, Inc.
Description:          Apple M1 Max
Release:              14.4.1
Codename:             Sonoma
Distributor ID:       Ubuntu
Description:          Ubuntu 22.04.3 LTS
Release:              22.04
Codename:             jammy

Required Script Dependencies:

Version Number: 5.3.0, API 4.0

GNU Awk 5.3.0, API 4.0
Copyright (C) 1989, 1991-2023 Free Software Foundation.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/.

Version Number: 8.30

(GNU coreutils) 9.4
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Richard M. Stallman and David MacKenzie.

Version Number: 4.8.1

		====== CD-HIT version 4.8.1 (built on Jan 21 2024) ======

Usage: cd-hit [Options]

Options

   -i	input filename in fasta format, required, can be in .gz format
   -o	output filename, required
   -c	sequence identity threshold, default 0.9
 	this is the default cd-hit's "global sequence identity" calculated as:
 	number of identical amino acids or bases in alignment
 	divided by the full length of the shorter sequence
   -G	use global sequence identity, default 1
 	if set to 0, then use local sequence identity, calculated as :
 	number of identical amino acids or bases in alignment
 	divided by the length of the alignment
 	NOTE!!! don't use -G 0 unless you use alignment coverage controls
 	see options -aL, -AL, -aS, -AS
   -b	band_width of alignment, default 20
   -M	memory limit (in MB) for the program, default 800; 0 for unlimitted;
   -T	number of threads, default 1; with 0, all CPUs will be used
   -n	word_length, default 5, see user's guide for choosing it
   -l	length of throw_away_sequences, default 10
   -t	tolerance for redundance, default 2
   -d	length of description in .clstr file, default 20
 	if set to 0, it takes the fasta defline and stops at first space
   -s	length difference cutoff, default 0.0
 	if set to 0.9, the shorter sequences need to be
 	at least 90% length of the representative of the cluster
   -S	length difference cutoff in amino acid, default 999999
 	if set to 60, the length difference between the shorter sequences
 	and the representative of the cluster can not be bigger than 60
   -aL	alignment coverage for the longer sequence, default 0.0
 	if set to 0.9, the alignment must covers 90% of the sequence
   -AL	alignment coverage control for the longer sequence, default 99999999
 	if set to 60, and the length of the sequence is 400,
 	then the alignment must be >= 340 (400-60) residues
   -aS	alignment coverage for the shorter sequence, default 0.0
 	if set to 0.9, the alignment must covers 90% of the sequence
   -AS	alignment coverage control for the shorter sequence, default 99999999
 	if set to 60, and the length of the sequence is 400,
 	then the alignment must be >= 340 (400-60) residues
   -A	minimal alignment coverage control for the both sequences, default 0
 	alignment must cover >= this value for both sequences
   -uL	maximum unmatched percentage for the longer sequence, default 1.0
 	if set to 0.1, the unmatched region (excluding leading and tailing gaps)
 	must not be more than 10% of the sequence
   -uS	maximum unmatched percentage for the shorter sequence, default 1.0
 	if set to 0.1, the unmatched region (excluding leading and tailing gaps)
 	must not be more than 10% of the sequence
   -U	maximum unmatched length, default 99999999
 	if set to 10, the unmatched region (excluding leading and tailing gaps)
 	must not be more than 10 bases
   -B	1 or 0, default 0, by default, sequences are stored in RAM
 	if set to 1, sequence are stored on hard drive
 	!! No longer supported !!
   -p	1 or 0, default 0
 	if set to 1, print alignment overlap in .clstr file
   -g	1 or 0, default 0
 	by cd-hit's default algorithm, a sequence is clustered to the first
 	cluster that meet the threshold (fast cluster). If set to 1, the program
 	will cluster it into the most similar cluster that meet the threshold
 	(accurate but slow mode)
 	but either 1 or 0 won't change the representatives of final clusters
   -sc	sort clusters by size (number of sequences), default 0, output clusters by decreasing length
 	if set to 1, output clusters by decreasing size
   -sf	sort fasta/fastq by cluster size (number of sequences), default 0, no sorting
 	if set to 1, output sequences by decreasing cluster size
 	this can be very slow if the input is in .gz format
   -bak	write backup cluster file (1 or 0, default 0)
   -h	print this help

   Questions, bugs, contact Weizhong Li at [email protected]
   For updated versions and information, please visit: http://cd-hit.org
                                                    or https://github.com/weizhongli/cdhit

   cd-hit web server is also available from http://cd-hit.org

   If you find cd-hit useful, please kindly cite:

   "CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li & Adam Godzik. Bioinformatics, (2006) 22:1658-1659
   "CD-HIT: accelerated for clustering the next generation sequencing data", Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu & Weizhong Li. Bioinformatics, (2012) 28:3150-3152

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages