Procesing-Sequencing-Data

Processing Next-Generation sequencing data is one of the most important tasks, yet it could be one of the most daunting tasks at the same time. For my master’s thesis, I had an outstanding lab mate who helped me to develop the pipeline. Here, the processing means using compressed sequencing data sets to human readable format often using command line arguments on UNIX/LINUX operating systems. Results are comma separated value (.csv) files that can be opened in spreadsheet editor like Microsoft Excel. Based on experience on processing various sequencing data sets, UNIX/LINUX operating system is recommended to process NGS data sets as command line arguments most amounts of customizing options. Historically, bioinformatics analysis tools have been developed and optimized in UNIX/LINUX operating systems. As a Microsoft Windows user, I have found that installation and operation on UNIX/LINUX on Windows via Windows Subsystem for Linux (WSL) did not yield the most efficient way to work with sequencing data though I have to say the overall experience was better than installing and operating UNIX/LINUX on Windows Virtual box as it seems to take lots of random access memories (RAM) thus slowing an overall efficiency of computer. Nowadays, most research focused universities and research institutions have high performance scientific computing infrastructure. For my master’s thesis, I set up my own system to understand better about how a computer works. Therefore, I purchased a mini desktop computer then installed Debian based UNIX/LINUX, Ubuntu 22.04 “Jammy Jellyfish” on an empty solid-state drives (SSD). Not surprisingly, user experience has been far superior for processing sequencing data sets. In addition, setting up my own system yielded "sudo" administrator privileges giving direct ownership of software installation and operation.

First Step of Processing NGS Sequencing Data: Quality Verification and Adapter Sequence Trimming

In the figure, raw reads refer to "Next-generation sequencing (NGS) read length refers to the number of base pairs (bp) sequenced from a DNA fragment" (Source: Illumina). Therefore, raw reads refer to the unprocessed and compressed sequencing data (often .gz format) downloaded from a sequencing process facility. Given data sets for my thesis was bulk RNA-Seq, it was pair ended. To process sequencing data for my thesis, the first step was to verify quality and trimming of adapter sequences. The paired-end Illumina reads were first processed with Trim Galore, a wrapper tool that invokes existing software including Cutadapt and FastQC. As such, Cutadapt and FastQC need to be installed first before using Trim Galore. FastQC is a quality assessment Bioinformatics software for sequencing files while Cutadapt is an adapter sequence trimming Bioinformatics software. With Trim Galore, the quality assessment of reads (FastQC) and the trimming process (Cutadapt) are performed with one line of command. The input parameters for Trim Galore include the autodetection of adapter sequence, the default quality Phred score cutoff (20), default encoding type (ASCII+33), default maximum trimming error rate (0.1), default minimum required adapter overlap stringency (1 bp), and default minimum required sequence length for both reads before removed sequence pair (20 bp) for the trimming.

Sequence alignment is an important process to compare nucleotide or amino acid sequences in bioinformatics. By vertically arranging sequences, regions of similarity are compared and then identified to understand functional, structural, and/or evolutionary relationships between the sequences. The two most utilized methods of sequence alignment are global alignment and local alignment. In the global alignment, the entire length of two sequences is aligned. It is often used when two sequences have the same length. In addition, all differences are counted as well in the global alignment (Source: Microbenotes.com). A well-known example of global alignment in the context of Bioinformatics is the Needleman-Wunsch algorithm. Conversely, the local alignment aligns sequences such that only regions with the highest overlapping of matches are aligned (Source: Microbenotes.com). The Smith–Waterman algorithm uses the local alignment. Both global and local alignments use a scoring matrix to quantify the similarity of sequences.

Cutadapt makes use of a modified version of global alignment as an adapter trimming algorithm: semi-global alignment. In Cutadapt, the semi-global alignment does not penalize initial and/or trailing gaps and it gives sequences freedom to shift and differences are penalized only if there is an overlapping region between them (Source: original paper by Martin & Cutadapt manual). In a conventional alignment scoring system, a positive value is given for matches and a negative value is given for mismatches and indels (insertions and deletions). In Cutadapt, the scoring system uses “unit costs” where mismatches, insertions, and deletions are counted as one error. According to the manual of Cutadapt, this allows an easy implementation of the algorithm because a single parameter (maximum error rate) can be used for counting errors. However, using unit costs to replace the scoring system in the alignment process has its limitations; optimal alignment occurs where total costs are minimized. This alignment based on the unit costs would lead to no overlapping of sequences and adapter sequences would not be identified. The algorithm of the Cutadapt has been refined such that it considers maximum overlap between the two sequences with the condition that the allowed error rate is not exceeded. The adapter alignment algorithm of Cutadapt is a five-step process (Source: original paper by Martin & Cutadapt manual). First, all possible overlaps between two sequences are considered and the alignment is computed to minimize the total number of errors in each one. Second, possible alignments that do not exceed the specified maximum error rate are kept. Third, only the alignments with a maximum number of matches are kept. Fourth, in case multiple alignments exist with the same number of matches, only the alignments with the smallest error rate are kept. Lastly, if there are multiple alignment candidates, then the alignment, which starts at the leftmost position within the read, is chosen. The trimming algorithm in Cutadapt is based on Burrows-Wheeler Transformation (BWT). The BWT is an algorithm that arranges data, often text character strings, such that similar characters are grouped together. Once transformed, the results have the same text character strings arranged in different orders. In other words, the algorithm arranges similar characters next to each other making data compression easier (Source: Technopedia.com). Inversing BWT is a process of recovering original character strings. Cutadapt makes use of inversing BWT and it is executed in three steps. First, the subtraction of the input cutoff from all possible qualities is performed. Second, partial sums from all indices to the end of the sequence are computed. Lastly, the sequence at the index where the sum is minimal is trimmed (Source: original paper by Martin & Cutadapt manual).

Second Step of Processing NGS Sequencing Data: Mapping of the Reads

In RNA-Seq analysis, if a reference genome is available for an organism, then the data analysis is usually performed where short sequencing reads are mapped against an annotated reference genome and this is known as genome mapping (Source: original paper by Conesa et al., 2016). The reference genome is a scientifically accepted genome sequence as a standard to compare sequences of interest for a specific organism (Source: Genome.gov). As Drosophila melanogaster is one of the widely studied model organisms, the reference genome sequence has been available since 2000.

HISAT2 is a fast genome mapping software whose acronym stands for Hierarchical Indexing for Spliced Alignment of Transcripts 2. Splicing is a part of post-transcriptional modification of mRNAs where intronic sequence regions are removed. Exon sequences are retained and spliced together to form the mature mRNA molecule. In the case of alternative splicing, a variety of exon combinations can be generated to produce multiple mRNA types from a single gene or genomic sequence. These mRNAs can produce different protein isoforms or functional variants from the same gene. HISAT2 takes this into account; spliced reads are matched back to the genome (Source: original paper by Finotello & Di Camillo, 2015). The algorithm behind HISAT2 is the Ferragina-Manzini (FM) Index, which combines Burrows-Wheeler Transformation and the suffix-array data structure (Source: FM Index by Ferragina & Venturini). In computer science, indexing is a process of organizing an unordered table into an ordered table (Source: Chartio.com). The suffix arrays are data structures to search substrings of interest within a given larger string and the suffix refers to a substring. The process of fast string searching is key to processing extensive information such as sequencing data (Source: Yale University Computer Science Department). HISAT2 uses suffix arrays when indexing the reference sequence by implementing two types of indexes: a global FM index representing the entire genome and many small FM indexes for regions collectively covering the genome using the graph data structure. In computer science, graph theory describes the relationship between nodes and edges (Source: Baeldung.com). In HISAT2, chemical bases are represented as nodes and their relationships are represented as edges (Source: original paper by Kim et al., 2019). The graph theory is implemented for fast alignment of reads. HISAT2 is a two-step process; it first finds candidate locations across the target genome and then identifies these locations via mapping part of each read with the global FM index. In most cases, at least one candidate is identified. Next, HISAT2 selects one local index for each candidate and utilizes it to align the remainder of the read (Source: original paper by Kim et al., 2015, 2019). The output of HISAT2 is SAM files (Sequence Alignment Map) to store the aligned nucleotide sequences in a tab-delimited configuration. The binary version of SAM files is BAM (Binary Alignment Map). The advantage of BAM files is that file sizes are reduced since they are compressed from SAM files.

Third Step of Processing NGS Sequencing Data: Mapping of the Reads

In the RNA-seq analysis, an important step is counting the number of reads mapping to each genomic feature because the raw aligned BAM output is not sufficient for biological interpretation (Liao et al., 2014). The process of counting reads is known as reads summarization or reads quantification. The featureCounts is a fast reads summarization program by assigning sequence reads to genomic features. The genomic feature is a general term describing any genomic region such as genes and introns/exons. Often, two terms, genomic features (features) and counts are used interchangeably. The featureCounts summarizes counts by counting all reads that overlap any exon for each gene based on a gene annotation (Source: original paper by Liao et al., 2014). The gene annotation includes chromosomal locations of exons for numerous genes (Source: original paper by Chisanga et al). GTF (Gene Transfer Format) is a file format for gene annotations, and it is derived from version 2 of GFF (General Feature Format). The algorithm of featureCounts is based on a four-step process. First, a hash table is generated for the reference sequence names. In computer science, hashing is a process that takes input data with different lengths and produces a fixed length as an output using hash codes. The hash table is a type of data structure to store data by the hashing process (Source: Baeldung.com). The hash table in featureCounts is important because it allows a fast matching between sequence names found in the SAM/BAM file and the GTF annotation file. Second, features in the gene annotation file are sorted by the leftmost base start positions. Third, a hierarchy is created by dividing the sequences of gene annotation files into 128-Kilo base (Kb) bins and the features are assigned into genomic bins based on starting positions. Within each bin, features are equally grouped into blocks. The number of blocks in a bin equals the square root of the number of features in each bin such that the number of features in a block closely equals the number of blocks. The hierarchical data structure is a key element of the algorithm because it speeds up the read assignment. Lastly, the algorithm assigns the reads depending on the level of summarization (feature level or meta-feature level). The summarization at the feature level uses information on exons while the summarization at the meta-feature level uses information on genes (Liao et al., 2014).

References:
Illumina NGS: https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/read-length.html
Local & Global Alignments: https://microbenotes.com/local-global-multiple-sequence-alignment
Burrows Wheeler Transformation: https://www.techopedia.com/definition/10467/burrows-wheeler-transform-bwt
Cutadapt: https://doi.org/10.14806/ej.17.1.200 (Original paper by Martin), https://cutadapt.readthedocs.io/en/stable/algorithms.html (Cutadapt online manual documentation)
Paper by Conesa et al., 2016: https://doi.org/10.1186/s13059-016-0881-8
Reference Sequence: https://www.genome.gov/genetics-glossary/Human-Genome-Reference-Sequence
Paper by Finotello & Di Camillo, 2015: https://doi.org/10.1093/bfgp/elu035
Index: https://chartio.com/learn/databases/how-does-indexing-work/ FM Index by Ferragina & Venturini: https://pages.di.unipi.it/ferragina/Libraries/fmindexV2/index.html
Paper by Kim et al: https://doi.org/10.1038/nmeth.3317, https://doi.org/10.1038/s41587-019-0201-4
Suffix Arrays by Yale University Computer Science Department: https://www.cs.yale.edu/homes/aspnes/pinewiki/SuffixArrays.html
Graph Theory: https://www.baeldung.com/cs/graph-theory-intro
Paper by Liao et al: https://doi.org/10.1093/bioinformatics/btt656 Hash Tables: https://www.baeldung.com/cs/hash-tables
Paper by Chisanga et al: https://doi.org/10.1186/s12859-022-04644-8

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
Processing_Seq_Data.sh		Processing_Seq_Data.sh
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Procesing-Sequencing-Data

First Step of Processing NGS Sequencing Data: Quality Verification and Adapter Sequence Trimming

Second Step of Processing NGS Sequencing Data: Mapping of the Reads

Third Step of Processing NGS Sequencing Data: Mapping of the Reads

About

Releases

Packages

Languages

echo4922/Procesing-Sequencing-Data

Folders and files

Latest commit

History

Repository files navigation

Procesing-Sequencing-Data

First Step of Processing NGS Sequencing Data: Quality Verification and Adapter Sequence Trimming

Second Step of Processing NGS Sequencing Data: Mapping of the Reads

Third Step of Processing NGS Sequencing Data: Mapping of the Reads

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages