Skip to content

Latest commit

 

History

History
212 lines (131 loc) · 7.13 KB

README.md

File metadata and controls

212 lines (131 loc) · 7.13 KB

This course was the first in a two-part series covering some of the algorithms underlying bioinformatics. It covers some of the algorithms underlying the following fundamental topics in bioinformatics:

  • assembling genomes,
  • comparing DNA and protein sequences,
  • finding regulatory motifs,
  • analyzing genome rearrangements,
  • identifying proteins,
  • and many other topics.

About the Course

The sequencing of the human genome fueled a computational revolution in biology. As a result, modern biology produces as many new algorithms as any other fundamental realm of science. Accordingly, the newly formed links between computer science and biology affect the way we teach applied algorithms to computer scientists.

Genome sequencing is just one of hundreds of biological problems that have become inextricable from the computational methods required to solve them. In this course, we will uncover some of the algorithmic ideas that are fundamental to an understanding of modern biology. Computational concepts like dynamic programming and graph theory will help us explore algorithms applied to a wide range of biological topics, from finding regulatory motifs to determining whether the human genome has "fragile" regions. Throughout the process, we will apply bioinformatics algorithms to real genetic data.

Course Syllabus

The course was based on six central questions, with the algorithmic ideas that was used to solve them in parentheses:

  • Where Does DNA Replication Begin? (Algorithmic Warm-up)
  • How Do We Sequence Antibiotics? (Brute Force Algorithms)
  • Which DNA Patterns Act As Cellular Clocks? (Greedy and Randomized Algorithms)
  • How Do We Assemble Genomes? (Graph Algorithms)
  • How Do We Compare Biological Sequences? (Dynamic Programming Algorithms)
  • Are There Fragile Regions in the Human Genome? (Combinatorial Algorithms)

=========================================================

This is the first problem in a collection of "code challenges" to accompany Bioinformatics Algorithms: An Active-Learning Approach by Phillip Compeau & Pavel Pevzner.

A k-mer is a string of length k. We define Count(Text, Pattern) as the number of times that a k-mer Pattern appears as a substring of Text. For example,

Count(ACAACTATGCATACTATCGGGAACTATCCT,ACTAT)=3

We note that Count(CGATATATCCATAG, ATA) is equal to 3 (not 2) since we should account for overlapping occurrences of Pattern in Text.

We say that Pattern is a most frequent k-mer in Text if it maximizes Count(Text, Pattern) among all k-mers. For example, "ACTAT" is a most frequent 5-mer in "ACAACTATGCATCACTATCGGGAACTATCCT", and "ATA" is a most frequent 3-mer of "CGATATATCCATAG".

Frequent Words Problem

Find the most frequent k-mers in a string.

Given: A DNA string Text and an integer k.

Return: All most frequent k-mers in Text (in any order).

Sample Dataset

ACGTTGCATGTCGCATGATGCATGAGAGCT
4

Sample Output

CATG GCAT

======================================================

Given integers L and t, a string Pattern forms an (L, t)-clump inside a (larger) string Genome if there is an interval of Genome of length L in which Pattern appears at least t times. For example, TGCA forms a (25,3)-clump in the following Genome: gatcagcataagggtcccTGCAaTGCAtgacaagccTGCAgttgttttac

Clump Finding Problem

Find patterns forming clumps in a string.

Given: A string Genome, and integers k, L, and t.

Return: All distinct k-mers forming (L, t)-clumps in Genome.

Sample Dataset

CGGACTCGACAGATGTGAAGAAATGTGAAGACTGAGTGAAGAGAAGAGGAAACACGACACGACATTGCGACATAATGTACGAATGTAATGTGCCTATGGC
5 75 4

Sample Output

CGACA GAAGA AATGT

=========================================================

Define the skew of a DNA string Genome, denoted Skew(Genome), as the difference between the total number of occurrences of G and C in Genome. Let Prefixi (Genome) denote the prefix (i.e., initial substring) of Genome of length i. For example, the values of Skew(Prefixi ("CATGGGCATCGGCCATACGCC")) are:

0 -1 -1 -1 0 1 2 1 1 1 0 1 2 1 0 0 0 0 -1 0 -1 -2

Minimum Skew Problem

Find a position in a genome minimizing the skew.

Given: A DNA string Genome.

Return: All integer(s) i minimizing Skew(Prefixi (Text)) over all values of i (from 0 to |Genome|).

Sample Dataset

CCTATCGGTGGATTAGCATGTCCCTGTACGTTTCGCCGCGAACTAGTTCACACGGCTTGATGGCAAATGGTTTTTCCGGCGACCGTAATCGTCCACCGAG

Sample Output

53 97

=========================================================

Recall from that different occurrences of a substring can overlap with each other. For example, ATA occurs three times in CGATATATCCATAG.

Pattern Matching Problem

Find all occurrences of a pattern in a string.

Given: Strings Pattern and Genome.

Return: All starting positions in Genome where Pattern appears as a substring.

Sample Dataset

ATAT
GATATATGCATATACTT

Sample Output

1 3 9

=========================================================

In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'. Given a nucleotide p, we denote its complementary nucleotide as p. The reverse complement of a string Pattern = p1…pn is the string Pattern = pn … p1 formed by taking the complement of each nucleotide in Pattern, then reversing the resulting string.

For example, the reverse complement of Pattern = "GTCA" is Pattern = "TGAC".

Reverse Complement Problem

Find the reverse complement of a DNA string.

Given: A DNA string Pattern.

Return: Pattern, the reverse complement of Pattern.

Sample Dataset

AAAACCCGGT

Sample Output

ACCGGGTTTT

=========================================================

We say that position i in k-mers p1 … pk and q1 … qk is a mismatch if pi ≠ qi. For example, CGAAT and CGGAC have two mismatches.

Approximate Pattern Matching Problem

Find all approximate occurrences of a pattern in a string.

Given: Strings Pattern and Text along with an integer d.

Return: All starting positions where Pattern appears as a substring of Text with at most d mismatches.

Sample Dataset

ATTCTGGA
CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAATGCCTAGCGGCTTGTGGTTTCTCCTACGCTCC
3

Sample Output

6 7 26 27 78

=========================================================