update readme
kchu25 committed Jun 16, 2023
1 parent baa46b4 commit 7223582
Showing 4 changed files with 32 additions and 11 deletions.
43 changes: 32 additions & 11 deletions README.md
@@ -13,37 +13,37 @@ This code repository corresponds to the paper [Finding Motifs Using DNA Images D
- [Motivation](#Motivation)
- [Installation](#Installation)
- [Usage](#Usage)
- [Interpret the results](#Interpret-the-results)
- [Software requirements](#Software-requirements)
- [Hardware requirements](#Hardware-requirements)
- [Adjustable Hyperparameters](#Adjustable-Hyperparameters)
- [Interpret the results](#Interpret-the-results)
- [Cite this work](#Cite-this-work)
- [Contact](#Contact)


## Motivation
Traditional methods such as [STREME](https://meme-suite.org/meme/doc/streme.html) and [HOMER](http://homer.ucsd.edu/homer/motif/) excel at efficiently finding the primary motifs of a transcription factor. This begs the question -- why do we need another motif discovery method?
Traditional methods such as [STREME](https://meme-suite.org/meme/doc/streme.html) and [HOMER](http://homer.ucsd.edu/homer/motif/) excel at efficiently finding the primary motifs of a transcription factor. This raises the question: why do we require an additional motif discovery method?

Because there may be more patterns in the datasets that aren't fully captured. This is even more so in in-vivo datasets such as ChIP-Seq.
Because there may be more patterns in the datasets that aren't fully captured. This is especially evident for context-dependent binding sites, such as C2H2 zinc finger, and cooperative binding patterns observed in in-vivo datasets from ChIP-Seq.

Our work finds that more than half the ChiP-Seq datasets that we've selected from the database [JASPAR 2022](https://jaspar.genereg.net/) containes transposable elements that overlaps the primary binding sites. For instance, see [NFE2L2](https://en.wikipedia.org/wiki/NFE2L2), [YY1](https://en.wikipedia.org/wiki/YY1), [STAT1](https://en.wikipedia.org/wiki/STAT1), [SRF](https://en.wikipedia.org/wiki/Serum_response_factor), [AR](https://en.wikipedia.org/wiki/Androgen_receptor) ([Manuscript Figure 4](https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad378/7192989?utm_source=advanceaccess&utm_campaign=bioinformatics&utm_medium=email)):
Our work reveals that over half of the ChIP-Seq datasets selected from the [JASPAR 2022](https://jaspar.genereg.net/) database contain transposable elements that overlap the primary binding sites. For instance, see [NFE2L2](https://en.wikipedia.org/wiki/NFE2L2), [YY1](https://en.wikipedia.org/wiki/YY1), [STAT1](https://en.wikipedia.org/wiki/STAT1), [SRF](https://en.wikipedia.org/wiki/Serum_response_factor), [AR](https://en.wikipedia.org/wiki/Androgen_receptor) ([Manuscript Figure 4](https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad378/7192989?utm_source=advanceaccess&utm_campaign=bioinformatics&utm_medium=email)):

![image info](./imgs/long_1.png)

These long patterns pose challenges for traditional k-mer-based methods as space complexity is exponential.
These long patterns present challenges for traditional k-mer-based methods due to their exponential space complexity.

Furthermore, many datasets exhibit a large presence of gapped motifs. For example, we found that ChIP-Seq datasets from both [JASPAR](https://jaspar.genereg.net/) and [Factorbook](https://www.factorbook.org/) often contain gapped motifs ([Manuscript Figure 6](https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad378/7192989?utm_source=advanceaccess&utm_campaign=bioinformatics&utm_medium=email)):

![image info](./imgs/gapped.png)

and the spacers that characterize the gapped motifs [can vary widely (Supplementary Material Figure 2)](./imgs/gaps.png).

Last, there may be cooperative binding patterns, e.g., ([Manuscript Figure 5](https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad378/7192989?utm_source=advanceaccess&utm_campaign=bioinformatics&utm_medium=email)):
Last, there are cooperative binding patterns, e.g., ([Manuscript Figure 5](https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad378/7192989?utm_source=advanceaccess&utm_campaign=bioinformatics&utm_medium=email)):

![image info](./imgs/avsec3.png)


for which we see consecutive occurrences of [Oct4](https://en.wikipedia.org/wiki/Oct-4) and cooccurrence of [Oct4](https://en.wikipedia.org/wiki/Oct-4) and [Zic3](https://en.wikipedia.org/wiki/ZIC3), in addition to the Oct4-Sox2 motif. The presence of gapped motifs and cooperative binding patterns presents challenges for k-mer-based methods, as these methods are primarily designed to detect ungapped motifs.
for which we see consecutive occurrences of [Oct4](https://en.wikipedia.org/wiki/Oct-4) and cooccurrence of [Oct4](https://en.wikipedia.org/wiki/Oct-4) and [Zic3](https://en.wikipedia.org/wiki/ZIC3), in addition to the Oct4-Sox2 motif. The presence of gapped motifs and cooperative binding patterns presents challenges for k-mer-based methods as well, as these methods are primarily designed to detect ungapped motifs.


## Installation
@@ -68,10 +68,6 @@ discover_motifs("home/shane/mydata/fasta.fa",
"home/shane/mydata/out/")
````

## Interpret the results
(coming soon)


## Software requirements
This package currently requires [Weblogo](http://weblogo.threeplusone.com/manual.html#download) for PWM plotting. Install Weblogo by running the following command with python3 and pip3:
```bash
@@ -89,6 +85,31 @@ Currently, a GPU is required for this package as it utilizes [CUDA.jl](https://g
discover_motifs(<fasta-path>, <output-folder-path>; num_epochs=10)

````
## Interpret the results

### Summary page
Once the motif discovery process is complete, a summary.html page is generated in the output folder, providing a comprehensive overview of the results.

For instance, here is an example result page showcasing data from the [SP1 transcription factor from JASPAR](https://jaspar.genereg.net/matrix/MA0079.3/):

![image info](./imgs/re_top.png)

The top of the result page shows:
- labels: A number assigned to each discovered motif.
  * Each label is hyperlinked to a text file in TRANSFAC format that can be parsed.
- p-value: The statistical significance of the discovered motif, computed with Fisher's exact test ([Manuscript section 2.7.2](https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad378/7192989?utm_source=advanceaccess&utm_campaign=bioinformatics&utm_medium=email)).
- instances: An estimate of the number of occurrences in the dataset ([Manuscript section 2.7.3](https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad378/7192989?utm_source=advanceaccess&utm_campaign=bioinformatics&utm_medium=email)).
- logo: Position weight matrices.
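
As a minimal sketch of how one of the linked TRANSFAC files might be read back into Julia for downstream analysis, the helper below collects the matrix rows into a positions × 4 array. The function name `read_transfac_counts` is hypothetical and not part of this package. The layout it expects (a `P0`/`PO` header row, one numbered row per position with columns in A, C, G, T order, terminated by `XX` or `//`) is an assumption based on the common TRANSFAC matrix layout, so check it against a file from your own output folder.

```julia
# Minimal TRANSFAC-style matrix reader using Base Julia only.
# Assumed layout (not confirmed by this package): a header row starting with
# "P0" or "PO", then one row per position ("01  a  c  g  t  [consensus]"),
# ending at "XX" or "//".
function read_transfac_counts(io::IO)
    rows = Vector{Vector{Float64}}()
    in_matrix = false
    for line in eachline(io)
        fields = split(line)
        isempty(fields) && continue
        tag = fields[1]
        if tag == "P0" || tag == "PO"
            in_matrix = true               # header row; columns assumed A C G T
        elseif tag == "XX" || tag == "//"
            in_matrix && break             # end of the matrix block
        elseif in_matrix && length(fields) >= 5
            push!(rows, parse.(Float64, fields[2:5]))
        end
    end
    isempty(rows) && error("no matrix rows found")
    return permutedims(reduce(hcat, rows)) # positions × 4 matrix
end

# Convenience method for a file path.
read_transfac_counts(path::AbstractString) = open(read_transfac_counts, path)

# Made-up example record, for illustration only.
example = """
ID motif_1
XX
P0      A      C      G      T
01      0     10      0      0
02      2      0      8      0
XX
//
"""

counts = read_transfac_counts(IOBuffer(example))
freqs = counts ./ sum(counts, dims=2)      # normalize each position to sum to 1
```

Whether the rows hold counts or frequencies depends on the file. As long as no row sums to zero, normalizing each row as above yields a per-position probability matrix in both cases.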


Note that in in-vivo datasets, especially for zinc-finger proteins, a large number of motifs can be observed, often characterized by variable spacings in their binding sites.

![image info](./imgs/re_gap.png)

### Statistically insignificant motifs
Some of the motifs shown here have their p-values in grey, which typically suggests that these motifs are not particularly enriched in the dataset. However, this does not imply that they do not exist in the dataset.

![image info](./imgs/re_high_pval.png)
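
To make the p-value column more concrete, here is a minimal sketch of a one-sided Fisher's exact test on a 2×2 table, written in plain Julia. The function name `fisher_right_tail`, the table layout (sequences with and without the motif in the dataset versus a control set), and the counts are illustrative assumptions. The table this package actually tests is defined in Manuscript section 2.7.2.

```julia
# One-sided (right-tail) Fisher's exact test on the 2×2 table
#
#              with motif   without motif
#   dataset        a              b
#   control        c              d
#
# The layout and the counts used below are assumptions for illustration only.
function fisher_right_tail(a::Integer, b::Integer, c::Integer, d::Integer)
    row1, row2, col1 = a + b, c + d, a + c
    # Number of ways to choose the "with motif" column from all sequences.
    total = binomial(big(row1 + row2), col1)
    # Hypergeometric tail: tables with at least `a` motif hits in the dataset.
    tail = sum(binomial(big(row1), x) * binomial(big(row2), col1 - x)
               for x in a:min(row1, col1))
    return Float64(tail // total)
end

# Made-up counts: 40 of 100 dataset sequences vs. 10 of 100 control sequences.
p = fisher_right_tail(40, 60, 10, 90)
println(p)
```

A small p-value here means the motif appears in the dataset more often than the control counts alone would explain. A large one, like the greyed-out entries above, means the enrichment is weak, not that the motif is absent.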

## Cite this work

Binary file added imgs/re_gap.png
Binary file added imgs/re_high_pval.png
Binary file added imgs/re_top.png