GitHub - kchu25/MOTIFs.jl: DNA Motif discovery that includes the discovery of flexible (long or gapped) motifs.

Finding Motifs Using DNA Images Derived From Sparse Representations

General purpose motif discovery package that includes the discovery of flexible (long or gapped) motifs.

This code repository corresponds to the paper Finding Motifs Using DNA Images Derived From Sparse Representations, published in Bioinformatics (Oxford University Press).

Motivation

Traditional methods such as STREME and HOMER excel at efficiently finding the primary motifs of a transcription factor. This raises the question: why do we require an additional motif discovery method?

The answer is that the datasets may contain additional patterns that these methods do not fully capture. This is especially evident for context-dependent binding sites, such as those of C2H2 zinc finger proteins, and for cooperative binding patterns observed in in-vivo ChIP-Seq datasets.

Our work reveals that over half of the ChIP-Seq datasets selected from the JASPAR 2022 database contain transposable elements that overlap the primary binding sites. For instance, see NFE2L2, YY1, STAT1, SRF, and AR (Manuscript Figure 4).

These long patterns are challenging for traditional k-mer-based methods, whose time and space complexity grows exponentially with the length of the k-mer.

Furthermore, many datasets exhibit a large presence of gapped motifs. For example, we found that ChIP-Seq datasets from both JASPAR and Factorbook often contain gapped motifs (Manuscript Figure 6), and the spacers that characterize these gapped motifs can vary widely (Supplementary Material Figure 2).

Last, there are cooperative binding patterns. For example, in Manuscript Figure 5 we see consecutive occurrences of Oct4 and co-occurrence of Oct4 and Zic3, in addition to the Oct4-Sox2 motif. Gapped motifs and cooperative binding patterns are also challenging for k-mer-based methods, which are primarily designed to detect ungapped motifs.

Installation

To install MOTIFs.jl, use Julia's package manager:

pkg> add MOTIFs

Usage

In Julia:

using MOTIFs

# Do motif discovery on a set of DNA sequences in a fasta file, 
# where the `<fasta-path>` and `<output-folder-path>` are the 
# absolute filepaths as strings.

discover_motifs(<fasta-path>, <output-folder-path>)

# for example (note the leading "/" that makes each path absolute)
discover_motifs("/home/shane/mydata/fasta.fa", 
                "/home/shane/mydata/out/")
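
Since the paths are expected to be absolute, it can help to normalize and check them before starting a run. The following is a minimal sketch using only Julia's standard library. `prepare_paths` is a hypothetical helper and not part of MOTIFs.jl; whether the package creates the output folder itself, or needs the trailing separator, is not stated here, so the sketch simply does both as a precaution.

```julia
# Hypothetical helper (not part of MOTIFs.jl): convert user-supplied paths to
# absolute paths, check the FASTA file, and make sure the output folder exists.
function prepare_paths(fasta_path::AbstractString, output_dir::AbstractString)
    fasta_abs = abspath(fasta_path)
    isfile(fasta_abs) || error("FASTA file not found: $fasta_abs")
    out_abs = abspath(output_dir)
    mkpath(out_abs)  # creates the folder and any missing parents
    # Assumption: keep a trailing separator, as in the example call above.
    return fasta_abs, joinpath(out_abs, "")
end

# Usage, assuming MOTIFs.jl is loaded and a GPU is available:
# fasta, out = prepare_paths("mydata/fasta.fa", "mydata/out")
# discover_motifs(fasta, out)
```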

Software requirements

This package currently requires WebLogo for PWM plotting. Install WebLogo by running the following command with Python 3 and pip:

pip install weblogo

Hardware requirements

Currently, a GPU is required for this package as it utilizes CUDA.jl to accelerate certain computations. However, I plan to implement a CPU extension in the future.

Adjustable Hyperparameters

# The user can adjust the number of epochs for training the network.
discover_motifs(<fasta-path>, <output-folder-path>; num_epochs=10)

Interpret the results

Summary page

Once the motif discovery process is complete, a summary.html page is generated in the output folder, providing a comprehensive overview of the results.

For instance, here is an example result page showcasing data from the SP1 transcription factor from JASPAR:

The top of the result page shows:

Number of sequences: The total number of DNA sequences in the dataset.
Label: A label assigned to each discovered motif.
- Each label is hyperlinked to a text file in TRANSFAC format that can be parsed.
P-value: The statistical significance of the discovered motif, computed with Fisher's exact test (Manuscript section 2.7.2).
# instances: An estimate of the number of occurrences in the dataset (Manuscript section 2.7.3).
Logo: Position weight matrices.
- Press the Reverse complement button to view the logo in the alternative orientation.
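
The TRANSFAC files linked from each label can be read back into Julia for downstream analysis. The sketch below assumes the conventional TRANSFAC matrix layout: a header row starting with `P0` or `PO`, followed by one row per motif position that begins with a position number and then the A, C, G, T values. The exact layout written by MOTIFs.jl is not shown here, so treat this as a starting point; `read_transfac_matrix` is a hypothetical name, not a function of the package.

```julia
# Sketch: read the matrix from a TRANSFAC-style text file (assumed layout,
# see the paragraph above). Not part of MOTIFs.jl.
function read_transfac_matrix(io::IO)
    rows = Vector{Vector{Float64}}()
    for line in eachline(io)
        fields = split(line)
        length(fields) >= 5 || continue      # skip ID, XX, // and blank lines
        all(isdigit, fields[1]) || continue  # keep only numbered position rows
        push!(rows, parse.(Float64, fields[2:5]))
    end
    isempty(rows) && error("no matrix rows found")
    # 4 x L matrix: rows are A, C, G, T and columns are motif positions
    return reduce(hcat, rows)
end

# Convenience method that opens a file path and parses its contents.
read_transfac_matrix(path::AbstractString) = open(read_transfac_matrix, path)
```

Returning a 4 x L matrix keeps each motif position in one column, which is convenient for normalizing counts to per-position frequencies.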

Note that in in-vivo datasets, especially for zinc-finger proteins, a large number of motifs can be observed, often characterized by variable spacings in their binding sites.

Statistically insignificant motifs

Some of the motifs shown here have their p-values displayed in grey, indicating a relatively high p-value (p > 0.01, Fisher's exact test). This result simply suggests that these motifs are not significantly enriched relative to the shuffled DNA strings (Manuscript section 2.7.2); it does not imply that these motifs do not exist in the dataset.
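
To make the enrichment test concrete, here is a minimal sketch of a one-sided Fisher's exact test on a 2x2 table, written with Julia's standard library only. The table layout (real versus shuffled sequences, with versus without the motif) is an assumption for illustration. How the counts are actually formed is described in Manuscript section 2.7.2 and is not reproduced here, and `fisher_enrichment_pvalue` is a hypothetical name, not a function of MOTIFs.jl.

```julia
# One-sided Fisher's exact test for a 2x2 table (illustrative layout):
#                   with motif   without motif
#   real seqs           a             b
#   shuffled seqs       c             d
# Returns P(X >= a) under the hypergeometric null with fixed margins.
function fisher_enrichment_pvalue(a::Int, b::Int, c::Int, d::Int)
    n = a + b + c + d
    # logfact[i + 1] == log(i!)
    logfact = cumsum(vcat(0.0, log.(1:n)))
    logchoose(m, k) = logfact[m + 1] - logfact[k + 1] - logfact[m - k + 1]
    row1, col1 = a + b, a + c
    logdenom = logchoose(n, col1)
    p = 0.0
    for k in a:min(row1, col1)
        # probability of exactly k motif-containing real sequences
        p += exp(logchoose(row1, k) + logchoose(n - row1, col1 - k) - logdenom)
    end
    return min(p, 1.0)  # guard against rounding slightly above 1
end
```

A motif would then be treated as not significantly enriched when the returned value is greater than 0.01, matching the threshold used for the grey p-values.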

Cite this work

You can cite this work using the following BibTeX entry:

@article{chu2023finding,
  title={Finding Motifs Using DNA Images Derived From Sparse Representations},
  author={Chu, Shane K and Stormo, Gary D},
  journal={Bioinformatics},
  pages={btad378},
  year={2023},
  publisher={Oxford University Press}
}

Contact

If you have any questions or suggestions regarding the usage or source code, please feel free to reach out to me at [email protected].
