kchu25/MOTIFs.jl

Finding Motifs Using DNA Images Derived From Sparse Representations

General purpose motif discovery package that includes the discovery of flexible (long or gapped) motifs.

This code repository corresponds to the paper Finding Motifs Using DNA Images Derived From Sparse Representations, which has been published in Oxford Bioinformatics.

Motivation

Traditional methods such as STREME and HOMER excel at efficiently finding the primary motifs of a transcription factor. This raises the question: why do we require an additional motif discovery method?

The answer is that datasets may contain additional patterns that these methods do not fully capture. This is especially evident for context-dependent binding sites, such as those of C2H2 zinc finger proteins, and for cooperative binding patterns observed in in-vivo ChIP-Seq datasets.

Our work reveals that over half of the ChIP-Seq datasets selected from the JASPAR 2022 database contain transposable elements that overlap the primary binding sites. For instance, see NFE2L2, YY1, STAT1, SRF, AR (Manuscript Figure 4):

[Image: Manuscript Figure 4]

These long patterns present challenges for traditional k-mer-based methods, whose time and space complexity grows exponentially with pattern length.

Furthermore, many datasets contain a large number of gapped motifs. For example, we found that ChIP-Seq datasets from both JASPAR and Factorbook often contain gapped motifs (Manuscript Figure 6):

[Image: Manuscript Figure 6]

The spacers that characterize these gapped motifs can vary widely (Supplementary Material Figure 2).

Finally, there are cooperative binding patterns, for example (Manuscript Figure 5):

[Image: Manuscript Figure 5]

Here we see consecutive occurrences of Oct4 and co-occurrence of Oct4 and Zic3, in addition to the Oct4-Sox2 motif. Gapped motifs and cooperative binding patterns also present challenges for k-mer-based methods, as these methods are primarily designed to detect ungapped motifs.

Installation

To install MOTIFs.jl, use Julia's package manager:

pkg> add MOTIFs

Usage

In Julia:

using MOTIFs

# Do motif discovery on a set of DNA sequences in a fasta file, 
# where the `<fasta-path>` and `<output-folder-path>` are the 
# absolute filepaths as strings.

discover_motifs(<fasta-path>, <output-folder-path>)

# for example
discover_motifs("/home/shane/mydata/fasta.fa", 
                "/home/shane/mydata/out/")
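
Since `discover_motifs` expects absolute paths, it can help to normalize them before starting a run. The following is a minimal sketch using only Julia's standard library. The helper `prepare_paths` is hypothetical and not part of MOTIFs.jl, and whether the package creates the output folder itself is not stated here, so the sketch creates it as a precaution.

```julia
# Hypothetical helper (not part of MOTIFs.jl): turn possibly relative paths
# into absolute ones and make sure the output folder exists.
function prepare_paths(fasta_path::AbstractString, output_dir::AbstractString)
    fasta = abspath(expanduser(fasta_path))
    outdir = abspath(expanduser(output_dir))
    isfile(fasta) || error("FASTA file not found: $fasta")
    mkpath(outdir)  # no-op if the folder already exists
    return fasta, outdir
end

# Intended usage (requires MOTIFs.jl and a GPU, so it is left commented out):
# using MOTIFs
# fasta, outdir = prepare_paths("mydata/fasta.fa", "mydata/out")
# discover_motifs(fasta, outdir)
```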

Software requirements

This package currently requires WebLogo for PWM plotting. Install WebLogo by running the following command with python3 and pip3:

pip install weblogo

Hardware requirements

Currently, a GPU is required for this package as it utilizes CUDA.jl to accelerate certain computations. However, I plan to implement a CPU extension in the future.

Adjustable Hyperparameters

# The user can adjust the number of epochs for training the network.
discover_motifs(<fasta-path>, <output-folder-path>; num_epochs=10)

Interpret the results

Summary page

Once the motif discovery process is complete, a summary.html page is generated in the output folder, providing a comprehensive overview of the results.

For instance, here is an example result page showcasing data from the SP1 transcription factor from JASPAR:

[Image: example summary page for the SP1 dataset]

The top of the result page shows:

  • Number of sequences: The total number of DNA sequences in the dataset.
  • Label: A label assigned to each discovered motif.
    • Each label is hyperlinked to a text file in TRANSFAC format that can be parsed.
  • P-value: The statistical significance of the discovered motif using the Fisher exact test (Manuscript section 2.7.2).
  • # instances: An estimate of the number of occurrences in the dataset (Manuscript section 2.7.3).
  • Logo: Position weight matrices.
    • Press the Reverse complement button to view the logo in the alternative orientation.
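
The TRANSFAC files linked from the labels can be read with a few lines of Julia. The sketch below assumes the common TRANSFAC layout: a `PO` (or `P0`) header line followed by one row per position with counts in A, C, G, T order. The exact header fields written by MOTIFs.jl may differ, and the sample text, `parse_transfac`, and `reverse_complement` are illustrative names rather than part of the package. The last function mirrors what the Reverse complement button does to a matrix.

```julia
# Hypothetical sample; the files written by MOTIFs.jl may carry other header lines.
sample = """
ID motif_1
PO  A   C   G   T
01  10  0   0   0
02  0   10  0   0
03  0   0   5   5
XX
//
"""

# Parse the count matrix into a (positions x 4) matrix, columns ordered A, C, G, T.
function parse_transfac(text::AbstractString)
    rows = Vector{Vector{Float64}}()
    in_matrix = false
    for line in split(text, '\n')
        fields = split(line)
        isempty(fields) && continue
        if !in_matrix
            # The matrix starts after the "PO"/"P0" header line
            in_matrix = fields[1] in ("PO", "P0")
        else
            # Matrix rows start with a position number; anything else ends the matrix
            tryparse(Int, fields[1]) === nothing && break
            push!(rows, parse.(Float64, fields[2:5]))
        end
    end
    isempty(rows) && error("no count matrix found")
    return permutedims(reduce(hcat, rows))
end

# Reverse complement: reverse the positions (rows) and swap A<->T, C<->G,
# which for A, C, G, T column order is a reversal of the columns.
reverse_complement(pwm::AbstractMatrix) = pwm[end:-1:1, end:-1:1]

pwm = parse_transfac(sample)
rc = reverse_complement(pwm)
```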

Note that in in-vivo datasets, especially for zinc-finger proteins, a large number of motifs can be observed, often characterized by variable spacings in their binding sites.

[Image: example result page with many discovered motifs]

Statistically insignificant motifs

Some of the motifs shown here have their p-values displayed in grey, indicating a relatively high p-value (p > 0.01, Fisher exact test). This result simply suggests that these motifs are not significantly enriched relative to the shuffled DNA strings (Manuscript section 2.7.2); it does not imply that these motifs do not exist in the dataset.
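
To give a feel for the 0.01 threshold, here is a minimal sketch of a one-sided Fisher exact test on a 2x2 table. The table layout (sequences with and without the motif, in the real and in the shuffled set) is an assumption made for illustration; the test as actually set up by the package is described in Manuscript section 2.7.2. The function name `fisher_enrichment_pvalue` is hypothetical.

```julia
# One-sided Fisher exact test for enrichment.
# Assumed table (illustrative only):
#                 with motif   without motif
#   real              a              b
#   shuffled          c              d
# Returns the probability of seeing at least `a` real sequences with the motif
# when all row and column totals are held fixed.
function fisher_enrichment_pvalue(a::Integer, b::Integer, c::Integer, d::Integer)
    row1, row2, col1 = a + b, c + d, a + c
    # BigInt arithmetic avoids overflow for realistic dataset sizes
    total = binomial(big(row1 + row2), big(col1))
    tail = big(0)
    for k in a:min(row1, col1)
        tail += binomial(big(row1), big(k)) * binomial(big(row2), big(col1 - k))
    end
    return Float64(tail // total)
end

p = fisher_enrichment_pvalue(3, 1, 1, 3)
shown_in_grey = p > 0.01   # threshold taken from the text above
```

With small counts such as these, the p-value stays above 0.01 even though the motif is more frequent in the real sequences, which is one way a genuine motif can end up displayed in grey.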

[Image: motifs with p-values displayed in grey]

Cite this work

You can cite this work using the following BibTeX entry:

@article{chu2023finding,
  title={Finding Motifs Using DNA Images Derived From Sparse Representations},
  author={Chu, Shane K and Stormo, Gary D},
  journal={Bioinformatics},
  pages={btad378},
  year={2023},
  publisher={Oxford University Press}
}

Contact

If you have any questions or suggestions regarding the usage or source code, please feel free to reach out to me at [email protected].