Add Motif tutorial

tnitka · tnitka · commit 8db7a279a504 · 2024-01-30T10:34:48.000-08:00
diff --git a/resources/tutorial/snekmer_motif_tutorial.ipynb b/resources/tutorial/snekmer_motif_tutorial.ipynb
@@ -0,0 +1,213 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "74e99565-d102-4935-a0b2-470bc42ed6d7",
+   "metadata": {},
+   "source": [
+    "\n",
+    "# Snekmer Motif Demo\n",
+    "\n",
+    "<b>Motif</b> is a pipeline for identification of functionally and structurally relevant amino acid motifs using feature selection.\n",
+    "\n",
+    "Motif uses the output of Model and a set of user-supplied protein families to train support vector machines for in- or out-of-family classification and find which kmers are the most informative.\n",
+    "\n",
+    "In this notebook, we will demonstrate how to use Snekmer Motif to find the kmers most indicative of membership in 3 small families.\n",
+    "\n",
+    "\n",
+    "\n",
+    "## Getting started with Motif\n",
+    "\n",
+    "### Setup\n",
+    "\n",
+    "First, install Snekmer using the instructions in the [user installation guide](https://github.com/PNNL-CompBio/Snekmer/).\n",
+    "\n",
+    "Before running Snekmer, verify that your input files are in an **_input_** directory located at the same level as the **_config.yaml_** file. The assumed directory structure is illustrated below.\n",
+    "\n",
+    "    ├── input\n",
+    "    │   ├── A.fasta\n",
+    "    │   ├── B.fasta\n",
+    "    │   ├── C.fasta\n",
+    "    │   ├── D.fasta\n",
+    "    │   ├── etc.\n",
+    "    │   ├── background\n",
+    "    │   │   ├── E.fasta\n",
+    "    │   │   ├── F.fasta\n",
+    "    │   │   ├── G.fasta\n",
+    "    │   │   ├── H.fasta\n",
+    "    │   │   └── etc.\n",
+    "    └── config.yaml\n",
+    "\n",
+    "(Note: Snekmer automatically creates the **_output_** directory when creating output files, so there is no need to create this folder in advance. Additionally, inclusion of background sequences is optional, but is illustrated above for interested users.)\n",
+    "\n",
+    "To ensure that Snekmer is available in the Jupyter notebook, run the following commands:\n",
+    "\n",
+    "    conda activate snekmer\n",
+    "    conda install -c anaconda ipykernel\n",
+    "    python -m ipykernel install --user --name=snekmer\n",
+    "    jupyter notebook\n",
+    "\n",
+    "\n",
+    "\n",
+    "    "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f4ea48af-6abe-469c-a427-949a25d10dfc",
+   "metadata": {},
+   "source": [
+    "### Notes on Using Snekmer\n",
+    "\n",
+    "Snekmer assumes that the user will primarily process input files using the command line. For more detailed instructions, refer to the [README](https://github.com/PNNL-CompBio/Snekmer).\n",
+    "\n",
+    "The basic process for running Snekmer Motif is as follows:\n",
+    "\n",
+    "1. Verify that your file directory structure is correct and that the top-level directory contains a **_config.yaml_** file.\n",
+    "   - A **_config.yaml_** template has been included in the Snekmer codebase at **_resources/config.yaml_**.\n",
+    "2. Modify the **_config.yaml_** with the desired parameters.\n",
+    "3. Use the command line to navigate to the directory containing both the **_config.yaml_** file and **_input_** directory.\n",
+    "4. If some families serve only as background (that is, you do not want to identify functionally relevant kmers from them), run `snekmer model` first, then move the FASTA files containing those families to the `background` directory. You may skip this step if you are performing feature selection on all input families.\n",
+    "5. Run `snekmer motif`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a03d541d-f1e2-4e42-8f24-462d5bad5ba3",
+   "metadata": {},
+   "source": [
+    "# Running Snekmer Motif\n",
+    "\n",
+    "First, install Snekmer using the instructions in the [user installation guide](https://snekmer.readthedocs.io/en/latest/getting_started/install.html).\n",
+    "\n",
+    "To ensure that the tutorial runs correctly, activate the conda environment containing your Snekmer installation and run the notebook from the environment.\n",
+    "\n",
+    "If you haven't yet run the [Snekmer tutorial](https://snekmer.readthedocs.io/en/latest/tutorial/index.html), you'll need to do so now. This runs Motif (and the original three Snekmer modes) on the demo example files and produces all output files. The tutorial uses the included default configuration parameters to guide the analysis, but the user can modify these parameters if a different configuration set is desired. The tutorial command line instructions are copied below:\n",
+    "\n",
+    "    conda activate snekmer\n",
+    "    cd resources/tutorial/demo_example\n",
+    "    ./run_demo.sh\n",
+    "\n",
+    "Finally, we will initialize some parameters and parse filenames for this demo notebook."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7228680b-722a-4cdb-ba2b-181bae3e2a72",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# imports\n",
+    "import glob\n",
+    "import os\n",
+    "import yaml\n",
+    "from itertools import product\n",
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "import matplotlib.pyplot as plt\n",
+    "import seaborn as sns"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cbf3925a-9ade-4b17-b7bb-12f11333f140",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# load config file\n",
+    "with open(os.path.join(\"..\", \"..\", \"resources\", \"config.yaml\"), \"r\") as configfile:\n",
+    "    config = yaml.safe_load(configfile)\n",
+    "\n",
+    "print(config)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a9fecdc8-7644-40ca-bd28-b381afe92a88",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "filenames = sorted(\n",
+    "    [\n",
+    "        fa.removesuffix(\".gz\")  # drop only a literal \".gz\" suffix (Python 3.9+); rstrip strips characters, not suffixes\n",
+    "        for fa, ext in product(\n",
+    "            glob.glob(os.path.join(\"demo_example\", \"input\", \"*\")),\n",
+    "            config[\"input_file_exts\"],\n",
+    "        )\n",
+    "        if fa.removesuffix(\".gz\").endswith(f\".{ext}\")\n",
+    "    ]\n",
+    ")\n",
+    "\n",
+    "families = sorted([os.path.splitext(os.path.basename(f))[0] for f in filenames])\n",
+    "\n",
+    "print(families)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "161dd2e2-b44c-4deb-af13-3aed4cc1899c",
+   "metadata": {},
+   "source": [
+    "## Snekmer Motif output\n",
+    "\n",
+    "Snekmer Motif ranks kmers by p-value and support vector machine (SVM) weight after recursive feature elimination. The p-value is calculated as the number of rescoring iterations in which a kmer's SVM weight exceeded its weight on the correctly labeled input data, divided by the total number of rescoring iterations. Output files containing this data are stored in the **motif**/**p_values** directory. The other subdirectories within the **motif** directory are used for intermediate files that may be used to resume long workflows if execution is interrupted for any reason.\n",
+    "\n",
+    "The primary output file has the file name {input file name}.csv and has 5 columns:\n",
+    "\n",
+    "* **kmer**: The recoded amino acid sequence evaluated as a feature.\n",
+    "* **real score**: The normalized weight learned for the kmer by an SVM trained for one-vs-all classification of the input family against a background of all other input families and any provided background sequences, after performing recursive feature elimination.\n",
+    "* **false positive**: The number of rescoring iterations where an SVM learned a higher weight for a given kmer than that learned on the input data.\n",
+    "* **n**: The number of rescoring iterations performed. This should be the same for each kmer and match the value of **n** contained in config.yaml.\n",
+    "* **p**: The p-value calculated for each kmer.\n",
+    "\n",
+    "The Snekmer tutorial generates 3 output files, one for each family. We'll load and parse one of these:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "aba5c403-f3ca-4978-81f4-efcced6b6810",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# read motif results\n",
+    "results = pd.read_csv(os.path.join(\"demo_example\", \"output\", \"motif\", \"p_values\", \"nxrA.csv\"))\n",
+    "results = results.sort_values(by=\"motif\").reset_index(drop=True)\n",
+    "results[\"motif\"] = results[\"motif\"].astype(str)\n",
+    "results"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c82b8336-3981-475d-9028-376875a0e47b",
+   "metadata": {},
+   "source": [
+    "In the output file, kmers are ordered first by p-value and then by their score on real data; the cell above re-sorts the table by the `motif` column for display."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/snekmer/rules/motif.smk b/snekmer/rules/motif.smk
@@ -1,4 +1,4 @@
-"""model.smk: Module for supervised kmer-based annotation models.
+"""motif.smk: Module for supervised kmer-based annotation models.
 
 author: @tnitka
 

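The p-value definition in the tutorial's "Snekmer Motif output" cell (the number of rescoring iterations in which a kmer's SVM weight exceeded its weight on the correctly labeled data, divided by the total number of rescoring iterations) can be sketched in a few lines. This is only an illustration of that arithmetic, not Snekmer's implementation: the function name `empirical_p_value` and the example scores are hypothetical, and the returned values merely mirror the **false positive**, **n**, and **p** columns described in the notebook.

```python
def empirical_p_value(real_score, rescored):
    """Return (false_positive, n, p) for one kmer.

    real_score: weight learned on the correctly labeled data (hypothetical input).
    rescored:   weights learned in each rescoring iteration (hypothetical input).
    """
    n = len(rescored)
    if n == 0:
        raise ValueError("at least one rescoring iteration is required")
    # Count the iterations whose weight strictly exceeds the real weight.
    false_positive = sum(1 for score in rescored if score > real_score)
    return false_positive, n, false_positive / n


# Made-up numbers for illustration only: 2 of the 5 rescored weights exceed 0.8.
fp, n, p = empirical_p_value(0.8, [0.1, 0.9, 0.3, 0.85, 0.2])
print(fp, n, p)  # 2 5 0.4
```

With a small number of iterations the p-value can only take coarse values (multiples of 1/n), which is one reason the **n** column is reported alongside **p**.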