added documentation in the training notebook

dantaki · Sep 15, 2017 · 8c4336c
1 parent 595d333
Showing 3 changed files with 52 additions and 20 deletions.
diff --git a/README.md b/README.md
@@ -191,4 +191,4 @@ SOFTWARE.
 ## Contact
 :mailbox:
 [email protected]
-:metal:
+:metal:
diff --git a/sv2/training/README.md b/sv2/training/README.md
@@ -4,6 +4,19 @@ This is a guide for users to retrain the default SV<sup>2</sup> classifiers and
 
 Included in the SV<sup>2</sup> source package are the original training set and a jupyter notebook containing instructions for (re)training. 
 
+---
+
+## Methodology
+
+1. Get a training set
+    * [SV<sup>2</sup> Training Set](Training#sv2-training-set)
+    * [Generate your own training features](Training#custom-feature-extraction)
+2. [Train the models](Training#training-svm-classifiers)
+3. [Add your models to SV<sup>2</sup>](Training#adding-new-classifiers-to-sv2)
+4. [Genotype with your model](Training#genotyping-with-new-classifiers)
+
+---
+
 ## SV<sup>2</sup> Training Set
 
 The default training set is packaged with the source package:
@@ -20,6 +33,8 @@ $ ls sv2-VERSION/sv2/training/1kgp_training_data
 ```
 These files can be used for retraining in the [training SVM classifiers section](Training#training-svm-classifiers)
 
+---
+
 ## Custom Feature Extraction
 
 `sv2train` is a script designed for advanced users that wish to train genotyping classifiers with their own data. 
@@ -62,14 +77,17 @@ before training, users have to populate the values in `copy_number`. The expecte
 | 1           | 0 (REF)      |
 | 2           | 1 (DUP:ALT)  |
 
-The companion [jupyter notebook](https://github.com/dantaki/SV2/blob/master/sv2/training/sv2_training.ipynb) encodes genotype labels as copy number for simplicity. This is useful for users that wish to include variants with multiple alleles such as,
+The companion [jupyter notebook](https://github.com/dantaki/SV2/blob/master/sv2/training/sv2_training.ipynb) encodes genotype labels as copy number for simplicity. This is useful for users that wish to include variants with multiple alleles.
 
+#### Examples of multiallelic SVs
 | REF | ALT | Genotype | copy_number |
 | ----| --- | -------- | ----------- | 
 | \<CN1\> | \<CN0\>,\<CN2\>  | 2/2 | 4        |
 | \<CN1\> | \<CN0\>,\<CN2\>  | 1/2 | 2        |
 | \<CN1\> | \<CN2\>,\<CN3\>  | 0/2 | 4        |
 
+---
+
 ## Training SVM Classifiers
 
 The jupyter notebook is located in the source package here: `sv2-VERSION/sv2/training/sv2_training.ipynb`
@@ -84,6 +102,8 @@ The output of the jupyter notebook is a JSON file containing the paths to the tr
 
 **VERY IMPORTANT:bangbang:** do not alter the paths in the JSON file or the pickle files themselves.
 
+---
+
 ## Adding New Classifiers to SV<sup>2</sup>
 
 A JSON file containing paths to classifier models is required to add new classifiers. 
@@ -96,6 +116,8 @@ $ sv2 -load-clf myclf.json
 
 This command appends new classifiers to the SV<sup>2</sup> classifier JSON file located here: `$SV2_INSTALL_LOCATION/sv2/config/sv2_clf.json`
 
+---
+
 ## Genotyping with New Classifiers
 
 After loading the classifiers with the `-load-clf` command, users can specify which model to genotype with the `-clf <classifier-name>` option. 

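Note on the multiallelic table added to `sv2/training/README.md` above: each `copy_number` value in that table equals the sum of the copy numbers of the two alleles named by the genotype. The sketch below reproduces the three rows under that reading. The helper names `allele_copy_numbers` and `genotype_to_copy_number` are hypothetical and are not part of SV<sup>2</sup> or `sv2train`; this is only one plausible way to fill in the `copy_number` column from VCF-style `<CNn>` alleles.

```python
import re


def allele_copy_numbers(ref, alt):
    """Return copy numbers indexed by VCF allele number (0 = REF, 1.. = ALT)."""
    alleles = [ref] + alt.split(",")
    # Each allele is assumed to be a symbolic copy-number allele such as "<CN2>".
    return [int(re.fullmatch(r"<CN(\d+)>", allele).group(1)) for allele in alleles]


def genotype_to_copy_number(ref, alt, genotype):
    """Sum the copy numbers of the alleles named in a diploid genotype string."""
    copy_numbers = allele_copy_numbers(ref, alt)
    # Accept both unphased ("/") and phased ("|") genotype separators.
    return sum(copy_numbers[int(index)] for index in re.split(r"[/|]", genotype))


# The three rows of the "Examples of multiallelic SVs" table.
rows = [
    ("<CN1>", "<CN0>,<CN2>", "2/2"),
    ("<CN1>", "<CN0>,<CN2>", "1/2"),
    ("<CN1>", "<CN2>,<CN3>", "0/2"),
]
for ref, alt, genotype in rows:
    print(ref, alt, genotype, genotype_to_copy_number(ref, alt, genotype))
```

The values produced this way could then be written into the `copy_number` column of a training table before running the notebook, assuming the alleles in the input VCF use the `<CNn>` notation.
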
diff --git a/sv2/training/sv2_training.ipynb b/sv2/training/sv2_training.ipynb
@@ -16,9 +16,9 @@
     "1. [Getting Started](#readme)\n",
     "\n",
     "2. [Input](#input)\n",
-    "\n",
-    "  * [Custom Input](#custominput)\n",
-    "  * [Generate Input with SV<sup>2</sup>Train](#sv2train)\n",
+    "  * Get Your Training Data\n",
+    "     * [SV<sup>2</sup> Training Data](#sv2default)\n",
+    "     * [Generate New Training Data](#sv2train)\n",
     "  * [Copy Number Input](#copynumber)\n",
     "\n",
     "3. [Name Your Classifer](#clf)\n",
@@ -71,8 +71,10 @@
    "metadata": {},
    "source": [
     "### This jupyter notebook is a guide for training the SVM genotyping classifers for SV<sup>2</sup>. <a class=\"anchor\" id=\"readme\"></a>\n",
-    "The following example will train classifers that **versions >= 1.2** of SV<sup>2</sup> uses.\n",
     "\n",
+    "This notebook is currently formatted to train the default classifiers that SV<sup>2</sup> **versions >=1.2** uses.\n",
+    "\n",
+    "The parameters for the classifier are the ones used for the default classifiers. To retrain the models, alter these values.\n",
     "\n",
     "Feel free to edit this document to train your own data. \n",
     "  * Classifiers will be saved as pickle (.pkl) files. \n",
@@ -87,8 +89,25 @@
    "source": [
     "## Input <a class=\"anchor\" id=\"input\"></a>\n",
     "\n",
+    "Get your training data\n",
+    "\n",
+    "### 1. Train using the default training set <a class=\"anchor\" id=\"sv2default\"></a>\n",
+    "\n",
+    "The default training set is located here\n",
+    "\n",
+    "```\n",
+    "$ ls SV2-INSTALL-PATH/sv2/training/1kgp_training_data/*\n",
+    "\n",
+    "    1kgp_highcov_del_gt1kb.txt  \n",
+    "    1kgp_highcov_del_lt1kb.txt  \n",
+    "    1kgp_highcov_del_malesexchrom.txt  \n",
+    "    1kgp_highcov_dup_snv.txt  \n",
+    "    1kgp_lowcov_dup_breakpoint.txt  \n",
+    "    1kgp_lowcov_dup_malesexchrom.txt\n",
+    "\n",
+    "```\n",
     "\n",
-    "### Generate training input with `sv2train` <a class=\"anchor\" id=\"sv2train\"></a>\n",
+    "### 2. Generate training input with `sv2train` <a class=\"anchor\" id=\"sv2train\"></a>\n",
     "\n",
     "```\n",
     "sv2train -i <sample_input.txt> -b <sv.bed ...> -v <sv.vcf ...> \n",
@@ -113,18 +132,6 @@
     "| :--- | ----- | --- | ---- |  --- | ---- | ------- | ------ | --- | ----------- | ---------- |\n",
     "| chr1 | 193646553 | 193654283 | DEL  |\tHG00096\t| 1.145 | 0.0 |\t0.0 | ... | **NA** |deletion_gt1kb |\n",
     " \n",
-    "\n",
-    "#### Header explaination:\n",
-    "  * covr = normalized depth of coverage feature\n",
-    "  * dpe_rat = normalized discordant-paired end feature\n",
-    "  * sr_rat = normalized split-read feature\n",
-    "  * copy_number = copy number/genotype\n",
-    "  * classifier = classifier to train on\n",
-    "\n",
-    "### Custom Input <a class=\"anchor\" id=\"custominput\"></a>\n",
-    "\n",
-    "Advanced users can train SV<sup>2</sup> classifiers with custom training sets. \n",
-    "\n",
     "This jupyter notebook will accept training data in a pandas DataFrame given the following column names\n",
     "\n",
     "| column | description | \n",
@@ -136,6 +143,7 @@
     "| copy_number | [copy number](#copynumber) |\n",
     "\n",
     "### Copy Number Input <a class=\"anchor\" id=\"copynumber\"></a> \n",
+    "\n",
     "sv2train does not produce genotypes, hence copy_number values are not available (**NA**). The user will have to supply genotypes for training custom data sets. \n",
     "\n",
     "The included default SV<sup>2</sup> training sets contain copy_number values\n",
@@ -547,7 +555,9 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": true
+   },
    "outputs": [],
    "source": [
     "CLF_NAME = \"1kgp_lowcov_dup_malesexchrom_svm_clf.pkl\"\n",