added documentation in the training notebook
dantaki committed Sep 15, 2017
1 parent 595d333 commit 8c4336c
Showing 3 changed files with 52 additions and 20 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -191,4 +191,4 @@ SOFTWARE.
## Contact
:mailbox:
[email protected]
:metal:
:metal:
24 changes: 23 additions & 1 deletion sv2/training/README.md
@@ -4,6 +4,19 @@ This is a guide for users to retrain the default SV<sup>2</sup> classifiers and

Included in the SV<sup>2</sup> source package are the original training set and a jupyter notebook containing instructions for (re)training.

---

## Methodology

1. Get a training set
* [SV<sup>2</sup> Training Set](Training#sv2-training-set)
* [Generate your own training features](Training#custom-feature-extraction)
2. [Train the models](Training#training-svm-classifiers)
3. [Add your models to SV<sup>2</sup>](Training#adding-new-classifiers-to-sv2)
4. [Genotype with your model](Training#genotyping-with-new-classifiers)

---

## SV<sup>2</sup> Training Set

The default training set is packaged with the source package:
@@ -20,6 +33,8 @@ $ ls sv2-VERSION/sv2/training/1kgp_training_data
```
These files can be used for retraining in the [training SVM classifiers section](Training#training-svm-classifiers)

---

## Custom Feature Extraction

`sv2train` is a script designed for advanced users that wish to train genotyping classifiers with their own data.
@@ -62,14 +77,17 @@ before training, users have to populate the values in `copy_number`. The expecte
| 1 | 0 (REF) |
| 2 | 1 (DUP:ALT) |

The companion [jupyter notebook](https://github.com/dantaki/SV2/blob/master/sv2/training/sv2_training.ipynb) encodes genotype labels as copy number for simplicity. This is useful for users that wish to include variants with multiple alleles such as,
The companion [jupyter notebook](https://github.com/dantaki/SV2/blob/master/sv2/training/sv2_training.ipynb) encodes genotype labels as copy number for simplicity. This is useful for users that wish to include variants with multiple alleles.

#### Examples of multiallelic SVs
| REF | ALT | Genotype | copy_number |
| ----| --- | -------- | ----------- |
| \<CN1\> | \<CN0\>,\<CN2\> | 2/2 | 4 |
| \<CN1\> | \<CN0\>,\<CN2\> | 1/2 | 2 |
| \<CN1\> | \<CN2\>,\<CN3\> | 0/2 | 4 |
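
The rows above are consistent with `copy_number` being the sum of the copy numbers of the two alleles named in the genotype. The following is a minimal sketch of that reading, not the notebook's own code: the function names are hypothetical, and the summing rule is an assumption inferred from the three example rows.

```python
import re


def allele_copy_numbers(ref, alt):
    """Return the copy number of each allele, indexed as in the VCF GT field.

    Index 0 is REF; indices 1..n are the comma-separated ALT alleles.
    Only symbolic <CNx> alleles are handled in this sketch.
    """
    copy_numbers = []
    for allele in [ref] + alt.split(","):
        match = re.fullmatch(r"<CN(\d+)>", allele)
        if match is None:
            raise ValueError("not a <CNx> allele: %r" % allele)
        copy_numbers.append(int(match.group(1)))
    return copy_numbers


def genotype_to_copy_number(ref, alt, genotype):
    """Sum the copy numbers of the alleles listed in the genotype (assumed rule)."""
    copy_numbers = allele_copy_numbers(ref, alt)
    # Genotypes may be unphased ("1/2") or phased ("1|2").
    return sum(copy_numbers[int(index)] for index in re.split(r"[/|]", genotype))


# Reproduce the three rows of the table above.
for ref, alt, genotype in [
    ("<CN1>", "<CN0>,<CN2>", "2/2"),
    ("<CN1>", "<CN0>,<CN2>", "1/2"),
    ("<CN1>", "<CN2>,<CN3>", "0/2"),
]:
    print(ref, alt, genotype, genotype_to_copy_number(ref, alt, genotype))
```

Under this assumption a phased genotype such as `1|2` yields the same value as `1/2`, since only the total number of copies matters for the label.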

---

## Training SVM Classifiers

The jupyter notebook is located in the source package here: `sv2-VERSION/sv2/training/sv2_training.ipynb`
@@ -84,6 +102,8 @@ The output of the jupyter notebook is a JSON file containing the paths to the tr

**VERY IMPORTANT:bangbang:** do not alter the paths in the JSON file or the pickle files themselves.

---

## Adding New Classifiers to SV<sup>2</sup>

A JSON file containing paths to classifier models is required to add new classifiers.
@@ -96,6 +116,8 @@ $ sv2 -load-clf myclf.json

This command appends new classifiers to the SV<sup>2</sup> classifier JSON file located here: `$SV2_INSTALL_LOCATION/sv2/config/sv2_clf.json`

---

## Genotyping with New Classifiers

After loading the classifiers with the `-load-clf` command, users can specify which model to genotype with the `-clf <classifier-name>` option.
46 changes: 28 additions & 18 deletions sv2/training/sv2_training.ipynb
@@ -16,9 +16,9 @@
"1. [Getting Started](#readme)\n",
"\n",
"2. [Input](#input)\n",
"\n",
" * [Custom Input](#custominput)\n",
" * [Generate Input with SV<sup>2</sup>Train](#sv2train)\n",
" * Get Your Training Data\n",
" * [SV<sup>2</sup> Training Data](#sv2default)\n",
" * [Generate New Training Data](#sv2train)\n",
" * [Copy Number Input](#copynumber)\n",
"\n",
"3. [Name Your Classifer](#clf)\n",
@@ -71,8 +71,10 @@
"metadata": {},
"source": [
"### This jupyter notebook is a guide for training the SVM genotyping classifers for SV<sup>2</sup>. <a class=\"anchor\" id=\"readme\"></a>\n",
"The following example will train classifers that **versions >= 1.2** of SV<sup>2</sup> uses.\n",
"\n",
"This notebook is currently formatted to train the default classifiers that SV<sup>2</sup> **versions >=1.2** uses.\n",
"\n",
"The parameters for the classifier are the ones used for the default classifiers. To retrain the models, alter these values.\n",
"\n",
"Feel free to edit this document to train your own data. \n",
" * Classifiers will be saved as pickle (.pkl) files. \n",
@@ -87,8 +89,25 @@
"source": [
"## Input <a class=\"anchor\" id=\"input\"></a>\n",
"\n",
"Get your training data\n",
"\n",
"### 1. Train using the default training set <a class=\"anchor\" id=\"sv2default\"></a>\n",
"\n",
"The default training set is located here\n",
"\n",
"```\n",
"$ ls SV2-INSTALL-PATH/sv2/training/1kgp_training_data/*\n",
"\n",
" 1kgp_highcov_del_gt1kb.txt \n",
" 1kgp_highcov_del_lt1kb.txt \n",
" 1kgp_highcov_del_malesexchrom.txt \n",
" 1kgp_highcov_dup_snv.txt \n",
" 1kgp_lowcov_dup_breakpoint.txt \n",
" 1kgp_lowcov_dup_malesexchrom.txt\n",
"\n",
"```\n",
"\n",
"### Generate training input with `sv2train` <a class=\"anchor\" id=\"sv2train\"></a>\n",
"### 2. Generate training input with `sv2train` <a class=\"anchor\" id=\"sv2train\"></a>\n",
"\n",
"```\n",
"sv2train -i <sample_input.txt> -b <sv.bed ...> -v <sv.vcf ...> \n",
@@ -113,18 +132,6 @@
"| :--- | ----- | --- | ---- | --- | ---- | ------- | ------ | --- | ----------- | ---------- |\n",
"| chr1 | 193646553 | 193654283 | DEL |\tHG00096\t| 1.145 | 0.0 |\t0.0 | ... | **NA** |deletion_gt1kb |\n",
" \n",
"\n",
"#### Header explaination:\n",
" * covr = normalized depth of coverage feature\n",
" * dpe_rat = normalized discordant-paired end feature\n",
" * sr_rat = normalized split-read feature\n",
" * copy_number = copy number/genotype\n",
" * classifier = classifier to train on\n",
"\n",
"### Custom Input <a class=\"anchor\" id=\"custominput\"></a>\n",
"\n",
"Advanced users can train SV<sup>2</sup> classifiers with custom training sets. \n",
"\n",
"This jupyter notebook will accept training data in a pandas DataFrame given the following column names\n",
"\n",
"| column | description | \n",
@@ -136,6 +143,7 @@
"| copy_number | [copy number](#copynumber) |\n",
"\n",
"### Copy Number Input <a class=\"anchor\" id=\"copynumber\"></a> \n",
"\n",
"sv2train does not produce genotypes, hence copy_number values are not available (**NA**). The user will have to supply genotypes for training custom data sets. \n",
"\n",
"The included default SV<sup>2</sup> training sets contain copy_number values\n",
@@ -547,7 +555,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"CLF_NAME = \"1kgp_lowcov_dup_malesexchrom_svm_clf.pkl\"\n",