Update docs to include motif
tnitka committed Feb 1, 2024
1 parent 8db7a27 commit 22f0a02
Showing 5 changed files with 214 additions and 16 deletions.
6 changes: 3 additions & 3 deletions docs/source/getting_started/cli.rst
@@ -15,7 +15,7 @@ For an overview of Snekmer usage, reference the help command (``snekmer --help``
.. code-block:: console
$ snekmer --help
usage: snekmer [-h] [-v] {cluster,model,search,learn,apply} ...
usage: snekmer [-h] [-v] {cluster,model,search,learn,apply,motif} ...
Snekmer: A tool for kmer-based sequence analysis using amino acid reduction (AAR)
@@ -26,7 +26,7 @@ For an overview of Snekmer usage, reference the help command (``snekmer --help``
mode:
Snekmer mode
{cluster,model,search,learn,apply}
{cluster,model,search,learn,apply,motif}
Tailored references for the individual operation modes can be accessed
via ``snekmer {mode} --help``.
@@ -49,7 +49,7 @@ files. Snekmer also assumes background files, if any, are stored in
is shown below:


Snekmer ``cluster``, ``model``, and ``search`` input
Snekmer ``cluster``, ``model``, ``search``, and ``motif`` input

.. code-block:: console
9 changes: 9 additions & 0 deletions docs/source/getting_started/config.rst
@@ -131,3 +131,12 @@ General parameters related to Snekmer's learn and apply mode (``snekmer learn``,
``seed`` ``int`` Choose any (random) seed for reproducible fragmentation.
============================= ===================== =========================================================================


Motif Parameters
````````````````
The following parameters are required for Snekmer's motif mode (``snekmer motif``), wherein feature selection is performed to find functionally relevant kmers.

======================== ===================== ==================================================================================
Parameter Type Description
======================== ===================== ==================================================================================
``n`` ``int`` Number of label permutation and rescoring iterations to run for each input family.
27 changes: 27 additions & 0 deletions docs/source/getting_started/usage.rst
@@ -233,3 +233,30 @@ and directories in addition to the files described previously.
│ │ ├── Seq-Annotation-Scores-D.csv # (optional) Sequence-annotation cosine similarity scores for D seqs
│ │ ├── kmer-summary-C.csv # Results with annotation predictions and confidence for C seqs
│ │ └── kmer-summary-D.csv # Results with annotation predictions and confidence for D seqs
Snekmer Motif Output Files
::::::::::::::::::::::::::

Snekmer's motif mode produces the following output files and directories in addition to the files described previously.

.. code-block:: console
.
├── output/
│ ├── ...
│ ├── motif/
│ │ ├── kmers/
│ │ │ ├── A.csv # kmers retained for A after recursive feature elimination
│ │ │ ├── B.csv # kmers retained for B after recursive feature elimination
│ │ ├── preselection/
│ │ │ ├── A.csv # kmer weights learned for A after recursive feature elimination
│ │ │ ├── B.csv # kmer weights learned for B after recursive feature elimination
│ │ ├── sequences/
│ │ │ ├── A.csv # Sequence vectors for A using the kmer subset retained after recursive feature elimination
│ │ │ ├── B.csv # Sequence vectors for B using the kmer subset retained after recursive feature elimination
│ │ ├── scores/
│ │ │ ├── A.csv # kmer weight learned for A on each permute/rescore iteration
│ │ │ ├── B.csv # kmer weight learned for B on each permute/rescore iteration
│ │ ├── p_values/
│ │ │ ├── A.csv # Tabulated results for A
│ │ │ └── B.csv # Tabulated results for B
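
The tabulated results under ``p_values/`` can be filtered for kmers that pass a significance cutoff. The following is a minimal sketch, not part of Snekmer itself: it assumes the CSV columns match those displayed in the motif tutorial notebook (``kmer``, ``real score``, ``false positives``, ``n``, ``p``), and the helper name ``significant_kmers`` and the ``alpha`` cutoff are hypothetical. The two sample rows are copied from the tutorial output for ``nxrA``.

```python
import csv
import io


def significant_kmers(csv_file, alpha=0.05):
    """Return the kmers whose p-value is at or below alpha.

    Assumes a header row containing at least the columns "kmer" and "p";
    any additional columns (e.g. an index column) are ignored.
    """
    reader = csv.DictReader(csv_file)
    return [row["kmer"] for row in reader if float(row["p"]) <= alpha]


# Two rows copied from the tutorial notebook output, held in memory
# so the example does not depend on files on disk.
sample = io.StringIO(
    "kmer,real score,false positives,n,p\n"
    "VSVSSVSVVSVSSV,1.000000,0,2000,0.0000\n"
    "SSVVSVSSSVSVVS,-0.051628,1988,2000,0.9940\n"
)
hits = significant_kmers(sample)
print(hits)
```

To run this against real output, pass an open file handle instead, e.g. ``open("output/motif/p_values/A.csv", newline="")``. Note that a reported ``p`` of 0 with ``n`` iterations only indicates that no permutation matched the real score, so the resolution of the estimate is limited to 1/``n``.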
4 changes: 3 additions & 1 deletion docs/source/index.rst
@@ -19,7 +19,7 @@ sequences to predict the nearest annotation and generate a confidence score.
:width: 700
:alt: Snekmer workflow overview

There are 5 operation modes for Snekmer: ``cluster``, ``model``, ``search``, ``learn``, and ``apply``.
There are 6 operation modes for Snekmer: ``cluster``, ``model``, ``search``, ``motif``, ``learn``, and ``apply``.

**Cluster mode:** The user supplies files containing sequences in an appropriate format (e.g. FASTA).
Snekmer applies the relevant workflow steps and outputs the resulting clustering results in tabular form (.CSV),
@@ -34,6 +34,8 @@ displays K-fold cross validation results in the form of figures (AUC ROC and PR
and the models they wish to search their sequences against. Snekmer applies the relevant workflow steps
and outputs a table for each file containing model annotation probabilities for the given sequences.

**Motif mode:** The user supplies files containing sequences in an appropriate format (e.g. FASTA). Snekmer applies the relevant workflow steps and outputs a table (.csv) for each family, which shows the SVM weight and associated p-value for each kmer.
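
The relationship between the columns of the motif output table can be sketched as a permutation test. This is an illustrative sketch under stated assumptions, not Snekmer's implementation: it assumes a "false positive" is a permuted-label score that is greater than or equal to the real score, and that the p-value is the false-positive count divided by the number of iterations ``n`` (consistent with the tutorial output, where 1988 false positives over 2000 iterations gives p = 0.9940). The function name ``permutation_p_value`` is hypothetical, and the null scores are synthetic.

```python
import random


def permutation_p_value(real_score, permuted_scores):
    """Return (false_positives, n, p) for a single kmer.

    Assumption: a permuted score counts as a false positive when it is
    greater than or equal to the score obtained with the true labels.
    """
    n = len(permuted_scores)
    false_positives = sum(1 for score in permuted_scores if score >= real_score)
    return false_positives, n, false_positives / n


# Synthetic null distribution standing in for scores from label-permuted runs.
rng = random.Random(0)
null_scores = [rng.uniform(-0.5, 0.5) for _ in range(2000)]

# A real score above every null score yields zero false positives.
fp, n, p = permutation_p_value(1.0, null_scores)
print(fp, n, p)
```

Under this reading, increasing the motif parameter ``n`` does not change the expected p-value but gives it finer resolution, since the smallest nonzero value it can take is 1/``n``.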


**Learn mode:** The user supplies files containing sequences in an appropriate format (e.g. FASTA) as well as an annotation file. Snekmer generates a kmer counts matrix with the summed kmer distribution of each annotation recognized from the sequence ID. Snekmer then performs a self-evaluation to assess confidence levels. There are two outputs, a counts matrix, and a global confidence distribution.

184 changes: 172 additions & 12 deletions resources/tutorial/snekmer_motif_tutorial.ipynb
@@ -84,16 +84,18 @@
"\n",
"If you haven't yet run the [Snekmer tutorial](https://snekmer.readthedocs.io/en/latest/tutorial/index.html), you'll need to do so now. This runs Motif (and the original three Snekmer modes) on the demo example files and produces all output files. The tutorial uses the included default configuration parameters to guide the analysis, but the user can modify these parameters if a different configuration set is desired. The tutorial command line instructions are copied below:\n",
"\n",
"'''bash\n",
"conda activate snekmer\n",
" cd resources/tutorial/demo_example\n",
" ./run_demo.sh\n",
"```\n",
"\n",
"Finally, we will initialize some parameters and parse filenames for this demo notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"id": "7228680b-722a-4cdb-ba2b-181bae3e2a72",
"metadata": {},
"outputs": [],
@@ -111,10 +113,18 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 4,
"id": "cbf3925a-9ade-4b17-b7bb-12f11333f140",
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'k': 14, 'alphabet': 0, 'input_file_exts': ['fasta', 'fna', 'faa', 'fa'], 'input_file_regex': '.*', 'nested_output': False, 'score': {'scaler': True, 'scaler_kwargs': {'n': 0.25}, 'labels': 'None', 'lname': 'None'}, 'cluster': {'method': 'agglomerative-jaccard', 'params': {'n_clusters': 'None', 'linkage': 'average', 'distance_threshold': 0.92, 'compute_full_tree': True}, 'cluster_plots': False, 'min_rep': None, 'max_rep': None, 'save_matrix': True, 'dist_thresh': 100}, 'model': {'cv': 5, 'random_state': 'None'}, 'model_dir': 'output/model/', 'basis_dir': 'output/kmerize/', 'score_dir': 'output/score/', 'motif': {'n': 2000}}\n"
]
}
],
"source": [
"# load config file\n",
"with open(os.path.join(\"..\", \"..\", \"resources\", \"config.yaml\"), \"r\") as configfile:\n",
@@ -125,10 +135,18 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 5,
"id": "a9fecdc8-7644-40ca-bd28-b381afe92a88",
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['TIGR03149', 'nxrA']\n"
]
}
],
"source": [
"filenames = sorted(\n",
" [\n",
@@ -168,15 +186,157 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 4,
"id": "aba5c403-f3ca-4978-81f4-efcced6b6810",
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>kmer</th>\n",
" <th>real score</th>\n",
" <th>false positives</th>\n",
" <th>n</th>\n",
" <th>p</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>VSVSSVSVVSVSSV</td>\n",
" <td>1.000000</td>\n",
" <td>0</td>\n",
" <td>2000</td>\n",
" <td>0.0000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>SSSVSSSSSSSSSS</td>\n",
" <td>0.918053</td>\n",
" <td>0</td>\n",
" <td>2000</td>\n",
" <td>0.0000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>SSSVSVSSSVSVSV</td>\n",
" <td>0.905275</td>\n",
" <td>0</td>\n",
" <td>2000</td>\n",
" <td>0.0000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>VSSSSVSSVSSSVS</td>\n",
" <td>0.905275</td>\n",
" <td>0</td>\n",
" <td>2000</td>\n",
" <td>0.0000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>SSSSVSSVSSSVSV</td>\n",
" <td>0.905275</td>\n",
" <td>0</td>\n",
" <td>2000</td>\n",
" <td>0.0000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1488</th>\n",
" <td>SSVVSVSSSVSVVS</td>\n",
" <td>-0.051628</td>\n",
" <td>1988</td>\n",
" <td>2000</td>\n",
" <td>0.9940</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1489</th>\n",
" <td>VVVSSVSSSVVVVS</td>\n",
" <td>-0.198998</td>\n",
" <td>1991</td>\n",
" <td>2000</td>\n",
" <td>0.9955</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1490</th>\n",
" <td>SVVSVSSSVSVVSV</td>\n",
" <td>-0.101570</td>\n",
" <td>2000</td>\n",
" <td>2000</td>\n",
" <td>1.0000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1491</th>\n",
" <td>VSSSSSSSSSVSVV</td>\n",
" <td>-0.183426</td>\n",
" <td>2000</td>\n",
" <td>2000</td>\n",
" <td>1.0000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1492</th>\n",
" <td>SSSSSSSSSVSVVS</td>\n",
" <td>-0.183426</td>\n",
" <td>2000</td>\n",
" <td>2000</td>\n",
" <td>1.0000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1493 rows × 5 columns</p>\n",
"</div>"
],
"text/plain": [
" kmer real score false positives n p\n",
"0 VSVSSVSVVSVSSV 1.000000 0 2000 0.0000\n",
"1 SSSVSSSSSSSSSS 0.918053 0 2000 0.0000\n",
"2 SSSVSVSSSVSVSV 0.905275 0 2000 0.0000\n",
"3 VSSSSVSSVSSSVS 0.905275 0 2000 0.0000\n",
"4 SSSSVSSVSSSVSV 0.905275 0 2000 0.0000\n",
"... ... ... ... ... ...\n",
"1488 SSVVSVSSSVSVVS -0.051628 1988 2000 0.9940\n",
"1489 VVVSSVSSSVVVVS -0.198998 1991 2000 0.9955\n",
"1490 SVVSVSSSVSVVSV -0.101570 2000 2000 1.0000\n",
"1491 VSSSSSSSSSVSVV -0.183426 2000 2000 1.0000\n",
"1492 SSSSSSSSSVSVVS -0.183426 2000 2000 1.0000\n",
"\n",
"[1493 rows x 5 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# read motif results\n",
"results = pd.read_csv(os.path.join(\"demo_example\", \"output\", \"motif\", \"p_values\", \"nxrA.csv\"))\n",
"results = results.sort_values(by=\"motif\").reset_index(drop=True)\n",
"results[\"motif\"] = results[\"motif\"].astype(str)\n",
"results"
]
},
@@ -191,9 +351,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"display_name": "snekmer",
"language": "python",
"name": "python3"
"name": "snekmer"
},
"language_info": {
"codemirror_mode": {
@@ -205,7 +365,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.1"
"version": "3.10.5"
}
},
"nbformat": 4,