Functionmotifs #116
Merged
Changes from 245 commits
Commits
258 commits
a5ddb35
Fix wildcard error in motif.smk
tnitka b1e5c5d
Intentionally break motif before current error to check that changes …
tnitka 9b0d4c7
Fix most motif.smk errors, including intentional break
tnitka 508a01b
Fix wildcard and key errors
tnitka 9d91602
Fix undefined name in motif script
tnitka 83b90b7
Fix input error in motif
tnitka eca87b3
Fix logging error in motif script
tnitka b26a173
Fix input error in motif script
tnitka 5b17a81
Fix logging error in motif script
tnitka 80bf8d0
Fix input error in motif script
tnitka c10b240
Fix input error in motif script
tnitka 6aacb30
Fix input error in motif script
tnitka fcf8a2d
Fix compressed input handling in motif script
tnitka 16e1256
Fix compressed input handling in motif script
tnitka 5278048
Change input score source in motif script to the score matrix instead…
tnitka 2e312d3
Implement bad debugging practice to be reverted in next commit
tnitka cbafaff
Fix input handling and remove bad debugging practice
tnitka 7f8691e
Fix compressed input handling in motif script
tnitka 381a68b
Fix conflict between dataframe and ndarray usage in motif_motif.py
tnitka 77ea592
Use common basis for all proteins in motif
tnitka d1f07d2
Fix input error in motif snakefile
tnitka 7096a71
Fix input error in motif snakefile
tnitka 00438e7
Fix input error in motif script
tnitka 11e7d1d
Fix input processing error in motif
tnitka 43efb71
Fix input processing error in motif
tnitka 0422444
Fix input processing error in motif
tnitka 6e37b51
Fix scoring error in motif script
tnitka 4ef303f
Fix error in motif scoring
tnitka 2028438
Fix error in motif scoring
tnitka cd5951b
Fix error in motif scoring
tnitka 52086fa
Fix type error in motif script
tnitka 9774b51
Fix matrix shape error)
tnitka 3415f4b
Fix matrix shape error in motif script
tnitka ec3cc33
Fix input handling error in motif script
tnitka 0bd376f
Fix input handling error in motif script
tnitka 8b8a9b1
Fix input handling error in motif script
tnitka 308ddac
Correct input matrix column used for family labels
tnitka f9914d2
Correct input matrix column used for family labels
tnitka cbfd862
Fix input handling error in motif script
tnitka 39c7ed6
Fix input handling error in motif script
tnitka 74c197b
Fix input handling error in motif script
tnitka efb6477
Fix class error in motif script
tnitka e757935
Fix class error in motif script
tnitka 6ea40eb
Fix class error in motif script
tnitka bb77727
Fix class error in motif script
tnitka af8d11d
Fix issue with DataFrame not being converted to ndarray
tnitka 43168ff
Fix issue with DataFrame being converted to empty ndarray
tnitka 734485f
Fix typographical error in motif script that prevented code execution
tnitka addd685
Correct error in selecting data labels
tnitka 7d656fa
Fix type error in scoring parameters
tnitka 98192c5
Fix type error in scoring parameters
tnitka 32c61e4
Fix type error in scoring parameters
tnitka 117dea5
Fix type error in scoring parameters
tnitka ab563aa
Fix type error in scoring parameters
tnitka 84f1b4b
Fix type error in motif.py
tnitka d4f69a5
Fix error passing params to scorer
tnitka 9008d46
Fix error passing params to scorer
tnitka bce9b70
Fix error passing params to scorer
tnitka 4874b50
Fix error passing params to scorer
tnitka fc154b3
Fix error passing params to scorer
tnitka 5669148
Fix error passing params to scorer
tnitka 25e18aa
Fix error passing params to scorer
tnitka 1dbd556
Fix error in motif object
tnitka 535169f
Fix valueerror in motif script
tnitka 0bca526
Change motif object to use pandas dataframe instead of numpy ndarray
tnitka ead24fa
Fix formatting errors in motif
tnitka f371cf4
Fix formatting errors in motif
tnitka 38e87d5
Fix formatting errors in motif
tnitka 7f7c47d
Fix input error in motif script
tnitka 2f5558a
Fix input error in motif script
tnitka 4fee5ea
Fix input error in motif script
tnitka 31efcde
Fix error passing params to motif module
tnitka 75430ea
Fix error passing params to score module
tnitka 1357fb3
Fix error passing params to score module
tnitka 354ec59
Fix error fetching output from score module
tnitka 6afaa55
Fix error fetching output from score module
tnitka 150e6ac
Fix error tabulating permutation scores
tnitka 6e92538
Fix error tabulating permutation scores
tnitka d370955
Fix error tabulating permutation scores
tnitka 58b082c
Fix error tabulating permutation scores
tnitka d6af933
Fix error tabulating permutation scores
tnitka e76ebe0
Fix error tabulating permutation scores
tnitka 4dd3867
Fix error tabulating permutation scores
tnitka c59cb9c
Fix error tabulating permutation scores
tnitka 4b55fce
Fix error tabulating permutation scores
tnitka 66f3c7f
Fix error tabulating permutation scores
tnitka aaf38cd
Fix error tabulating permutation scores
tnitka 32a247f
Fix reference error in motif module
tnitka 2e0da3d
Fix reference error in motif module
tnitka 80bf596
Fix array dimension error in motif script
tnitka 0ab36d5
Fix array dimension error in motif script
tnitka 0eb3a7e
Fix array dimension error in motif script
tnitka 2749537
Fix array dimension error in motif script
tnitka bf25fd6
Fix array dimension error in motif script
tnitka e4848e3
Allow array concatenation in motif script
tnitka 69c7995
Fix array dimension error in motif script
tnitka 836c956
Fix array dimension error in motif script
tnitka 1f4f6c4
Fix array dimension error in motif script
tnitka 65df50a
Fix array dimension error in motif script
tnitka 3130c72
Change p_values class in motif module to use dataframes
tnitka 9911684
Fix indexerror in motif module
tnitka 70295b4
Fix type error in motif module
tnitka 3a4bd06
Fix type error in motif module
tnitka 6898f68
Fix type error in motif module
tnitka 09d21d4
Fix type error in motif module
tnitka dd07587
Fix type error in motif module
tnitka 6836b3b
Fix error uncompressing input
tnitka df278b1
Fix error uncompressing input
tnitka ab32afa
Fix error uncompressing input
tnitka 8e56cbc
Fix error uncompressing input
tnitka 270e91d
Fix error uncompressing input
tnitka 04d80e7
Fix error in p_values class
tnitka e14fcb6
Fix input data typing error
tnitka e999781
Fix typing error in motif module
tnitka 08056d3
Fix type error in motif
tnitka c1aabe7
Fix type error in motif
tnitka 5e13e59
Fix type error in motif
tnitka 79de7e9
Fix type error in motif
tnitka eea9dc7
Fix type error in motif
tnitka a896562
Fix type error in motif
tnitka 02e9393
Fix type error in motif
tnitka cdfa4e4
Fix type error in motif
tnitka a0b2ccd
Fix type error in motif
tnitka c34dacd
Fix type error in motif
tnitka aac2f03
Fix type error in motif
tnitka c5a0f90
Fix type error in motif
tnitka 59cf33c
Fix type error in motif
tnitka 0a67d8e
Fix type error in motif
tnitka 1c2d4b7
Fix type error in motif
tnitka e783d62
Fix type error in motif
tnitka 39b988a
Fix type error in motif
tnitka 7ff7acf
Remove redundant output
tnitka cb0aa78
Prevent snakemake from expecting redundant output
tnitka 870d4af
Fix error that causes empty output file
tnitka ce902bb
Fix error that causes empty output file
tnitka 210494c
Fix error that causes empty output file
tnitka 7ec82ba
Fix error that causes empty output file
tnitka 4118256
Fix error that causes empty output file
tnitka 24aa508
Fix error that causes empty output file
tnitka 56ddb0c
Fix error that causes empty output file
tnitka f50bb79
Fix error that causes empty output file
tnitka e6c4730
Fix error that causes empty output file
tnitka 0359abb
Fix error that causes empty output file
tnitka 9ba204f
Fix error that causes empty output file
tnitka 893bd89
Fix error preventing kmers from being included in output
tnitka 1882e9a
Fix error preventing kmers from being included in output
tnitka f43560b
Fix error preventing all kmers from being scored
tnitka 62e1964
Fix indexing errors
tnitka 3df7a20
Fix indexing errors
tnitka 7b3f280
Fix indexing errors
tnitka e33699f
Fix indexing errors
tnitka 87e0034
Fix indexing errors
tnitka 641207e
Fix indexing errors
tnitka 8addb07
Fix indexing errors
tnitka c3a4d51
Fix indexing errors
tnitka 35f771c
Fix indexing errors
tnitka 274c3af
Fix indexing errors
tnitka 90df931
Explicitly name columns to fix KeyError in motif
tnitka 54f0190
Explicitly name columns to fix KeyError in motif
tnitka f3a24b6
Explicitly name columns to fix KeyError in motif
tnitka 99759ff
Explicitly name columns to fix KeyError in motif
tnitka c260458
Fix issue selecting kmer sequence in motif
tnitka 311bb6e
Fix issue selecting kmer sequence in motif
tnitka 0f54a89
Change iteration from DataFrame to NDArray
tnitka 7aad4fe
Change iteration from DataFrame to NDArray
tnitka 10d71c8
Change iteration from DataFrame to NDArray
tnitka b6e7625
Change iteration from DataFrame to NDArray
tnitka b7eb7e3
Change iteration from DataFrame to NDArray
tnitka 37de91d
Change iteration from DataFrame to NDArray
tnitka ce1ac40
Fix error fetching kmer scores from score
tnitka f8cdb37
Fix error fetching kmer scores from score
tnitka 250f15b
Fix error fetching kmer scores from score
tnitka 141c95f
Fix type error in p value calculation
tnitka 1539238
Remove redundant code
tnitka c62212b
Fix issue in motif rule that was sometimes causing MissingInputException
tnitka be64178
Ensure that all permutation scores are compared to the real score whe…
tnitka c403b28
Ensure that all permutation scores are compared to the real score whe…
tnitka ba857c5
Ensure that all permutation scores are compared to the real score whe…
tnitka 7512bb7
Add scores as output from motif
tnitka aa80508
Remove score output from motif due to I/O issues
tnitka 2fcf54b
Remove score output from motif due to I/O issues
tnitka d07a9b5
Temporarily print permutation scores to check whether they are identical
tnitka fec04e5
Remove printing as it is no longer necessary
tnitka 710cda1
Remove unnecessary conversions in motif script
tnitka 89e59f3
Fix syntax in motif script
tnitka 1cf25a4
Add scores as output from motif
tnitka eb8d2b7
Fix issue causing scores to be the same across iterations in motif
tnitka e8be1e2
Remove unused code
tnitka abb0a70
Change motif output to be sorted by p value
tnitka a22fe90
Fix motif output sorting order to put most significant kmers first
tnitka b3897ea
Fix issue parsing kmers when called with k=2
tnitka 9ccf635
Fix motif output sorting order to put most significant kmers first
tnitka efd7f02
Change output sorting to use p value first followed by score on real …
tnitka 9cb9834
Fix motif output sorting order to put highest scoring kmers first for…
tnitka 1f7a838
Fix formatting of csv containing scores from motif iterations
tnitka eab5490
Fix issue causing motif to read kmer NA as np.nan
tnitka 99e1526
Expand definition of false positives in motif to scores greater than …
tnitka dcea2f6
Fix issue causing too few scoring iterations to be compared in motif
tnitka 39aa1e5
Remove minimum family size from motif
tnitka c0631c1
Remove unnecessary family size check from motif snakefile
tnitka 04d38bd
Fix issue sorting negative scores in motif
tnitka 69e2624
Fix issue sorting kmers with negative weights in motif output
tnitka 6f58027
Parallelize the motif workflow
tnitka 333822c
Fix error sorting output of motif workflow
tnitka 788d086
Remove redundant code from motif workflow to slightly reduce memory u…
tnitka 16efac9
Reduce memory usage in motif and model
tnitka bf8dfb4
Further reduce memory usage in model
tnitka e5f4bd4
Fix an issue affecting rescoring results
tnitka 9496699
Reduce memory usage in motif
tnitka 670819a
Reduce memory usage of motif module
tnitka 17a24da
Reduce peak memory usage during motif
tnitka c875698
Fix error normalizing kmer weights in motif
tnitka 99cd156
Fix error normalizing kmer weights in motif
tnitka 4af188e
Fix error normalizing kmer weights in motif
tnitka d76238b
Fix error normalizing kmer weights in motif
tnitka fb615b0
Fix error normalizing scores in motif
tnitka 9a52700
Add preselection step to motif workflow
tnitka f57d8c0
Fix score scaling error in motif
tnitka d049f5f
Fix error in motif preselection step
tnitka 2cef422
Fix error calculating p-values in motif
tnitka dfff88f
Apply recursive feature elimination during motif preselection
tnitka 121df5f
Adjust motif preselection stopping criterion to improve results
tnitka f368ea7
Decrease RFE step size in motif preselection step
tnitka d8d0e06
Remove redundant code
tnitka 9bc50c2
fixup! Remove redundant code
tnitka 76410b4
Merge branch 'main' into functionmotifs
tnitka bd0b268
Format/lint code
tnitka 96d88d3
Format/lint motif snakefile
tnitka c195a9d
Fix whitespace error in snakefmt output
tnitka 2eba304
Format/lint snakefile
tnitka 2507498
update action.yml
tnitka 3c54509
Update test and fix command line parser error introduced during rebase
tnitka 4e81aee
fixup! Update test and fix command line parser error introduced durin…
tnitka 7383c42
Add motif test to CI workflow
tnitka 2d5cae5
Correct snekmer motif test environment
tnitka 8de742d
chore: update _version.py
tnitka 46b526c
Merge branch 'main' into functionmotifs
tnitka 8db7a27
Add Motif tutorial
tnitka 22f0a02
Update docs to include motif
tnitka 4c4ad19
Add model from RFE as output in motif
tnitka 9c47a7a
Update documentation for motif
tnitka aa1c11d
Fix formatting
tnitka 6924e9e
Add motif report output
tnitka 9f72ba1
docs: add demo pages for learn/apply and motif
christinehc 56f6607
docs: clean up files and update/create symlinks
christinehc 4ff1b0a
Fix formatting
tnitka cdb04f7
Remove redundant code from motif result script
tnitka 4d86439
move Motif tutorial into separate directory with more informative con…
tnitka a57cb02
Add motif to README.md
tnitka 64f4fef
Merge branch 'main' into functionmotifs
tnitka
@@ -48,4 +48,6 @@ score_dir: "output/example-model/"
learnapp:
  save_apply_associations: False


# motif params
motif:
  n: 200
@@ -11,4 +11,4 @@ scikit-learn
tabulate == 0.8.10
umap-learn
hdbscan
pyarrow
pyarrow
@@ -1 +1 @@
__version__ = "1.2.0"
__version__ = "1.3.0"
christinehc marked this conversation as resolved.
@@ -0,0 +1,109 @@
"""motif: Identification of structurally and functionally relevant motifs with Snekmer.
Created on Fri Apr 21 15:25:54 2023

author: @tnitka
"""
# ---------------------------------------------------------
# Imports
# ---------------------------------------------------------
# import pickle
# from datetime import datetime

# import snekmer as skm
import pandas as pd
import numpy as np
# import snekmer.motif
# from typing import Any, Dict, List, Optional
# from ._version import __version__
# from .vectorize import KmerBasis
# from .score import KmerScorer
# from .model import SnekmerModel, SnekmerModelCV
#from numpy.typing import NDArray
# from sklearn.base import BaseEstimator, ClassifierMixin
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
# from sklearn.linear_model import LogisticRegression  # LogisticRegressionCV
# from sklearn.model_selection import GridSearchCV, cross_validate
# from sklearn.pipeline import make_pipeline, Pipeline
# from sklearn.svm import SVC

# object to permute training data and retrain
class SnekmerMotif:
    """Permute training data and retrain to find highly distinguishing kmers.

    Parameters
    ----------
    n : int
        Number of permutations to test.
    scores : NDArray
    """
    def __init__(self):
        self.generator = np.random.default_rng()
        # self.scorer = skm.score.KmerScorer()

    def permute(self, X: pd.DataFrame, label, label_col="family"):
        """

        Parameters
        ----------
        X : Dataframe containing matrix of shape (n_kmers, n_features)
            Labeled training data.
        label : str
            Primary family label.
        label_col : str
            Column with family labels.

        Returns
        -------
        Dataframe
            Training data with permuted labels, for retraining and rescoring.

        """
        # save primary family label
        self.primary_label = label
        self.labels = X[label_col].values

        self.generator.shuffle(self.labels)
        # self.permuted_labels = self.generator.permutation(self.labels)
        # self.permuted_data = X
        X[label_col] = self.labels

        return X

    def p_values(self, X, y: np.ndarray, n: int):
        """

        Parameters
        ----------
        X: Dataframe containing matrix of shape (n_kmers, n_iterations)
            kmer scores from each permutation tested
        y: list or array-like of shape (n_kmers, 1)
            kmer scores from real training data
        n: int
            number of permutations tested

        Returns
        -------
        Dataframe
            matrix containing kmer sequences, scores on real data, number of scores
            on permuted data that exceed that on real data, n_iterations, and
            proportion of scores on permuted data that exceed that on real data.

        """
        # self.output = pd.DataFrame(columns=('kmer', 'real score', 'false positives', 'n', 'p'))
        self.output_matrix = np.empty((1, 5))
        for i in range(0, len(y)-1):
            self.seq = X['kmer'].iloc[i]
            self.real_score = y[i]
            self.false_score = X.iloc[i, 1:(n+1)].ge(self.real_score).sum()
            self.p = self.false_score/n
            self.vec = np.array([[self.seq, self.real_score, self.false_score, n, self.p]])
            self.output_matrix = np.append(self.output_matrix, self.vec, axis=0)


        else:
            self.output_matrix = np.delete(self.output_matrix, 0, 0)

        self.output = pd.DataFrame(self.output_matrix, columns=('kmer', 'real score', 'false positives', 'n', 'p'))

        return self.output
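
The permutation test in `SnekmerMotif` can also be written without the row-by-row loop. What follows is a minimal sketch for illustration, not the code merged in this PR: `permute_labels` and `empirical_p_values` are hypothetical helper names, the output column names simply mirror those in the diff, and the toy scores are made up. It shuffles a copy of the label column, so the caller's DataFrame is left unchanged, and counts for all kmers at once how many permuted scores are greater than or equal to the real score.

```python
import numpy as np
import pandas as pd


def permute_labels(X: pd.DataFrame, label_col: str = "family", rng=None) -> pd.DataFrame:
    """Return a copy of X whose label column has been randomly shuffled."""
    rng = np.random.default_rng() if rng is None else rng
    permuted = X.copy()
    # Generator.permutation returns a shuffled copy, so X itself is unchanged.
    permuted[label_col] = rng.permutation(X[label_col].to_numpy())
    return permuted


def empirical_p_values(kmers, real_scores, permuted_scores) -> pd.DataFrame:
    """Compare each kmer's real score against its scores on permuted data.

    permuted_scores is array-like with shape (n_kmers, n_permutations).
    """
    real = np.asarray(real_scores, dtype=float)
    perm = np.asarray(permuted_scores, dtype=float)
    n = perm.shape[1]
    # Broadcast each real score across its row and count permuted scores >= real.
    false_positives = (perm >= real[:, None]).sum(axis=1)
    return pd.DataFrame(
        {
            "kmer": list(kmers),
            "real score": real,
            "false positives": false_positives,
            "n": n,
            "p": false_positives / n,
        }
    )


# Toy data, made up purely for illustration.
kmers = ["AAA", "AAC", "ACA"]
real_scores = [0.9, 0.2, 0.5]
permuted_scores = [
    [0.1, 0.3, 0.2, 0.95],  # one of four permuted scores reaches 0.9
    [0.4, 0.2, 0.1, 0.3],   # three of four reach 0.2
    [0.1, 0.2, 0.3, 0.4],   # none reach 0.5
]
result = empirical_p_values(kmers, real_scores, permuted_scores)
print(result)
```

Two differences from the diff are worth noting. The sketch takes the permuted scores as a plain array rather than reading columns `1:(n+1)` of a DataFrame that also holds the `kmer` column. It also scores every kmer, whereas the loop in the diff runs over `range(0, len(y)-1)` and therefore stops before the final index; the PR does not say whether that is intentional.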
I see why this was changed for local testing, but once we merge the PR, this should be changed so that it no longer points to the functionmotifs branch.