
[WIP] Feature/111/pipeline functions #157

Merged · 58 commits · Jun 4, 2019
Commits (58)
9cba46d
Migration of file used in tallamjr/plasticc repo
tallamjr May 8, 2019
dd8277d
Minor linting improvements + comments
tallamjr May 8, 2019
520150d
Changing mode of file
tallamjr May 8, 2019
2acd69a
Renaming functions to be in line with code style
tallamjr May 10, 2019
c898f96
Tidying up file and renaming function names
tallamjr May 14, 2019
660519a
Change mode of run_plasticc_pipeline file to 744
tallamjr May 14, 2019
906d95e
Updating create_folder_structure function
tallamjr May 14, 2019
e40a785
Removing options in config to be in script instead
tallamjr May 14, 2019
cc84318
Moving old utils files to an archival folder
tallamjr May 14, 2019
edfd84d
Tidy up import block
tallamjr May 14, 2019
98e8800
[WIP] Updating functions in pipeline script
tallamjr May 14, 2019
2992782
[WIP] Further updates to pipeline script
tallamjr May 14, 2019
8bd050f
Modifying file structure inside utils directory
tallamjr May 14, 2019
3e85f5b
Updating configuration file
tallamjr May 14, 2019
4179b2c
Append git hash to analysis name
tallamjr May 15, 2019
32eb2eb
Updating variable names to be consistent
tallamjr May 17, 2019
cdf659d
Updating new var name to be consistent with gps.py
tallamjr May 17, 2019
5fb499e
Reducing number of PCA components
tallamjr May 18, 2019
8bb380c
Adding None return if key-value not found
tallamjr May 18, 2019
429355b
Removing unnecessary print statements
tallamjr May 18, 2019
f84d22b
Adding timestamp helper function
tallamjr May 18, 2019
be45b59
Fixes Type error: can't concat str to bytes
tallamjr May 18, 2019
fa0373e
Updating path to features directory for wavelets
tallamjr May 20, 2019
d16bc3e
Fixing spelling error for 'Principal' in PCA
tallamjr May 20, 2019
699b902
Converting wavelet features to pandas dataframe
tallamjr May 20, 2019
27a0f6e
Updating confusion matrix functions
tallamjr May 20, 2019
2b19bab
Updates made to 'create_classifier' functions
tallamjr May 20, 2019
a267363
Save SHA and timestamp inside copy of config file
tallamjr May 21, 2019
3ad79b4
Remove unused function argument
tallamjr May 21, 2019
9c8d870
Updating docstrings
tallamjr May 21, 2019
d2dd843
Adding _to_pandas() helper functions
tallamjr May 21, 2019
0e4fe56
Adding roc/auc metrics to create_classifier()
tallamjr May 21, 2019
83f91f4
Fixing error of no new folder being created
tallamjr May 21, 2019
c5593d5
Updating gitignore
tallamjr May 21, 2019
382a251
Updating save_configuration_file function
tallamjr May 22, 2019
8ad52aa
Adding option to save wavelet features to disk
tallamjr May 22, 2019
98867b9
Adding option to restart from saved wavelets
tallamjr May 22, 2019
80971f2
Moving restart option to its own function call
tallamjr May 22, 2019
42486c8
Return wavelet_components as a pandas DataFrame
tallamjr May 22, 2019
d50ec44
Rearrange imports to be PEP8 compliant
tallamjr May 22, 2019
ae8d50e
Updating variable name
tallamjr May 22, 2019
89e3bf5
Changing file that logs parameters to append mode
tallamjr May 22, 2019
91d84a6
This will open file for reading/writing (updating)
tallamjr May 22, 2019
7e281e2
Fixing typo in saving and reading pickled df
tallamjr May 22, 2019
e32f65e
Including 'imbalanced-learn' package as dependency
tallamjr May 29, 2019
47f6125
Return figure as well as confusion matrix from func
tallamjr May 29, 2019
3242bf3
Adding functionality to rebalance classes
tallamjr May 29, 2019
cecb4ac
Fix a path bug
Catarina-Alves May 29, 2019
080434b
Fix a method call
Catarina-Alves May 29, 2019
2c86dc7
Updating variable name, ncomp --> number_comp
tallamjr Jun 2, 2019
0eb5572
[FIXUP] Updating variable name, ncomp
tallamjr Jun 2, 2019
d07bbdc
Adding 'get_directories()' function
tallamjr Jun 2, 2019
854ebc2
[FIXUP] Adding debug print statement
tallamjr Jun 3, 2019
d32ca95
Updating docstrings
tallamjr Jun 3, 2019
89caf3c
Updating variable name, dirs --> directories
tallamjr Jun 3, 2019
33cffea
Fixing version of sncosmo for debug checks
tallamjr Jun 4, 2019
13fb8b6
Save the balancing method and the number of PCA components used for t…
Catarina-Alves Jun 4, 2019
c251337
Bump version 1.3.2 --> 1.4.0
tallamjr Jun 4, 2019
3 changes: 3 additions & 0 deletions .gitignore
@@ -2,6 +2,9 @@
test/*
!test/*.py

# Do not track log files in utils
utils/*stdout.txt

## Python.gitignore from Github.
##
# Byte-compiled / optimized / DLL files
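The new `.gitignore` rule keeps per-run log files under `utils/` out of version control. Its wildcard behaves much like shell-style matching, which can be sketched with Python's `fnmatch` (note this is only an approximation: git's `*` does not cross `/` separators, while `fnmatch`'s does):

```python
from fnmatch import fnmatch

# The new rule: any file under utils/ whose name ends in "stdout.txt"
pattern = "utils/*stdout.txt"

print(fnmatch("utils/run_stdout.txt", pattern))  # matched, so git would ignore it
print(fnmatch("utils/helpers.py", pattern))      # not matched, so still tracked
```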
5 changes: 3 additions & 2 deletions environment.yml
@@ -11,17 +11,18 @@ dependencies:
- jupyter>=1.0.0
- matplotlib>=1.5.1
- numpy=1.12.0
- scikit-learn=0.18.1
- scikit-learn>=0.20
- scipy>=0.17.0
- george>=0.3.0
- iminuit>=1.2
- pandas>=0.23.0
- extinction>=0.3.0
- imbalanced-learn>=0.4.3

- pip:
- emcee>=2.1.0
- numpydoc>=0.6.0
- pywavelets>=0.4.0
- sncosmo>=1.3.0
- sncosmo==1.7.1
- nose>=1.3.7
- future>=0.16
6 changes: 3 additions & 3 deletions snmachine/gps.py
@@ -56,7 +56,7 @@ def compute_gps(dataset, number_gp, t_min, t_max, kernel_param=[500., 20.], outp
output_root : {None, str}, optional
If None, don't save anything. If str, it is the output directory, so save the flux and error estimates and used kernels there.
number_processes : int, optional
Number of processors to use for parallelisation (shared memory only). By default `nprocesses` = 1.
Number of processors to use for parallelisation (shared memory only). By default `number_processes` = 1.
gp_algo : str, optional
which gp package is used for the Gaussian Process Regression, GaPP or george
"""
@@ -148,7 +148,7 @@ def _compute_gps_parallel(dataset, number_gp, t_min, t_max, kernel_param, output
output_root : {None, str}, optional
If None, don't save anything. If str, it is the output directory, so save the flux and error estimates and used kernels there.
number_processes : int, optional
Number of processors to use for parallelisation (shared memory only). By default `nprocesses` = 1.
Number of processors to use for parallelisation (shared memory only). By default `number_processes` = 1.
gp_algo : str, optional
which gp package is used for the Gaussian Process Regression, GaPP or george
"""
@@ -413,4 +413,4 @@ def get_kernel(kernel_name, kernel_param):
elif kernel_name == 'ExpSquared+ExpSine2':
kExpSine2 = kernel_param[4]*george.kernels.ExpSine2Kernel(gamma=kernel_param[5],log_period=kernel_param[6])
kernel = kExpSquared + kExpSine2
return kernel
return kernel
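`get_kernel` builds the composite `'ExpSquared+ExpSine2'` covariance by adding two george kernels. A minimal numerical sketch of that combination, assuming the standard forms of the squared-exponential and exp-sine-squared kernels (pure Python, not george's actual API):

```python
import math

def exp_squared(r, amp, metric):
    # k(r) = amp * exp(-r^2 / (2 * metric)): standard squared-exponential form
    return amp * math.exp(-r**2 / (2.0 * metric))

def exp_sine2(r, amp, gamma, log_period):
    # k(r) = amp * exp(-gamma * sin^2(pi * r / period)): periodic term,
    # matching the (gamma, log_period) parameterisation used in get_kernel
    period = math.exp(log_period)
    return amp * math.exp(-gamma * math.sin(math.pi * r / period) ** 2)

def composite(r, p):
    # Mirrors `kernel = kExpSquared + kExpSine2` with the same index layout
    # as kernel_param in get_kernel (indices 0-1 and 4-6)
    return exp_squared(r, p[0], p[1]) + exp_sine2(r, p[4], p[5], p[6])

params = [1.0, 500.0, 0.0, 0.0, 1.0, 2.0, math.log(20.0)]
print(composite(0.0, params))  # at r = 0 each kernel equals its amplitude -> 2.0
```

The parameter names and functional forms here are illustrative; the real implementation delegates entirely to `george.kernels`.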
2 changes: 1 addition & 1 deletion snmachine/snaugment.py
@@ -123,7 +123,7 @@ def extract_proxy_features(self,peak_filter='desr',nproc=1,fit_salt2=False,salt2
#tf=snfeatures.TemplateFeatures(sampler='leastsq')
tf=snfeatures.TemplateFeatures(sampler=sampler)
if salt2feats is None:
salt2feats=tf.extract_features(self.dataset,nprocesses=nproc,use_redshift=fix_redshift)
salt2feats=tf.extract_features(self.dataset,number_processes=nproc,use_redshift=fix_redshift)

#fit models and extract r-peakmags
peaklogflux=[]
10 changes: 5 additions & 5 deletions snmachine/snclassifier.py
100755 → 100644
@@ -608,7 +608,7 @@ def __call_classifier(classifier, X_train, y_train, X_test, param_dict, return_c


def run_pipeline(features, types, output_name='', columns=[], classifiers=['nb', 'knn', 'svm', 'neural_network', 'boost_dt'],
training_set=0.3, param_dict={}, nprocesses=1, scale=True,
training_set=0.3, param_dict={}, number_processes=1, scale=True,
plot_roc_curve=True, return_classifier=False, classifiers_for_cm_plots=[],
type_dict=None, seed=1234):
"""
@@ -632,7 +632,7 @@ def run_pipeline(features, types, output_name='', columns=[], classifiers=['nb',
the ID's of the objects to be used
param_dict : dict, optional
Use to run different ranges of hyperparameters for the classifiers when optimising
nprocesses : int, optional
number_processes : int, optional
Number of processors for multiprocessing (shared memory only). Each classifier will then be run in parallel.
scale : bool, optional
Rescale features using sklearn's preprocessing Scalar class (highly recommended this is True)
@@ -707,15 +707,15 @@ def run_pipeline(features, types, output_name='', columns=[], classifiers=['nb',
probabilities = {}
classifier_objects = {}

if nprocesses > 1 and return_classifier:
if number_processes > 1 and return_classifier:
print("Due to limitations with python's multiprocessing module, classifier objects cannot be returned if " \
"multiple processors are used. Continuing serially...")
print()

if nprocesses > 1 and not return_classifier:
if number_processes > 1 and not return_classifier:
partial_func=partial(__call_classifier, X_train=X_train, y_train=y_train, X_test=X_test,
param_dict=param_dict, return_classifier=False)
p = Pool(nprocesses, maxtasksperchild=1)
p = Pool(number_processes, maxtasksperchild=1)
result = p.map(partial_func, classifiers)

for i in range(len(result)):
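When `number_processes > 1`, `run_pipeline` fixes the shared training data with `functools.partial` and maps the remaining classifier names over a process pool. A minimal sketch of that dispatch pattern with a stand-in worker (a thread-backed pool is used here so the sketch runs anywhere; the worker and its return value are illustrative, not snmachine's API):

```python
from functools import partial
from multiprocessing.dummy import Pool  # thread-backed Pool with the same map API

def call_classifier(name, X_train, y_train):
    # Stand-in for snclassifier's __call_classifier: "train" classifier `name`
    return (name, len(X_train), len(y_train))

X_train, y_train = [[0.1, 0.2], [0.3, 0.4]], [0, 1]
classifiers = ["nb", "knn", "svm"]

# Bind the shared arguments once, then map over classifier names,
# mirroring the partial_func / p.map calls in run_pipeline
partial_func = partial(call_classifier, X_train=X_train, y_train=y_train)
with Pool(2) as p:
    results = p.map(partial_func, classifiers)

print(results)  # [('nb', 2, 2), ('knn', 2, 2), ('svm', 2, 2)]
```

The real pipeline uses `multiprocessing.Pool(number_processes, maxtasksperchild=1)`, which is why classifier objects cannot be returned in parallel mode: they would have to be pickled back from worker processes.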