Issue recruiting GPU for AE training #27
Very strange, sorry for your troubles :( One thing you could try is running the integration test, which makes fake data and then runs a bunch of models on that data. It only runs each model for a couple epochs, so should be pretty fast. All you need to do is run the following command from the top level of the behavenet repo:
Give that a try and let me know if you have the same issue.
Hey, thanks for getting back to me! It gets stuck in the same way when I run integration.py; see output below. I will keep trying to find the problem. My best guess is that it results from fiddling with this:

(behavenet) C:\Users\cheveemf\Documents\GitHub\Maxime_tools\behavenet-master>python tests/integration.py
model: ae
I am trying to re-install everything and got this error. I did encounter it the first time around, and ended up just commenting torch out of the environment text file and installing it on its own afterwards using pip. Could that be the issue?
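A pip install of torch done separately from the environment file can easily end up as a CPU-only build on Windows, which would explain the GPU never being used. A minimal sanity check (this assumes torch is importable; it is not behavenet code):

```python
# Check whether the installed PyTorch build can see a GPU at all.
# If CUDA is unavailable, the problem is the torch install itself,
# not any behavenet configuration.
try:
    import torch
except ImportError:
    print("torch is not installed in this environment")
else:
    print("torch version:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("device:", torch.cuda.get_device_name(0))
```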
Hello, that seems to do the trick, thanks for your help!
Hello,
I would like to run the AE on my own videos but I cannot get it to work with my GPU.
The first problem:
(behavenet) C:\Users\cheveemf\Documents\GitHub\Maxime_tools\behavenet-master>python behavenet/fitting/ae_grid_search.py --data_config C:\Users\cheveemf.behavenet/Maxime_3120-210303-125248_params.json --model_config C:\Users\cheveemf/.behavenet/ae_model.json --training_config C:\Users\cheveemf/.behavenet/ae_training.json --compute_config C:\Users\cheveemf/.behavenet/ae_compute.json
Traceback (most recent call last):
  File "behavenet/fitting/ae_grid_search.py", line 181, in <module>
    hyperparams.optimize_parallel_gpu(main, gpu_ids=parallel_gpu_ids)
  File "C:\Users\cheveemf\Anaconda3\envs\behavenet\lib\site-packages\test_tube\argparse_hopt.py", line 348, in optimize_parallel_gpu
    self.pool = Pool(processes=nb_workers, initializer=init, initargs=(gpu_q,))
  File "C:\Users\cheveemf\Anaconda3\envs\behavenet\lib\multiprocessing\context.py", line 119, in Pool
    context=self.get_context())
  File "C:\Users\cheveemf\Anaconda3\envs\behavenet\lib\multiprocessing\pool.py", line 176, in __init__
    self._repopulate_pool()
  File "C:\Users\cheveemf\Anaconda3\envs\behavenet\lib\multiprocessing\pool.py", line 241, in _repopulate_pool
    w.start()
  File "C:\Users\cheveemf\Anaconda3\envs\behavenet\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "C:\Users\cheveemf\Anaconda3\envs\behavenet\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\cheveemf\Anaconda3\envs\behavenet\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\cheveemf\Anaconda3\envs\behavenet\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'HyperOptArgumentParser.optimize_parallel_gpu.<locals>.init'

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\cheveemf\Anaconda3\envs\behavenet\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\cheveemf\Anaconda3\envs\behavenet\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
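For context, the AttributeError above is Python's pickle refusing to serialize a function defined inside another function or method: multiprocessing's spawn start method (the only one on Windows) must pickle the Pool initializer to send it to child processes, and functions whose qualified name contains '<locals>' cannot be pickled by reference. A minimal stdlib illustration (names here are illustrative, not from test_tube):

```python
import pickle

def make_init():
    # Nested function: its qualified name is 'make_init.<locals>.init',
    # which a fresh child process cannot look up again, so pickle
    # refuses to serialize it.
    def init():
        pass
    return init

try:
    pickle.dumps(make_init())
except AttributeError as err:
    # Same class of error as in the traceback above.
    print(err)
```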
I "fixed" this issue following advice from this post: #8. I moved

def init(local_gpu_q):
    global g_gpu_id_q
    g_gpu_id_q = local_gpu_q

out of the class HyperOptArgumentParser in argparse_hopt (test_tube), and that seems to remove the error, but it now gets stuck somewhere and I can't find where:
(behavenet) C:\Users\cheveemf\Documents\GitHub\Maxime_tools\behavenet-master>python behavenet/fitting/ae_grid_search.py --data_config C:\Users\cheveemf.behavenet/Maxime_3120-210303-125248_params.json --model_config C:\Users\cheveemf/.behavenet/ae_model.json --training_config C:\Users\cheveemf/.behavenet/ae_training.json --compute_config C:\Users\cheveemf/.behavenet/ae_compute.json
DATA CONFIG:
lab: Maxime
expt: 3120-210303-125248
animal: 3120
session: 210303
n_input_channels: 1
y_pixels: 330
x_pixels: 370
use_output_mask: False
frame_rate: 20.0
neural_type: None
neural_bin_size: 0.05
approx_batch_size: 200
COMPUTE CONFIG:
device: cuda
n_parallel_gpus: 1
gpus_viz: 0
tt_n_gpu_trials: 128
tt_n_cpu_trials: 1000
tt_n_cpu_workers: 5
mem_limit_gb: 8.0
TRAINING CONFIG:
export_train_plots: True
export_latents: True
pretrained_weights_path: None
val_check_interval: 1
learning_rate: 0.0001
max_n_epochs: 1000
min_n_epochs: 500
enable_early_stop: False
early_stop_history: 10
rng_seed_train: None
as_numpy: False
batch_load: True
rng_seed_data: 0
train_frac: 1.0
trial_splits: 8;1;1;0
MODEL CONFIG:
experiment_name: ae-example
model_type: conv
n_ae_latents: 12
l2_reg: 0.0
rng_seed_model: 0
fit_sess_io_layers: False
ae_arch_json: None
model_class: ae
using data from following sessions:
F:\ISX videos to run through DLC\3120\3120-210303-125248\TDT video data\Behavenet\Maxime\3120-210303-125248\3120\210303
constructing data generator...done
Generator contains 1 SingleSessionDatasetBatchedLoad objects:
Maxime_3120-210303-125248_3120_210303
signals: ['images']
transforms: OrderedDict([('images', None)])
paths: OrderedDict([('images', 'F:\\ISX videos to run through DLC\\3120\\3120-210303-125248\\TDT video data\\Behavenet\\Maxime\\3120-210303-125248\\3120\\210303\\data.hdf5')])
constructing model...Initializing with random weights
done
Autoencoder architecture
Encoder architecture:
00: ZeroPad2d(padding=(1, 2, 1, 2), value=0.0)
01: Conv2d(1, 32, kernel_size=(5, 5), stride=(2, 2))
02: LeakyReLU(negative_slope=0.05)
03: Conv2d(32, 64, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
04: LeakyReLU(negative_slope=0.05)
05: Conv2d(64, 128, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
06: LeakyReLU(negative_slope=0.05)
07: ZeroPad2d(padding=(2, 2, 1, 2), value=0.0)
08: Conv2d(128, 256, kernel_size=(5, 5), stride=(2, 2))
09: LeakyReLU(negative_slope=0.05)
10: ZeroPad2d(padding=(0, 1, 2, 2), value=0.0)
11: Conv2d(256, 512, kernel_size=(5, 5), stride=(5, 5))
12: LeakyReLU(negative_slope=0.05)
13: Linear(in_features=12800, out_features=12, bias=True)
Decoder architecture:
00: Linear(in_features=12, out_features=12800, bias=True)
01: ConvTranspose2d(512, 256, kernel_size=(5, 5), stride=(5, 5))
02: LeakyReLU(negative_slope=0.05)
03: ConvTranspose2d(256, 128, kernel_size=(5, 5), stride=(2, 2))
04: LeakyReLU(negative_slope=0.05)
05: ConvTranspose2d(128, 64, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
06: LeakyReLU(negative_slope=0.05)
07: ConvTranspose2d(64, 32, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
08: LeakyReLU(negative_slope=0.05)
09: ConvTranspose2d(32, 1, kernel_size=(5, 5), stride=(2, 2))
10: Sigmoid()
epoch 0000/1000
0%| | 0/80 [00:00<?, ?it/s]
It never progresses.
A few notes:
Any ideas would be very welcome :)