-
Hi @chunyueli, thanks for using CEBRA! Implementing seeding has been on the roadmap, but it is not fully done yet. Several classes in CEBRA would be involved; the minimal implementation would be to add seeding to those classes.
If you are interested in working on this, I would be happy to go back and forth over the required changes.

Regarding your dataset: it presents an additional special case. It seems like your embedding is broken into two parts. The consistency guarantee then does not apply to the whole embedding, but only to the two subparts; consistency on the whole embedding requires that the ground-truth latent space is connected. If this is more of a dataset question than a seeding question, I am also happy to follow up on that. Thoughts?
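As a rough illustration of what the consistency guarantee means in practice, one can fit a linear map between the embeddings from two runs and check how much variance it explains; for a disconnected dataset like the one described, the check would be applied to each sub-part separately. The helper below is a sketch with made-up data, not CEBRA's built-in consistency metric, and the name linear_consistency_r2 is hypothetical.

import numpy as np

def linear_consistency_r2(emb_a, emb_b):
    """Fit a linear map from emb_b to emb_a and report the R^2 of the fit.
    High values mean the two runs agree up to a linear transform.
    Illustrative sketch only; not CEBRA's own consistency metric."""
    # Add an intercept column so offsets between runs are absorbed.
    b = np.hstack([emb_b, np.ones((len(emb_b), 1))])
    coef, *_ = np.linalg.lstsq(b, emb_a, rcond=None)
    pred = b @ coef
    ss_res = ((emb_a - pred) ** 2).sum()
    ss_tot = ((emb_a - emb_a.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot

# Dummy example: two "runs" that differ only by a rotation score ~1.0.
rng = np.random.default_rng(0)
run1 = rng.normal(size=(500, 3))
theta = np.pi / 4
rot = np.array([[np.cos(theta), -np.sin(theta), 0],
                [np.sin(theta),  np.cos(theta), 0],
                [0, 0, 1]])
run2 = run1 @ rot
print(linear_consistency_r2(run1, run2))  # close to 1.0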
-
Thank you so much for developing such useful tools.
We are currently using the CEBRA model to obtain neural embeddings from calcium neural activity data, and we are pleased with the results. We would like to include the embedding figures in our paper. To ensure the results are reproducible, I set a seed using the following code before creating the CEBRA model, and I also set a seed for splitting the training and testing datasets. However, I noticed that the embeddings still vary across runs.
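For concreteness, a seeded split could look like the sketch below; it assumes scikit-learn's train_test_split and uses dummy data, which may not match the splitting code actually used in the pipeline.

# Sketch only: assumes scikit-learn's train_test_split and dummy data;
# the real splitting code may differ.
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42  # placeholder seed value

# Dummy stand-ins for calcium activity (time x neurons) and labels.
neural_data = np.random.default_rng(SEED).normal(size=(1000, 50))
labels = np.random.default_rng(SEED).integers(0, 2, size=1000)

neural_train, neural_test, labels_train, labels_test = train_test_split(
    neural_data, labels, test_size=0.2, random_state=SEED, shuffle=True
)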
I understand that the CEBRA model has inherent consistency, and that from your perspective setting an additional seed may not provide much added benefit. However, we hope to generate the same embeddings across runs to avoid unnecessary confusion when reviewers run our code. Could you please provide any guidance on how to correctly set the seed to control randomness and ensure reproducibility? Thank you.
I apologize for raising this topic again. I know similar questions have been asked. I’ve opened a new issue hoping for a quick response due to an upcoming deadline. Thank you for your understanding, and sorry for any inconvenience.
Code:
import random
import numpy as np
import torch

def set_seed(seed, device_id='cuda:0'):
    # Seed Python's, NumPy's, and PyTorch's global random number generators.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.set_device(device_id)
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    # Force cuDNN to select deterministic kernels (at some cost in speed).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
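For reference, here is a sketch of how set_seed could be used end to end with the scikit-learn-style cebra.CEBRA estimator. The hyperparameters, the dummy data, and the extra determinism settings are illustrative assumptions, not the exact script behind the figures. Note that even with all seeds fixed, some GPU kernels are nondeterministic unless torch.use_deterministic_algorithms is enabled and CUBLAS_WORKSPACE_CONFIG is set.

import os
import numpy as np
import torch
import cebra

# Deterministic cuBLAS kernels require this workspace setting on CUDA.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

set_seed(2024)  # the helper defined above; 2024 is a placeholder seed
# Warn whenever PyTorch falls back to a nondeterministic operation.
torch.use_deterministic_algorithms(True, warn_only=True)

# Dummy stand-in for the calcium activity matrix (time x neurons).
neural_data = np.random.default_rng(0).normal(size=(1000, 50)).astype("float32")

# Illustrative hyperparameters; adjust to the actual analysis.
model = cebra.CEBRA(
    output_dimension=3,
    max_iterations=1000,
    batch_size=512,
    device="cuda_if_available",
)
model.fit(neural_data)
embedding = model.transform(neural_data)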
Below are the embeddings for the same dataset across runs: