Parallel networks using flax #2862
-
Is it possible to create multiple instances of a network and optimize all of them in parallel on the same data? I'm trying to run a small experiment many times and was wondering if flax + optax + … could help here. Thanks!
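For anyone landing on this thread: a minimal single-device sketch of the pattern being asked about (not from the thread itself) uses `jax.vmap` to hold one set of parameters per ensemble member and update all members on the same batch. The toy `Model`, shapes, and hyperparameters below are illustrative assumptions:

```python
import jax
import jax.numpy as jnp
import flax.linen as nn
import optax

class Model(nn.Module):
    @nn.compact
    def __call__(self, x):
        return nn.Dense(1)(nn.relu(nn.Dense(32)(x)))

model = Model()
tx = optax.adam(1e-3)

def init_member(rng):
    return model.init(rng, jnp.ones([1, 8]))['params']

# A batch of parameter pytrees: one ensemble member per leading-axis slice.
num_members = 8
params = jax.vmap(init_member)(jax.random.split(jax.random.PRNGKey(0), num_members))
opt_state = jax.vmap(tx.init)(params)

def loss_fn(p, x, y):
    preds = model.apply({'params': p}, x)
    return jnp.mean((preds - y) ** 2)

@jax.jit
def train_step(params, opt_state, x, y):
    # vmap over the ensemble axis of params/opt_state; x and y are shared,
    # so every member trains on the same data.
    def one_member(p, s):
        grads = jax.grad(loss_fn)(p, x, y)
        updates, s = tx.update(grads, s, p)
        return optax.apply_updates(p, updates), s
    return jax.vmap(one_member)(params, opt_state)

x, y = jnp.ones([16, 8]), jnp.ones([16, 1])
params, opt_state = train_step(params, opt_state, x, y)
```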
-
Hey @mohamad-amin, what you describe is certainly possible with `pmap` + Flax. It seems you want to create an ensemble and train it in parallel; check out the Ensembling on multiple devices guide for some pointers.

The `train_step` is usually a vanilla JAX function (`jit`/`pmap`), so there should be no issues if you follow conventions like using `flax.training.train_state.TrainState` to pass the `apply` function and stuff like that. For more info, check out our (recently updated) Quick Start guide.
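To make that concrete, here is a minimal multi-device sketch in the spirit of the Ensembling on multiple devices guide (not the guide's exact code): each device initializes its own ensemble member from a different RNG key via `pmap`, and `flax.jax_utils.replicate` sends the same batch to every device. The toy `Model` and shapes are assumptions:

```python
import jax
import jax.numpy as jnp
import flax.linen as nn
import optax
from flax import jax_utils
from flax.training import train_state

class Model(nn.Module):
    @nn.compact
    def __call__(self, x):
        return nn.Dense(1)(nn.relu(nn.Dense(32)(x)))

def create_train_state(rng):
    model = Model()
    params = model.init(rng, jnp.ones([1, 8]))['params']
    return train_state.TrainState.create(
        apply_fn=model.apply, params=params, tx=optax.adam(1e-3))

# One ensemble member per device: different init seeds, shared data.
num_devices = jax.local_device_count()
states = jax.pmap(create_train_state)(
    jax.random.split(jax.random.PRNGKey(0), num_devices))

@jax.pmap
def train_step(state, batch):
    def loss_fn(params):
        preds = state.apply_fn({'params': params}, batch['x'])
        return jnp.mean((preds - batch['y']) ** 2)
    grads = jax.grad(loss_fn)(state.params)
    # No lax.pmean here: each device updates its own independent member.
    return state.apply_gradients(grads=grads)

batch = {'x': jnp.ones([16, 8]), 'y': jnp.ones([16, 1])}
states = train_step(states, jax_utils.replicate(batch))  # same batch everywhere
```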
-
Hi @cgarciae, I'm also curious about the parallel training scheme in the Ensembling on multiple devices guide; specifically, I don't understand why `replicate` is used to propagate the batched data here. Shouldn't a different mini-batch be trained on each device (instead of identical replicas)? Thanks!
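For context on that question: in the ensembling setup each device holds *different* parameters, so replicating the *same* batch is intentional, since every ensemble member should see identical data. Sharding the batch across devices is the classic data-parallel pattern, where all devices hold the same parameters and gradients are averaged with `jax.lax.pmean`. A minimal sketch contrasting the two (shapes are illustrative, and the shard reshape assumes the batch divides evenly across devices):

```python
import jax
import jax.numpy as jnp
from flax import jax_utils

num_devices = jax.local_device_count()
x = jnp.ones([16, 8])  # toy batch: 16 examples, 8 features

# Ensembling: same data, different per-device params -> replicate.
# A leading device axis is added: [num_devices, 16, 8].
x_replicated = jax_utils.replicate(x)

# Data parallelism: different data, same params -> shard the batch.
# Each device gets 16 // num_devices examples (assumes even division);
# the train_step would then average grads with jax.lax.pmean.
x_sharded = x.reshape(num_devices, 16 // num_devices, 8)
```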