Trying to understand compiler optimizations for data parallelism #24029

kvablack · 2024-09-30T20:53:29Z

kvablack
Sep 30, 2024

Hi all,

I'm currently running data-parallel training, and I would like to better understand how the JIT compiler inserts collective communication operations. My current setup, which I think is the "canonical" setup, is like this:

mesh = jax.sharding.Mesh(mesh_utils.create_device_mesh((jax.device_count(),)), ["batch"])
data_parallel_sharding = jax.sharding.NamedSharding(mesh, jax.sharding.PartitionSpec("batch"))
replicated_sharding = jax.sharding.NamedSharding(mesh, jax.sharding.PartitionSpec())

@partial(jax.jit, in_shardings=(replicated_sharding, data_parallel_sharding), out_shardings=(replicated_sharding, replicated_sharding)
def train_step(params, batch):
    loss, grads = jax.value_and_grad(compute_loss)(params, batch)
    new_params = jax.tree.map(lambda p: p + g * lr, params, grads)
    return new_params, loss

My understanding is that the JIT compiler will automatically partition the computation across devices (due to the data parallel sharding) as well as insert the necessary all-reduce operations to keep the parameters replicated. My mental model for the sequence of operations is that each device does its own individual forward + backward pass, then the gradients are all-reduced, then each device does a gradient update.

However, when I look at the optimized HLO dump and Perfetto trace, I see all-reduce operations sprinkled throughout the entire computation. Furthermore, when I add up the number of all-reduced elements, it is only ~60% of the total number of parameters. Lastly, when I enable mixed-precision training (bfloat16 activations), I see some of the all-reduces performed in bfloat16 (even though the loss and gradients are definitely all float32 at tracing time). Here are my concrete questions:

I think all-reducing in bfloat16 is generally a bad idea. I am having some numerical stability issues with lower learning rates, so this is pretty concerning to me. Is there a way to force JAX to do gradient reduction in float32?
How is it possible for only ~60% of the parameters to get reduced? Is there a way to get more visibility into why and how JAX is optimizing communication (including its choice to reduce some arrays in bfloat16, and others in float32)?

To try and understand things better, I implemented a shard_map version to manually take control of cross-device communication, like so:

mesh = jax.sharding.Mesh(mesh_utils.create_device_mesh((jax.device_count(),)), ["batch"])
data_parallel_sharding = jax.sharding.NamedSharding(mesh, jax.sharding.PartitionSpec("batch"))
replicated_sharding = jax.sharding.NamedSharding(mesh, jax.sharding.PartitionSpec())

@partial(jax.jit, in_shardings=(replicated_sharding, data_parallel_sharding), out_shardings=(replicated_sharding, replicated_sharding)
@partial(shard_map, mesh=mesh, in_specs=(replicated_sharding.spec, data_parallel_sharding.spec), out_specs=(replicated_sharding.spec, replicated_sharding.spec))
def train_step(params, batch):
    loss, grads = jax.value_and_grad(compute_loss)(params, batch)
    loss, grads = jax.lax.pmean((loss, grads), axis_name="batch")
    new_params = jax.tree.map(lambda p: p + g * lr, params, grads)
    return new_params, loss

Looking at the dumped HLO, it matches up much more with what I expect -- there is a big block of all-reduce operations at the end, they are all float32, and the total number of elements roughly adds up to the size of the network (although it's still missing 5% somehow -- maybe I miscounted). This gave me some more questions:

This shard_map example does jax.grad outside of shard_map, whereas I did it inside. However, this JEP specifically points out that there is an efficiency issue taking a grad of shard_map. What's wrong with keeping the grad inside, like I did?
Is there a way to get the automatic version to produce more similar HLO to the shard_map version?
Is it possible for the "more optimized" automatic version to be slower than the shard_map version? I ran into this awhile ago, which I documented in this issue. I use automatic partitioning for more advanced parallelism strategies (such as FSDP) but this makes me worry that I'm losing a ton of performance by not manually writing every matmul using shard_map.

mattjj · 2024-10-01T16:20:21Z

mattjj
Oct 1, 2024
Maintainer

Thanks for the questions!

I think all-reducing in bfloat16 is generally a bad idea. I am having some numerical stability issues with lower learning rates, so this is pretty concerning to me. Is there a way to force JAX to do gradient reduction in float32?

How is it possible for only ~60% of the parameters to get reduced? Is there a way to get more visibility into why and how JAX is optimizing communication (including its choice to reduce some arrays in bfloat16, and others in float32)?

Could you share a reproducer we could study? It's hard to guess at what's happening here.

This shard_map example does jax.grad outside of shard_map, whereas I did it inside. However, this JEP specifically points out that there is an efficiency issue taking a grad of shard_map. What's wrong with keeping the grad inside, like I did?

Well, that JEP is meant to explain that shard_map fixes the efficiency issue which affected pmap. There's nothing wrong with keeping the grad inside; it's just that with pmap you kinda had to do that, while with shard_map you can do either. So just write it however you find most natural!

Is there a way to get the automatic version to produce more similar HLO to the shard_map version?

It's hard to say what's going on with the automatic version, but in general I would suggest adding jax.lax.with_sharding_constraint annotations in your code. That said, the advantage of the explicit shard_map version is exactly that you know what is going on.

Are there any downsides for your work to just sticking with the manual approach?

(We're also working on another mode, where you still get to write code as if programming a single device, with a 'global view' of arrays and no explicit collectives, but rather than having sharding and partitioning decisions happen opaquely in the compiler, they're transparent and explicit at trace time, so that e.g. you can reflect on sharding just like you can on shapes.)

Is it possible for the "more optimized" automatic version to be slower than the shard_map version? I ran into this awhile ago, which I documented in this issue.

Yes, very possible. Actually, I wouldn't consider the automatic version to be "more optimized" in general; there's nothing it can do that you can't express with shard_map. Instead, think of "automatic" as "more convenient", like you can get decent performance or scalability without thinking much about distributed computing. Then to optimize the code, people either add more sharding constraints or drop into shard_map where appropriate. (You can do this incrementally, i.e. just handle some aspects of parallelism manually and others automatically, by using the auto argument in shard_map; that's a little clunky and under-documented, and we've been meaning to land a simplification...)

I use automatic partitioning for more advanced parallelism strategies (such as FSDP) but this makes me worry that I'm losing a ton of performance by not manually writing every matmul using shard_map.

If you're losing significant performance, I suggest opening a bug (like this one!) with some kind of reproducer we can go on. And add sharding constraints with jax.lax.with_sharding_constraint; in practice, such constraints are necessary to get good performance, rather than relying solely on XLA's sharding propagation from annotated inputs.

What do you think?

0 replies

kvablack · 2024-10-01T17:02:42Z

kvablack
Oct 1, 2024
Author

Hi Matt,

Thanks for the response! A minimal reproducer will take a bit of time and effort -- let me get back to you on that. It's worth noting, though, that I did submit the aforementioned performance issue with a minimal reproducer (at least as minimal as I could get it).

Are there any downsides for your work to just sticking with the manual approach?

At the moment, the shard_map version is producing a different loss curve, so I'm a bit hesitant to switch everything over (although that could be an issue on my end -- I need to run more tests). Aside from that, my main reservation was the difficulty of handling more advanced parallelism, like FSDP -- I didn't see the auto argument though, so I'll look into that; thanks for pointing out that feature!

add sharding constraints with jax.lax.with_sharding_constraint; in practice, such constraints are necessary to get good performance, rather than relying solely on XLA's sharding propagation from annotated inputs

Do you think you could give a bit more detail on where these annotations might be necessary? For simple data parallelism, I would imagine that the sharding propagation is fairly trivial. I've used jax.debug.inspect_array_sharding quite a bit, and it always matches up with what I expect -- parameters/gradients fully replicated, activations sharded along the batch dimension. (I assume that if the sharding already matches what I expect, then jax.lax.with_sharding_constraint will have no effect.) I'm not sure where else to add sharding constraints -- would I need to add them in the backward pass using jax.custom_vjp?

0 replies

mattjj · 2024-10-01T20:21:01Z

mattjj
Oct 1, 2024
Maintainer

It's worth noting, though, that I did submit the aforementioned performance issue with a minimal reproducer (at least as minimal as I could get it).

Oops, I see! Sorry we haven't yet followed up on that to its conclusion.

Aside from that, my main reservation was the difficulty of handling more advanced parallelism, like FSDP -- I didn't see the auto argument though, so I'll look into that; thanks for pointing out that feature!

Is FSDP hard to write with shard_map? There's an example in the docs as you probably know, which is basically "do an all_gather", but maybe that's not actually representing everything you had in mind? (Looking at that example again, I suspect we may not want the jax.remat there in practice...)

Do you think you could give a bit more detail on where these annotations might be necessary?

I actually don't know; I'm not enough of a practitioner. But if you have some code that is misbehaving, e.g. with extra allreduces, I'd try to constrain the sharding around that misbehavior.

I'm not sure where else to add sharding constraints -- would I need to add them in the backward pass using jax.custom_vjp?

I think people usually just add them in the primal computation (ie the computation being differentiated), and those have an effect on the backward pass code in that a with_sharding_constraint on the forward pass becomes a with_sharding_constraint on the backward pass applied to the corresponding gradients.

3 replies

kvablack Oct 1, 2024
Author

Is FSDP hard to write with shard_map?

A single layer is easy, but adapting a full model implementation is quite a bit of effort, since you need to add the appropriate logic to every single matmul, which may additionally require modifying neural network library implementations (e.g., flax.linen.Dense). In comparison, setting sharding constraints on the parameter arrays and letting the automatic partitioner do the rest (example) is super easy. It feels like magic and is one of the reasons I love JAX!

But if you have some code that is misbehaving, e.g. with extra allreduces, I'd try to constrain the sharding around that misbehavior

Perhaps the crux of the issue here is that I have no visibility into exactly where the code is misbehaving 😅 . All I can do is look at an optimized HLO dump, which is quite hard to parse (it's 25Mb of text!). This allowed me to find AllReduces that I don't think should be there, but it's impossible to tell what arrays are being reduced or why.

sholtodouglas Oct 8, 2024
Collaborator

Can you link the profile trace (I think this is the .xplane file or something - do you know Matt?), that will make it easier to figure out what exactly is going on.

Also - can you try with FSDP? Afaik the compiler has more pattern matches to FSDP since there is ~no disadv to doing that as compared to normal data parallelism.

One reason the all-reduces might be sprinkled throughout is that the compiler is trying to overlap the comms with the compute of the bwd pass.

kvablack Oct 8, 2024
Author

Hi Sholto,

Here are the trace files (.xplane.pb and .trace.json.gz) from our actual training pipeline. Not sure how helpful it will be without the source code though. I am also still seeing a huge performance difference between shard_map and no shard_map for this self-contained example.

Thanks for the tip about FSDP -- I'll take a look at some profile traces for that soon.

trace.zip

VinodTipparaju · 2024-10-08T17:31:42Z

VinodTipparaju
Oct 8, 2024

Does Jax have pipeline/model/tensor parallelism options? That may be the clue to what is happening under the hood. What is generated is not as trivial as forward+backward/*+collective.

Thanks,
Vinod.

0 replies

zayednetad · 2024-10-25T22:03:18Z

zayednetad
Oct 25, 2024

Hi @kvablack , Just wondering if you are still waiting for solution? I did some experiment with your code on a 8xA100 GPU BM and found all-reduce was 100%, was able to make JAX to cast FT32 during all-reduce to avoid numeric instability while other necesaary operations were in bfloat16 in a mixed precession training.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Trying to understand compiler optimizations for data parallelism #24029

{{title}}

Replies: 5 comments 3 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Trying to understand compiler optimizations for data parallelism #24029

kvablack Sep 30, 2024

Replies: 5 comments · 3 replies

mattjj Oct 1, 2024 Maintainer

kvablack Oct 1, 2024 Author

mattjj Oct 1, 2024 Maintainer

kvablack Oct 1, 2024 Author

sholtodouglas Oct 8, 2024 Collaborator

kvablack Oct 8, 2024 Author

VinodTipparaju Oct 8, 2024

zayednetad Oct 25, 2024

kvablack
Sep 30, 2024

Replies: 5 comments 3 replies

mattjj
Oct 1, 2024
Maintainer

kvablack
Oct 1, 2024
Author

mattjj
Oct 1, 2024
Maintainer

kvablack Oct 1, 2024
Author

sholtodouglas Oct 8, 2024
Collaborator

kvablack Oct 8, 2024
Author

VinodTipparaju
Oct 8, 2024

zayednetad
Oct 25, 2024