
Fix halo again #23

Merged
merged 8 commits into from
Jul 19, 2024

Conversation

@ASKabalan (Collaborator) commented Jul 5, 2024

This PR follows up on #19.
Instead of allocating with mlir.full_like_aval, I now allocate with cudecompMalloc, keeping a fallback to XLA allocation via JD_ALLOCATE_WITH_XLA=1.
@aboucaud @EiffL Good for review after merging #19.
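The environment-variable fallback can be sketched as follows. This is a minimal illustration, not the actual jaxDecomp code: the helper name `allocate_with_xla` is hypothetical, and the real dispatch between cudecompMalloc and XLA buffers lives in the C++/CUDA extension.

```python
import os

def allocate_with_xla() -> bool:
    # Hypothetical helper mirroring the JD_ALLOCATE_WITH_XLA switch:
    # cudecompMalloc is the default; setting the variable to "1" opts
    # back into XLA-managed workspace buffers.
    return os.environ.get("JD_ALLOCATE_WITH_XLA", "0") == "1"

os.environ["JD_ALLOCATE_WITH_XLA"] = "1"
print(allocate_with_xla())  # True
```

Keeping the old path behind a flag like this makes the regression easy to reproduce on demand, as discussed below.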

@aboucaud commented Jul 8, 2024

Out of my league...

@ASKabalan (Collaborator, Author)

Still much simpler than JAX distributed 🥲

@EiffL (Member) commented Jul 9, 2024

Hum... but I don't understand what the issue was here. As long as the workspace is created by XLA on each call to the primitive, and the primitive uses the provided workspace on the input stream, what goes wrong?

@ASKabalan (Collaborator, Author) commented Jul 9, 2024

The workspace is not created at each call.
It is created only once, during the lowering phase, as ir.constants.

If I run an LPT, it is fine.
If I run an N-body sim with a mesh size of 256, it is also fine (on a V100 32 GB).
At 512 and above I get a deadlock; on further inspection, it appeared that some of the processes performed an illegal memory access (which caused the deadlock).

I inspected the memory and found:

Calling cudecompUpdateHalosX : 0x14a75f004300 On axis 0
Calling cudecompUpdateHalosX : 0x14a75f004300 On axis 1
Calling cudecompUpdateHalosX : 0x14a75f004300 On axis 2

Calling cudecompUpdateHalosX : 0x151003004300 On axis 0
Calling cudecompUpdateHalosX : 0x151003004300 On axis 1
Calling cudecompUpdateHalosX : 0x151003004300 On axis 2

XLA changes the memory address, which should not happen (again, the lowering happens only once).
This means the memory has been repurposed.

With cudecompMalloc this does not happen.

You can still reproduce it with a 512 or 1024 mesh on no more than 4 GPUs by setting JD_ALLOCATE_WITH_XLA=1.
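The stable-address behaviour can be illustrated with a minimal Python stand-in. This is purely illustrative: the real code calls cudecompMalloc and returns a CUDA device pointer; here a cached `bytearray` plays that role.

```python
# Allocate a workspace once per size and reuse it across calls, so its
# address stays stable, unlike a buffer baked in at lowering time that
# the runtime may later repurpose.
_workspace_cache = {}

def get_workspace(nbytes):
    # Stand-in for cudecompMalloc; real code returns a device pointer.
    if nbytes not in _workspace_cache:
        _workspace_cache[nbytes] = bytearray(nbytes)
    return _workspace_cache[nbytes]

a = get_workspace(1 << 20)
b = get_workspace(1 << 20)
print(a is b)  # True: the same buffer (same address) is reused
```

With this pattern, the pointer printed by cudecompUpdateHalosX would stay identical across calls, which is exactly the difference observed in the logs above.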

@ASKabalan (Collaborator, Author)

To get the error to show, you need to remove all block_until_ready() statements
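Removing the synchronisation matters because JAX dispatches work asynchronously. A plain-Python analogy, using `concurrent.futures` to stand in for JAX's async dispatch (this is an analogy, not the JAX API), shows why an unsynchronised consumer can race ahead of in-flight work:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_halo_exchange():
    # Stands in for a device kernel still running in the background.
    time.sleep(0.05)
    return "halo done"

with ThreadPoolExecutor() as pool:
    fut = pool.submit(slow_halo_exchange)
    # At this point the work is likely still in flight, just as a JAX
    # array may be before block_until_ready(); touching its buffer now
    # would be the moral equivalent of the illegal access above.
    result = fut.result()  # plays the role of block_until_ready()

print(result)
```

With `block_until_ready()` in place everywhere, every computation completes before the next step touches its buffers, which is why the bug only surfaces once those calls are removed.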

@EiffL (Member) commented Jul 9, 2024 via email

@EiffL (Member) commented Jul 9, 2024 via email

@ASKabalan (Collaborator, Author) commented Jul 9, 2024

I am going to try to reproduce it and let you know.
The halo_extent here is half of the halo size, but from what I see there is enough space:
the input size is mesh + half of the halo, and the extent accounts for the other half.

This should be enough.

As for the FFT, the allocation is handled by us rather than by cuFFT, because we use cufftSetAutoAllocation, and the workspace is shared between the FFT and the transposes.

The only difference I see is that the FFT is in-place (it uses some temporary parameters, but nothing dangerous).
The transpositions are in-place as well.

Could the extra memory required by a halo exchange be larger? In any case, it cannot be anything else, since the illegal access is not reproduced when I use cudecomp.
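A workspace shared between stages like this is typically sized to the maximum of the individual requirements, since only one stage uses it at a time. The sketch below illustrates that sizing rule only; the function name and parameters are hypothetical, not the cuDecomp or cuFFT API.

```python
def shared_workspace_size(fft_bytes, transpose_bytes, halo_bytes=0):
    # One buffer serves the FFT, the transposes, and (optionally) the
    # halo exchange in turn, so it must fit the largest single request.
    return max(fft_bytes, transpose_bytes, halo_bytes)

print(shared_workspace_size(512, 768, 256))  # 768
```

If a halo exchange needed more scratch space than the FFT and transposes, a workspace sized without it would be too small, which is one way an out-of-bounds access could arise.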

jaxdecomp/_src/halo.py (review thread, outdated)
src/halo.cu (review thread, resolved)
@EiffL (Member) commented Jul 19, 2024

I think it's better to let XLA handle memory, because it can decide to free it when not necessary.

@EiffL EiffL self-requested a review July 19, 2024 14:38
@EiffL (Member) left a comment

We should remove the CUDA memory allocation path for halos; better to be consistent in memory handling throughout.

@ASKabalan (Collaborator, Author)

I just tested.
It is ok for me.
Can I merge?

@ASKabalan ASKabalan requested a review from EiffL July 19, 2024 14:50
@ASKabalan ASKabalan merged commit cc642fb into main Jul 19, 2024
2 checks passed

3 participants