Kirchhoff cuda integrated and tested successfully #617

Amnah2020 · 2024-10-20T13:31:09Z

No description provided.

mrava87 · 2024-11-02T19:32:22Z

@Amnah2020 thanks for your contribution 😄

I have started looking at your code focusing on the case when dynamic=False and I have some general comments/questions plus some specific one that I make directly on the line of code I am unsure about:

the case travsrcrec=False is still available for historical reasons (backward compatibility) but it is not really a good choice given that it creates a big burden on memory (storing a matrix of size nsnrnx*nz instead of 2 matrices of overall size (ns+nr)nxnz. I therefore reccomend that we do not include an option to allow travsrcrec=False in the cuda implementation.
more forward thinking comment: I think you have made a design choice that goes a bit against what we have so far in PyLops when it comes to numba-backed operators. You assume that the input arrays entering your _matvec_call/_rmatvec_call are always numpy arrays and they are converted internally into numba arrays (using the cuda.to_device). In all other methods, we instead assume that the input arrays have been already put on the gpu. So, why is your approach not idea? Take this Kirchhoff operator as example, this a chained operator composed of a main method (e.g., _trav_kirch_matvec) and a convolution operator; now, if you force the input and output to be numpy arrays, the convolve operator will not be able to leverage the extra speed one could get when using the GPU (i.e., applying it to a cupy array). I understand that in large applications one may not be able to put the entire x, trav_src, trav_rec, etc. onto the GPU upfront, and after some thinking I believe I have found an elegant solution. I suggest that we create a new operator called ToCupy which simply lifts a numpy array onto a cupy array in forward mode and does the opposite in adjoint mode. This way we could simply do:

Kop = Kirchhoff(
                z, x, t, sources, recs, v0, 
                wav.astype(dtype), 
                wavc, mode='analytic', 
                engine='cuda', dynamic=dynamic, 
                dtype=dtype)
Top = ToCupy(Kop.dims)
Top1 = ToCupy(Kop.dimsd)
Kop = Top1.H @ Kop @ Top

and get an operator that takes a numpy array as input, moves it to cupy, runs the entire Kirchhoff operator on GPU (including the convolution) and then moves the output back to numpy. This way, we could easily split sources into groups and do:

nsrop = 3 # number of source for each single operator
Kops = []
for isrc in range(0, ns, 3):
    Kop = Kirchhoff(z, x, t, sources[:, isrc:min(isrc + nsrop, ns)], recs, v0, 
                    wav.astype(dtype), 
                    wavc, mode='analytic', 
                    engine='cuda', dynamic=dynamic, 
                    dtype=dtype)
    Top = ToCupy(Kop.dims, dtype=dtype)
    Top1 = ToCupy(Kop.dimsd, dtype=dtype)
    Kop = Top1.H @ Kop @ Top
    Kops.append(Kop)
Kops = VStack(Kops)

Of course, this can be further wrapped into a pylops_mpi.MPIVStack to further distribute groups of sources to multiple MPI ranks...

I have documented most of these points here https://github.com/PyLops/pylops_notebooks/blob/master/developement-cupy/Kirchhoff.ipynb

Please reply to my questions before pushing any code, as I have made a significant amount of changes locally (the notebook will unlikely not work for you...), so based on your replies I can finalize them and push them before we continue finalizing the PR.

PS: once we agree on those, I'll look in more details at the case when dynamic=True and then we can merge the PR.

mrava87 · 2024-11-02T19:43:12Z

pylops/waveeqprocessing/_kirchhoff_cuda.py

+                wrap = current_device.WARP_SIZE
+                multipr_count = current_device.MULTIPROCESSOR_COUNT
+                self.num_threads_per_blocks = (wrap, wrap)
+                self.num_blocks = (multipr_count, multipr_count)


I am unsure about this. For my GPU (RTX 3090), the output of this:

from numba import cuda current_device = cuda.get_current_device() current_device.WARP_SIZE, current_device.MULTIPROCESSOR_COUNT

gives (32,82). This means that indipendently on the number of sources and receivers, you always have a 82x82 grid with 32x32 threads, which in many cases will call many blocks that will never pass the if ns > isrc and nr > irec: condition. Am I getting this wrong?

In this case, each block always has a fixed number of threads, specifically (32, 32). The number of blocks, however, depends on the total number of threads required with a maximum of 82*82 grid. This implementation ensures that no block is allocated unless it will run at least one thread.

However, I need to fix the last else as it does not follow the same approach ..

Not sure I understand your reply. My point is that you code will ALWAYS allocate the same number of blocks no matter the size of the problem.

Take this code:

wrap = current_device.WARP_SIZE multipr_count = current_device.MULTIPROCESSOR_COUNT self.num_threads_per_blocks = (wrap, wrap) self.num_blocks = (multipr_count, multipr_count)

on my GPU (RTX 3090), I get self.num_threads_per_blocks = (32, 32) and self.num_blocks = (82, 82). Now later, you do:

self._trav_kirch_matvec_cuda[self.num_blocks, self.num_threads_per_blocks](x_d, y_d, nsnr_d, nt_d, ni_d, itrav_d, travd_d)

this will ALWAYS call 82x82 blocks no matter how many sources or receivers you have in your problem. So say for 64 sources and 64 receivers, you will have the first 2x2 blocks doing something and all the other running but just never passing this condition if ns > isrc and nr > irec:.

In the new code, i changed this to:

self.num_blocks = ((self.ns + self.num_threads_per_blocks[0] - 1 ) // self.num_threads_per_blocks[0], (self.nr + self.num_threads_per_blocks[1] - 1 ) // self.num_threads_per_blocks[1])

which is actually similar to what you had done for case 4. Here we don't launch any block that will not contribute to the final solution

Thank you for correcting the kernel launch, as I did not follow the same approach as in the other cases.

I missed this part and ended up copying a lot of code because I didn't fork the repo before starting to make changes.

mrava87 · 2024-11-02T19:43:43Z

pylops/waveeqprocessing/_kirchhoff_cuda.py

+    @staticmethod
+    @cuda.jit
+    def _travsrcrec_kirch_rmatvec_cuda(x, y, ns, nr, nt, ni, dt, trav_srcs, trav_recs):
+        irec, isrc = cuda.grid(2)


why this is swapped compared to line 136?

Thank you for noticing! The tests have passed for a uniform number of sources and receivers.

However, I anticipate that if we vary these numbers, the kernel launch and setup may not adequately cover both. So, this line should remain consistent across all 2D grid launch calls and setups. I’ll make sure to enhance it accordingly.

Ok, great! A good reccomandation is to never test codes with forced symmetries as many bugs could pass unnoticed (i sometimes make this mistake myself 😆 )

mrava87 · 2024-11-02T19:44:28Z

pylops/waveeqprocessing/_kirchhoff_cuda.py

+                                                                                        itrav_d, travd_d)
+
+        if not self.dynamic:
+            cuda.synchronize()


Why is this needed since for not self.dynamic you use only the default stream?

for the dynamic case, for each stream, I need to synchronize it with the main stream in a looped sequence. This synchronization happens within self._process_streams().

Sure, but you have if not self.dynamic: so this will never be run in the dynamic case anyways?

mrava87 · 2024-11-02T19:47:50Z

@cako (aka our gpu guru): if you have time, it would be good if you can also have a quick look into this 😄

mrava87 · 2024-11-07T15:57:12Z

@Amnah2020 checking you have seen my review. After you have been able to answer my questions, I will push some minor code modifications and we will be able to progress to the second part of the PR.

Amnah2020 · 2024-11-08T07:34:58Z

I will remove the corresponding section for traverscrec=False.
First, I want to clarify the purpose of cuda.to_device. This function allocates memory on the GPU and transfers a NumPy array to it. While I assume the input will always be a NumPy array, cuda.to_device handles device/host communication, primarily by maintaining a pointer to the data location on the GPU.
- If we decide to transfer all arrays in advance, it would likely follow a similar implementation strategy to the _data_prep_dynamic function.
- However, this approach may not apply to the results of intermediate forward and adjoint operator functions.
I’m not entirely sure what ToCupy does, and I'm also unfamiliar with standard CUDA CuPy functions. Previously, I used custom C++ CuPy functions, which follow a similar strategy to my implementation with Numba CUDA. From what you’ve shared, and based on some research, CuPy enables automatic conversion between NumPy and CuPy arrays, introducing some overhead. In contrast, Numba CUDA requires explicit memory management.
Alternatively, depending on what ToCupy does, cuda.to_device() can accept CuPy arrays as well as NumPy arrays. This suggests that no data transfer would occur; only a pointer transfer (I assume). If we adopt this design, I’ll need a deeper understanding of how memory management works between the two libraries, though it could introduce additional dependencies to maintain.

Maybe you could share the ToCupy function, and I can explore different strategies to leverage it. The goal would be to achieve a similar high-level call structure for full processing of Kirchhoff with multiple sources at once in the device.

mrava87 · 2024-11-08T08:31:15Z

Thanks @Amnah2020, let me push the modifications to the code I already made (later today) so:

I will remove the corresponding section for traverscrec=False.

This will be already gone

First, I want to clarify the purpose of cuda.to_device. This function allocates memory on the GPU and transfers a NumPy array to it. While I assume the input will always be a NumPy array, cuda.to_device handles device/host communication, primarily by maintaining a pointer to the data location on the GPU.

I see your point and I am familiar with the cuda.to_device method of numba, but because of this https://docs.cupy.dev/en/stable/user_guide/interoperability.html#numba, this is not needed if one has already arrays on GPU in cupy type. Numba/cupy handle everything at their interface with a zero-copy operation, so the best of both worlds 😄 It is true that sometimes you may not have already the array on the GPU, but this is where the ToCupy operator comes in, I personally prefer the user to choose and not doing things hidden internally. One issue with your approach is that for chained operators, like Kirchooff, your final operation will move the array back to numpy and the Convolve1d pass will be much slower....

If we decide to transfer all arrays in advance, it would likely follow a similar implementation strategy to the _data_prep_dynamic function.

Let's discuss this after i push the updated cod.

However, this approach may not apply to the results of intermediate forward and adjoint operator functions.

Not sure what you mean?

I’m not entirely sure what ToCupy does, and I'm also unfamiliar with standard CUDA CuPy functions. Previously, I used custom C++ CuPy functions, which follow a similar strategy to my implementation with Numba CUDA. From what you’ve shared, and based on some research, CuPy enables automatic conversion between NumPy and CuPy arrays, introducing some overhead. In contrast, Numba CUDA requires explicit memory management.

No. You can do the same in Cupy or Numba. In either case if the array is on the CPU. In our experience, interoperating numba and cupy does not come with any overhead (and this is how PyLops work). We prefer using cupy for everything GPU related other than custom kernels (which is true, are easier to write in numba-cuda than cupy custom kernels), and I would like to make sure this new code follows the same philosophy to avoid creating too many different coding patterns without a real reason.

Alternatively, depending on what ToCupy does, cuda.to_device() can accept CuPy arrays as well as NumPy arrays. This suggests that no data transfer would occur; only a pointer transfer (I assume). If we adopt this design, I’ll need a deeper understanding of how memory management works between the two libraries, though it could introduce additional dependencies to maintain.

Ok, interesting. My understanding, from the link above, was that cuda.to_device() can be completely omitted if the array is already a cupy array... let's see what @cako thinks. And no new dependencies, cupy is already heavily used in pylops.

Maybe you could share the ToCupy function, and I can explore different strategies to leverage it. The goal would be to achieve a similar high-level call structure for full processing of Kirchhoff with multiple sources at once in the device.

Doing soon :)

mrava87 · 2024-11-08T11:21:37Z

@Amnah2020 codes and notebook updated.

I am also going to open a new Issue explaining in more details the philosophy behind ToCupy, as this is just a use case, but I think many other problems (if implemented on GPU) would require something like this to scale properly. Whilst there I talk about this in general terms, I attach here a figure showing how the current implementation falls short when you chain operators and how the new one can solve this issue.

cako · 2024-11-13T04:31:06Z

Sorry, pretty busy at work, haven't been able to look at this yet :(

mrava87 · 2024-11-13T09:59:08Z

Sorry, pretty busy at work, haven't been able to look at this yet :(

No rush, we can wait for you ;)

cako · 2024-11-16T18:33:26Z

pylops/waveeqprocessing/_kirchhoff_cuda.py

+            streams = [cuda.stream() for _ in range(num_streams)]
+            sources_per_stream = num_sources // num_streams
+            remainder = num_sources % num_streams
+            source_counter = 0
+            self.sources_per_streams = {}
+            for i, stream in enumerate(streams):
+                num_sources_for_stream = sources_per_stream + (
+                    1 if i < remainder else 0
+                )
+                sources_for_stream = list(
+                    range(source_counter, source_counter + num_sources_for_stream)
+                )
+                self.sources_per_streams[stream] = sources_for_stream
+                source_counter += num_sources_for_stream


You can substitute all this by:

# Split sources into different processing streams self.sources_per_streams = { cuda.stream(): sources for sources in np.array_split(range(self.ns), self.nstreams) } # Check streams are unique assert ( len(set(s.handle.value for s in self.sources_per_streams.keys())) == self.nstreams )

cako · 2024-11-16T18:58:47Z

pylops/waveeqprocessing/_kirchhoff_cuda.py

+        aperturetap_d = cuda.to_device(aperturetap)
+        vel_d = cuda.to_device(vel)
+        six_d = cuda.to_device(six)
+        rix_d = cuda.to_device(rix)
+        self.const_inputs = (
+            ns_d,
+            nr_d,
+            nt_d,
+            ni_d,
+            nz_d,
+            dt_d,
+            aperturemin_d,
+            aperturemax_d,
+            angleaperturemin_d,
+            angleaperturemax_d,
+            vel_d,
+            aperturetap_d,
+            six_d,
+            rix_d,
+        )
+
+        self.trav_recs_d_global = cuda.to_device(trav_recs)
+        self.angles_recs_d_global = cuda.to_device(angle_recs)
+        self.trav_srcs_d_global = cuda.to_device(trav_srcs)
+        self.angles_srcs_d_global = cuda.to_device(angle_srcs)
+        self.amp_srcs_d_global = cuda.to_device(amp_srcs)
+        self.amp_recs_d_global = cuda.to_device(amp_recs)


I'm not sure how much overhead these to_device are adding, but if you'd like to make this more efficient, you can queue the memory transfers per stream and use pinned arrays to ensure async. Here's a snippet (untested):

streams = list(self.sources_per_streams.keys()) n = len(streams) with cuda.pinned( aperturetap, vel, six, rix, trav_recs, angle_recs, trav_srcs, angle_srcs, amp_srcs, amp_recs, ): aperturetap_d = cuda.to_device(aperturetap, stream=streams[0 % n]) vel_d = cuda.to_device(vel, stream=streams[1 % n]) six_d = cuda.to_device(six, stream=streams[2 % n]) rix_d = cuda.to_device(rix, stream=streams[3 % n]) self.trav_recs_d_global = cuda.to_device(trav_recs, stream=streams[4 % n]) self.angles_recs_d_global = cuda.to_device( angle_recs, stream=streams[5 % n] ) self.trav_srcs_d_global = cuda.to_device(trav_srcs, stream=streams[6 % n]) self.angles_srcs_d_global = cuda.to_device( angle_srcs, stream=streams[7 % n] ) self.amp_srcs_d_global = cuda.to_device(amp_srcs, stream=streams[8 % n]) self.amp_recs_d_global = cuda.to_device(amp_recs, stream=streams[9 % n]) cuda.synchronize()

cako · 2024-11-16T19:15:11Z

pylops/waveeqprocessing/_kirchhoff_cuda.py

+            y = np.zeros((self.ns * self.nr, self.nt))
+        elif opt == "_rmatvec":
+            y = np.zeros(self.ni)


Maybe these should be zeros(dtype=np.float32)?

cako · 2024-11-16T19:16:49Z

pylops/waveeqprocessing/_kirchhoff_cuda.py

+            x_d = cuda.to_device(x, stream=stream)
+            y_d_dict[stream] = cuda.to_device(y, stream=stream)
+            isrc_list_d_dict[stream] = cuda.to_device(isrc_list, stream=stream)


These should be pinned to allow async:

with cuda.pinned(x, y, isrc_list): x_d = cuda.to_device(x, stream=stream) y_d_dict[stream] = cuda.to_device(y, stream=stream) isrc_list_d_dict[stream] = cuda.to_device(isrc_list, stream=stream)

cako · 2024-11-16T19:18:00Z

pylops/waveeqprocessing/_kirchhoff_cuda.py

+            x_d = cuda.to_device(x, stream=stream)
+            y_d_dict[stream] = cuda.to_device(y, stream=stream)


Are x and y being copied multiple times per stream? Not sure they should...

cako · 2024-11-16T19:19:03Z

pylops/waveeqprocessing/_kirchhoff_cuda.py

+        for stream in self.sources_per_streams.keys():
+            stream.synchronize()


I'm pretty sure this can be substituted with cuda.synchronize, but that would also sync the default stream.

cako · 2024-11-16T19:22:24Z

pylops/waveeqprocessing/_kirchhoff_cuda.py

+            y = np.zeros((self.ns * self.nr, self.nt))
+        elif opt == "_rmatvec":
+            y = np.zeros(self.ni)


You probably want to create pinned memory y arrays to facilitate copy to host. Example:

y = cuda.pinned_array((self.ns * self.nr, self.nt), dtype=np.float32)

cako · 2024-11-16T19:27:25Z

pylops/waveeqprocessing/_kirchhoff_cuda.py

+            stream.synchronize()
+        y_streams = []
+        for stream, y_dev in y_d_dict.items():
+            y_streams.append(y_dev.copy_to_host(stream=stream))


When copying to host, it's better to provide a target array which is pinned. Example:

for stream, y_dev in y_d_dict.items(): ary = cuda.pinned_array_like(y_dev) y_dev.copy_to_host(ary, stream=stream) y_streams.append(ary)

pylops/waveeqprocessing/_kirchhoff_cuda.py

cako · 2024-11-16T22:55:55Z

pylops/waveeqprocessing/_kirchhoff_cuda.py

+                        aptap = (
+                            aptap
+                            * aperturetap[


aptap *= ...

Several lines have the same pattern

cako · 2024-11-17T00:01:36Z

Hey guys, great work! I left some comments, mostly around memory transfer assuming that to_device will be used.

I guess the question is, should CuPy or raw Numba arrays be used? To make sure we're all on the same page, for Numba kernels, it makes no difference. Numba and CuPy arrays implement the CUDA Array Interface so you can just call a cuda-jitted kernel directly on them. Same thing goes for PyTorch (not Jax, I think because they only support the v2 of the interface, not the latest v3).

The difference comes from how memory is managed. With CuPy, memory is managed automatically via Memory Pool. The user requests memory and the memory pool decides when and how to allocate it. Versus Numba which leaves memory allocation entirely on the hands of the user. For higher level libraries, managed memory allocators have historically been favored. CuPy has its own, PyTorch, and RAPIDS. They make it simpler to deal with memory, generally offer better performance and allow the user to set global memory limits (useful when sharing resources).

Personally, I feel like PyLops operators should do as little magic as possible internally, and it should be up to the user to manage memory. Of course, this cannot be done to the extreme, as there may be intermediate steps where memory allocation is required.

In the case of the Kirchhoff operator, the behavior that would make more sense to me is the following: when selecting cuda backend, all input array must be consumable by Numba kernels. They can be CuPy, PyTorch, Numba, etc, and possibly a combination of them. The user should be responsible for allocating them. The return arrays IMO should be CuPy, ensuring that they use the currently active memory pool which is already available to most workflows (since we use CuPy extensively).

mrava87 · 2024-12-05T06:31:46Z

Thanks @cako!

I was waiting for @Amnah2020 to reply, but I'll go ahead and provide my feedback on your comments:

General comment:

Hey guys, great work! I left some comments, mostly around memory transfer assuming that to_device will be used.

I guess the question is, should CuPy or raw Numba arrays be used? To make sure we're all on the same page, for Numba kernels, it makes no difference. Numba and CuPy arrays implement the CUDA Array Interface so you can just call a cuda-jitted kernel directly on them. Same thing goes for PyTorch (not Jax, I think because they only support the v2 of the interface, not the latest v3).

This is my understanding as well from other projects. If you have a cupy array and pass it to a numba-cuda custom kernel call there will be no-copy 'data transfer' between the two interfaces thanks to the CUDA Array Interface, so the user should not do any explicit call to move a cupy array into a numba one. So when we write a kernel that assumes the input to be already provided as cupy array we should not do any data movement - this is already the case in _call_dynamic for example 😄

The difference comes from how memory is managed. With CuPy, memory is managed automatically via Memory Pool. The user requests memory and the memory pool decides when and how to allocate it. Versus Numba which leaves memory allocation entirely on the hands of the user. For higher level libraries, managed memory allocators have historically been favored. CuPy has its own, PyTorch, and RAPIDS. They make it simpler to deal with memory, generally offer better performance and allow the user to set global memory limits (useful when sharing resources).

Makes sense to me, and indeed I would favour managed memory allocation if/when that does not bring any downside as it simplifies the user experience.

Personally, I feel like PyLops operators should do as little magic as possible internally, and it should be up to the user to manage memory. Of course, this cannot be done to the extreme, as there may be intermediate steps where memory allocation is required.

Agree.

In the case of the Kirchhoff operator, the behavior that would make more sense to me is the following: when selecting cuda backend, all input array must be consumable by Numba kernels. They can be CuPy, PyTorch, Numba, etc, and possibly a combination of them. The user should be responsible for allocating them. The return arrays IMO should be CuPy, ensuring that they use the currently active memory pool which is already available to most workflows (since we use CuPy extensively).

I also agree with this in general, but with a caveat. Let's take the non-dynamic codes first. The way I have adapted them in the PR fully embrace this philosophy. When the problem is small / GPU VRAM is large enough, the user can lift x/y to the GPU and then pass them to the matvec/rmatvec calls. And this is pretty much in line with any other PyLops operator GPU implementation. In general, we also give user the responsability to pass any additional parameter already as CuPy array (eg wavelet). However, in this specific case, the __init__ method may compute the traveltime tables (if not explicitely provided) and there I find it a bit akward to ask the user to do something like this:

Kop = Kirchhoff(...)
Kop.trav_srcs = cp.asarray(trav_srcs)
Kop.trav_recs = cp.asarray(trav_recs)

So line 559 trav_srcs_d = to_cupy(inputs[7]), which I acknowledge does not falls into the category little magic as possible internally... what do you think?

Now to the dynamic case. The way the code is currently set up departs a bit from the way the non-dynamic code is set up as it uses streams... I think the main motivation here is that since you have more tables (traveltimes, amplitudes, angles), it is likely that even in small examples one cannot just lift all of them in one go to the GPU VRAM, and splitting them with stream could/should allow some overlap between data movements and computations... now my question to you: do you see this as a viable solution to the point we accept to do some magic internally or not?

My dream (not sure how to achieve this yet) would however be to lift this one level up, perhaps inside the ToCupy operator, so that one could run multiple GPU operators in a VStack/BlockDiag on different streams instead of embedding that logic in each single operator... but I think we can be pragmatic and finalize this PR as is and then if we achieve the dream solution, we can adapt the dynamic code to use it.

Specific comments: they all make sense to me and some seem to suggest that we can simplify some of the codebase a lot.. I'll leave it to Amnah to look at them and incorporate them before we can merge this PR.

PS: I removed all codes that I had added for the ToCupy operator as this has been already merged in a different PR... and didn't really need to be here

Amnah2020 added 3 commits October 20, 2024 16:27

Kirchhoff cuda integrated and tested successfully

8a3f151

removed nvtx from kirchhoff_cuda helper, make tests passed with warning.

0524e0e

added comments in kirchhoff_cuda helper

a7aca80

mrava87 reviewed Nov 2, 2024

View reviewed changes

mrava87 requested a review from cako November 2, 2024 19:46

mrava87 added 2 commits November 8, 2024 13:50

feature: added ToCupy operator

fbdbcfb

minor: cleanup of kirchhoff and _kirchhoff_cuda

de47856

mrava87 mentioned this pull request Nov 8, 2024

ToCupy operator #620

Closed

mrava87 added 3 commits November 8, 2024 14:33

ci: python3.10 in azurepipelines

708a3f4

minor: restructured some internals of _kirchhoff_cuda

801c971

ci: python3.11 in azurepipelines

2eef02e

cako reviewed Nov 16, 2024

View reviewed changes

pylops/waveeqprocessing/_kirchhoff_cuda.py Show resolved Hide resolved

cako reviewed Nov 16, 2024

View reviewed changes

mrava87 added 4 commits December 4, 2024 16:57

minor: removed tocupy as already merged in previous PR

0a25fa2

minor: removed spaces

ea98081

minor: removed functions in backend

b8d4ee3

minor: removed commented code

9087959

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Kirchhoff cuda integrated and tested successfully #617

Kirchhoff cuda integrated and tested successfully #617

Amnah2020 commented Oct 20, 2024

mrava87 commented Nov 2, 2024 •

edited

Loading

mrava87 Nov 2, 2024

Amnah2020 Nov 8, 2024 •

edited

Loading

mrava87 Nov 8, 2024

Amnah2020 Nov 13, 2024

mrava87 Nov 2, 2024

Amnah2020 Nov 8, 2024

mrava87 Nov 8, 2024

mrava87 Nov 2, 2024

Amnah2020 Nov 8, 2024

mrava87 Nov 8, 2024

Amnah2020 Nov 13, 2024

mrava87 commented Nov 2, 2024

mrava87 commented Nov 7, 2024

Amnah2020 commented Nov 8, 2024 •

edited

Loading

mrava87 commented Nov 8, 2024 •

edited

Loading

mrava87 commented Nov 8, 2024 •

edited

Loading

cako commented Nov 13, 2024

mrava87 commented Nov 13, 2024

cako Nov 16, 2024 •

edited

Loading

cako Nov 16, 2024

cako Nov 16, 2024

cako Nov 16, 2024

cako Nov 16, 2024

cako Nov 16, 2024

cako Nov 16, 2024 •

edited

Loading

cako Nov 16, 2024 •

edited

Loading

cako Nov 16, 2024

cako commented Nov 17, 2024

mrava87 commented Dec 5, 2024 •

edited

Loading

		x_d = cuda.to_device(x, stream=stream)
		y_d_dict[stream] = cuda.to_device(y, stream=stream)

		for stream in self.sources_per_streams.keys():
		stream.synchronize()

Kirchhoff cuda integrated and tested successfully #617

Are you sure you want to change the base?

Kirchhoff cuda integrated and tested successfully #617

Conversation

Amnah2020 commented Oct 20, 2024

mrava87 commented Nov 2, 2024 • edited Loading

Choose a reason for hiding this comment

Amnah2020 Nov 8, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mrava87 commented Nov 2, 2024

mrava87 commented Nov 7, 2024

Amnah2020 commented Nov 8, 2024 • edited Loading

mrava87 commented Nov 8, 2024 • edited Loading

mrava87 commented Nov 8, 2024 • edited Loading

cako commented Nov 13, 2024

mrava87 commented Nov 13, 2024

cako Nov 16, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cako Nov 16, 2024 • edited Loading

Choose a reason for hiding this comment

cako Nov 16, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cako commented Nov 17, 2024

mrava87 commented Dec 5, 2024 • edited Loading

mrava87 commented Nov 2, 2024 •

edited

Loading

Amnah2020 Nov 8, 2024 •

edited

Loading

Amnah2020 commented Nov 8, 2024 •

edited

Loading

mrava87 commented Nov 8, 2024 •

edited

Loading

mrava87 commented Nov 8, 2024 •

edited

Loading

cako Nov 16, 2024 •

edited

Loading

cako Nov 16, 2024 •

edited

Loading

cako Nov 16, 2024 •

edited

Loading

mrava87 commented Dec 5, 2024 •

edited

Loading