
Bluestein Algorithm #140

Closed · wants to merge 76 commits into from

Conversation

@AD2605 (Contributor) commented Feb 6, 2024

Adds support for sizes that contain large prime factors, for both the INTERLEAVED_COMPLEX and SPLIT_COMPLEX layouts.

Checklist

Tick if relevant:

  • New files have a copyright
  • New headers have include guards
  • API is documented with Doxygen
  • New functionalities are tested
  • Tests pass locally
  • Files are clang-formatted

@AD2605 AD2605 marked this pull request as draft February 6, 2024 10:57
@hjabird (Contributor) left a comment:

This is an incomplete review, since I know this is a draft and you said that some things in the dispatch logic will change.

src/portfft/common/bluestein.hpp (outdated; resolved)
* @tparam T Scalar Type
* @param ptr Host Pointer containing the load/store modifiers.
* @param committed_size original problem size
* @param dimension_size padded size
Contributor:

If padded size is the best description for dimension_size, why isn't the variable called padded_size?

Contributor:

Whichever name is better, this documentation should be improved.

* @param fft_size fft size
*/
template <typename T>
void naive_dft(std::complex<T>* input, std::complex<T>* output, IdxGlobal fft_size) {
Contributor:

How big a DFT do we expect to compute with this? Can we use portFFT's SYCL implementation here?

Author:

Any size FFT is acceptable here; any reason you ask? If you are worried about the time to compute, I can replace this with a CT.
We cannot use our kernels here; it would require JIT-compiling a new kernel, and at that point a host-side CT (which I assume this will become eventually) would be faster.

Contributor:

Given the O(n^2) scaling of naive DFTs, we should be careful that calling this function does not become more expensive than the actual DFT we are trying to compute.

I also don't especially like the notion that we'd need a host implementation of CT; we're trying to write a GPU implementation, not a CPU implementation!

Comment on lines 43 to 44
ctype multiplier = ctype(static_cast<T>(std::cos((-2 * M_PI * i * j) / static_cast<double>(fft_size))),
static_cast<T>(std::sin((-2 * M_PI * i * j) / static_cast<double>(fft_size))));
Contributor:

  • M_PI is not standard C++.
  • Consider cospi and sinpi.

*/
template <typename T>
void naive_dft(std::complex<T>* input, std::complex<T>* output, IdxGlobal fft_size) {
using ctype = std::complex<T>;
Contributor:

ctype sounds special in C++. Consider complex_t.

src/portfft/committed_descriptor_impl.hpp (resolved)
src/portfft/committed_descriptor_impl.hpp (outdated; resolved)
@@ -322,7 +316,7 @@ class committed_descriptor_impl {
* vector of kernel ids, factors
*/
template <Idx SubgroupSize>
std::tuple<detail::level, kernel_ids_and_metadata_t> prepare_implementation(std::size_t kernel_num) {
std::tuple<detail::level, std::size_t, kernel_ids_and_metadata_t> prepare_implementation(std::size_t kernel_num) {
Contributor:

The tuple's new addition is undocumented. I'd still love this to be its own class.

return {detail::level::GLOBAL, param_vec};
if (!detail::factorize_input(fft_size, check_and_select_target_level)) {
param_vec.clear();
fft_size = static_cast<IdxGlobal>(std::pow(2, ceil(log(static_cast<double>(fft_size)) / log(2.0))));
Contributor:

I think using floating-point arithmetic on integer data is a mistake, especially when we have an integer log2 function here that you could adapt.

Contributor:

this operation could also be a named function like round_up_to_pow_2.

Comment on lines 429 to 430
detail::factorize_input(fft_size, check_and_select_target_level);
detail::factorize_input(fft_size, check_and_select_target_level);
Contributor:

Why is this called twice?

Author:

I have provided the flexibility of having different numbers of num_forward_factors (the number of factors during the forward FFT phase) and num_backward_factors (the number of factors during the backward phase of Bluestein), for the following reasons:

  1. I will soon be removing the use of local memory for the load/store modifiers and replacing it with single-transaction vector loads from global memory. This will reduce the local memory requirements, and thus reduce the number of factors, which in turn will reduce the trips to global memory and the number of transpositions. This is also why I have not used local memory for load modifiers in the subgroup implementation.

  2. However, I have still kept the option of using local memory, because some architectures do not support coalesced vector loads. In that case we would need to use local memory, and thus the number of factors would increase, as the factor size is driven by local memory usage. In Bluestein, only the first kernel of the forward phase requires a load modifier (which would go into local memory in this case); the backward phase does not. It is therefore of interest to reduce the number of factors in the backward phase (fewer transpositions and fewer trips to global memory, hence better performance), so the number of backward factors should be smaller than the number of forward factors.

  3. Keeping point 2 in mind, I have written the code throughout with the assumption that num_forward_factors != num_backward_factors, so whenever we switch to using local memory only on some hardware, nothing would need to change except the factorization. For completeness, have a look at how factorization was done when the de facto way was to use local memory for load modifiers, focusing on the boolean at the end. So to answer your question: it is called twice to provide the flexibility of having different numbers of forward and backward factors.

Contributor:

I think point 1 is the sort of big design discussion we should have as a team before much code is written.

I guess there is some state associated with check_and_select_target_level that means it does something different after each call?
This isn't immediately clear. Maybe a good start is explicitly defining the lambda captures?
Where possible, I find pure functions easier to follow.

@AD2605 (Author) commented Feb 7, 2024:

> I guess there is some state associated with check_and_select_target_level that means it does something different after each call?
> This isn't immediately clear. Maybe a good start is explicitly defining the lambda captures?
> Where possible, I find pure functions easier to follow.

There is no state associated with check_and_select_target_impl. Each call to that function is independent of the previous one. I very strongly prefer lambdas in this case, as this is the only place the function is required.

> I think point 1 is the sort of big design discussion we should have as a team before much code is written.

I would say that this does not subtract or change any existing logic, nor permanently change things. It assumes the general case, avoids strong assumptions, and keeps the flexibility to go the other way if required, so I suppose it is fine?

Contributor:

I agree that factorization for global is a really confusing piece of code, which should be refactored. The confusing part here is that factorize_input calls check_and_select_target_level, which appends factors to param_vec, so calling it twice here actually makes sense.

But that has not been introduced in this PR, so it makes sense to leave refactoring for a separate task.

Contributor:

I don't like this because

  1. It really looks like a bug.
  2. It is hard to follow.

Would the following be accurate?

Suggested change
detail::factorize_input(fft_size, check_and_select_target_level);
detail::factorize_input(fft_size, check_and_select_target_level);
// Forward DFT within Bluestein implementation (pre convolution).
detail::factorize_input(fft_size, check_and_select_target_level);
// Backward DFT within Bluestein implementation (post convolution).
detail::factorize_input(fft_size, check_and_select_target_level);

Contributor:

> There is no state associated with check_and_select_target_impl

check_and_select_target_impl modifies param_vec. When I mentioned pure functions, I wasn't trying to say there is anything wrong with lambdas, just with unexpected side effects.

Comment on lines +177 to +181
/**
* Tuple of the level, an input kernel_bundle, and factors pertaining to each factor of the committed size
*/
using input_bundles_and_metadata_t =
std::tuple<detail::level, sycl::kernel_bundle<sycl::bundle_state::input>, std::vector<Idx>>;
Contributor:

When you need a comment describing a tuple's contents, it might be time to create a POD struct.

bool is_compatible = true;
for (auto [level, ids, factors] : prepared_vec) {
std::size_t temp = 1;
Contributor:

temp could probably have a better name

}

Idx num_backward_factors = static_cast<Idx>(prepared_vec.size()) - num_forward_factors;
bool is_prime = dimension_size == params.lengths[dimension_num] ? false : true;
Contributor:

Suggested change
bool is_prime = dimension_size == params.lengths[dimension_num] ? false : true;
bool is_prime = dimension_size != params.lengths[dimension_num];

What does is_prime mean? Surely we are more concerned with large prime factors than primes in general. Maybe something like "is_padded" would be more appropriate.

src/portfft/committed_descriptor_impl.hpp (resolved)
@AD2605 AD2605 marked this pull request as ready for review February 7, 2024 10:27

Comment on lines 568 to 573
detail::complex_conjugate conjugate_on_load;
detail::complex_conjugate conjugate_on_store;
detail::elementwise_multiply multiply_on_load;
detail::elementwise_multiply multiply_on_store;
detail::apply_scale_factor scale_factor_applied;

Contributor:

Suggested change
detail::complex_conjugate conjugate_on_load;
detail::complex_conjugate conjugate_on_store;
detail::elementwise_multiply multiply_on_load;
detail::elementwise_multiply multiply_on_store;
detail::apply_scale_factor scale_factor_applied;

These could just be declared in the loop since you redefine them anyway

src/portfft/committed_descriptor_impl.hpp (resolved)
try {
PORTFFT_LOG_TRACE("Building kernel bundle with subgroup size", SubgroupSize);
Contributor:

Did you mean to delete this log line?

}
if (is_compatible) {
return result;
if (!is_compatible) {
Contributor:

is_compatible isn't needed; just return nullopt immediately where it is set to false.

Comment on lines 53 to 68
std::vector<sycl::event> compute_level(const typename committed_descriptor_impl<Scalar, Domain>::kernel_data_struct&,
const TIn&, Scalar*, const TIn&, Scalar*, const Scalar*, const Scalar*,
const Scalar*, const IdxGlobal*, IdxGlobal, IdxGlobal, Idx, IdxGlobal, IdxGlobal,
Idx, Idx, complex_storage, const std::vector<sycl::event>&, sycl::queue&);

template <typename Scalar, domain Domain, typename TOut>
sycl::event transpose_level(const typename committed_descriptor_impl<Scalar, Domain>::kernel_data_struct& kd_struct,
const Scalar* input, TOut output, const IdxGlobal* factors_triple, IdxGlobal committed_size,
Idx num_batches_in_l2, IdxGlobal n_transforms, IdxGlobal batch_start, Idx total_factors,
IdxGlobal output_offset, sycl::queue& queue, const std::vector<sycl::event>& events,
complex_storage storage);
sycl::event transpose_level(const typename committed_descriptor_impl<Scalar, Domain>::kernel_data_struct&,
const Scalar*, TOut, const IdxGlobal*, IdxGlobal, Idx, IdxGlobal, IdxGlobal, Idx, IdxGlobal,
sycl::queue&, const std::vector<sycl::event>&, complex_storage);

template <Idx, typename Scalar, domain Domain, typename TIn, typename TOut>
sycl::event global_impl_driver(const TIn&, const TIn&, TOut, TOut, committed_descriptor_impl<Scalar, Domain>&,
typename committed_descriptor_impl<Scalar, Domain>::dimension_struct&,
const kernels_vec<Scalar, Domain>&, const kernels_vec<Scalar, Domain>&, Idx, IdxGlobal,
IdxGlobal, std::size_t, std::size_t, IdxGlobal, IdxGlobal, IdxGlobal, complex_storage,
detail::elementwise_multiply, const Scalar*);
Contributor:

I find it more readable to have argument names here as well, especially as each function accepts many arguments of the same type.

Comment on lines 157 to 174
friend std::vector<sycl::event> compute_level(
const typename committed_descriptor_impl<Scalar1, Domain1>::kernel_data_struct&, const TIn&, Scalar1*, const TIn&,
Scalar1*, const Scalar1*, const Scalar1*, const Scalar1*, const IdxGlobal*, IdxGlobal, IdxGlobal, Idx, IdxGlobal,
IdxGlobal, Idx, Idx, complex_storage, const std::vector<sycl::event>&, sycl::queue&);

template <typename Scalar1, domain Domain1, typename TOut>
friend sycl::event detail::transpose_level(
const typename committed_descriptor_impl<Scalar1, Domain1>::kernel_data_struct& kd_struct, const Scalar1* input,
TOut output, const IdxGlobal* factors_triple, IdxGlobal committed_size, Idx num_batches_in_l2,
IdxGlobal n_transforms, IdxGlobal batch_start, Idx total_factors, IdxGlobal output_offset, sycl::queue& queue,
const std::vector<sycl::event>& events, complex_storage storage);
const typename committed_descriptor_impl<Scalar1, Domain1>::kernel_data_struct&, const Scalar1*, TOut,
const IdxGlobal*, IdxGlobal, Idx, IdxGlobal, IdxGlobal, Idx, IdxGlobal, sycl::queue&,
const std::vector<sycl::event>&, complex_storage);

template <Idx, typename Scalar1, domain Domain1, typename TIn, typename TOut>
friend sycl::event global_impl_driver(const TIn&, const TIn&, TOut, TOut,
committed_descriptor_impl<Scalar1, Domain1>&,
typename committed_descriptor_impl<Scalar1, Domain1>::dimension_struct&,
const kernels_vec<Scalar1, Domain1>&, const kernels_vec<Scalar1, Domain1>&, Idx,
IdxGlobal, IdxGlobal, std::size_t, std::size_t, IdxGlobal, IdxGlobal, IdxGlobal,
complex_storage, detail::elementwise_multiply, const Scalar1*);
Contributor:

same here

@@ -215,17 +229,29 @@ class committed_descriptor_impl {
std::shared_ptr<IdxGlobal> factors_and_scan;
detail::level level;
std::size_t length;
std::size_t committed_length;
Contributor:

What is committed length and how does it differ from just length?

Author:

committed_length is the length that was committed for that dimension. The problem size needn't be the size that was committed, hence the two variables. Notice how prepare_implementation now returns the problem size as well (here)

Contributor:

In which case does it differ from length?

Author:

Prime-sized inputs, where we need to pad (Bluestein); and in the future, Rader FFTs, where this will become committed_length - 1.

Contributor:

In that case I think the variable name could be improved. length is what the user commits, and that is what I think of when I read committed_length. How about impl_ct_length? Also add a comment explaining when this differs from length.

return {detail::level::GLOBAL, param_vec};
if (!detail::factorize_input(fft_size, check_and_select_target_level)) {
param_vec.clear();
fft_size = static_cast<IdxGlobal>(std::pow(2, ceil(log(static_cast<double>(fft_size)) / log(2.0))));
Contributor:

I dislike changing this variable in place: the FFT size is not changing, but you are using the same variable for something else. Use a new variable, rounded_up_fft_size maybe?


* @return sycl::event waiting on the last transposes
*/
template <Idx SubgroupSize, typename Scalar, domain Domain, typename TIn, typename TOut>
sycl::event global_impl_driver(const TIn& input, const TIn& input_imag, TOut output, TOut output_imag,
Contributor:

I do not like putting "driver" in function names. It is such an overloaded term that it makes the name less clear: there are GPU drivers, compiler drivers... Can you think of another name?

Author:

I thought about this, and I genuinely cannot come up with a better name; I feel the driver suffix is apt because this function launches the compute and transpose kernels, manages dependencies, and increments pointers per level. It "drives" the global implementation.

Contributor:

How about just global_impl? I know this is a bit of nitpicking so feel free to ignore if you disagree.

src/portfft/common/global.hpp (resolved)
Comment on lines 119 to 126
auto in_acc_or_usm = get_access(input, cgh);
auto out_acc_or_usm = get_access(output, cgh);
cgh.host_task([&]() {
for (std::size_t i = 0; i < num_copies; i++) {
events.push_back(queue.copy(&in_acc_or_usm[0] + i * src_stride + input_offset,
&out_acc_or_usm[0] + i * dst_stride + output_offset, num_elements_to_copy));
}
});
Contributor:

This is undefined behavior for buffers. queue.copy is only defined for USM pointers, NOT pointers to buffer memory. However there are queue.copy overloads that accept accessors.

Author:

I did not know that, thanks for letting me know. I will look into it.

}

template <typename Scalar, domain Domain>
template <typename Dummy>
struct committed_descriptor_impl<Scalar, Domain>::calculate_twiddles_struct::inner<detail::level::GLOBAL, Dummy> {
static Scalar* execute(committed_descriptor_impl& desc, dimension_struct& /*dimension_data*/,
static Scalar* execute(committed_descriptor_impl& desc, dimension_struct& dimension_data,
Contributor:

Why did twiddle calculation need to change?

Author:

To calculate the twiddles for both forward and backward factors, I have refactored it a bit to avoid code duplication.

Comment on lines +184 to +191
if (ptr != nullptr) {
return std::shared_ptr<T>(ptr, [captured_queue = queue](T* ptr) {
if (ptr != nullptr) {
sycl::free(ptr, captured_queue);
}
});
}
throw internal_error("Could not allocate usm memory of size: ", size * sizeof(T), " bytes");
Contributor:

Suggested change
if (ptr != nullptr) {
return std::shared_ptr<T>(ptr, [captured_queue = queue](T* ptr) {
if (ptr != nullptr) {
sycl::free(ptr, captured_queue);
}
});
}
throw internal_error("Could not allocate usm memory of size: ", size * sizeof(T), " bytes");
if (ptr == nullptr) {
throw internal_error("Could not allocate usm memory of size: ", size * sizeof(T), " bytes");
}
return std::shared_ptr<T>(ptr, [captured_queue = queue](T* ptr) {
if (ptr != nullptr) {
sycl::free(ptr, captured_queue);
}
});

This would look better with less nesting.

Comment on lines +219 to +221
if (conjugate_on_store == detail::complex_conjugate::APPLIED) {
conjugate_inplace(priv, fft_size);
}
Contributor:

Maybe you should always conjugate before the logging or always after the logging. I realize that we had one before and one after already, but both before makes more sense to me.

@AD2605 AD2605 marked this pull request as draft February 20, 2024 11:10
@AD2605 (Author) commented Mar 5, 2024

The current implementation does not handle small prime sizes. It is possible to handle small primes by extending the current changes without actually implementing subgroup- and workgroup-level Bluestein (which is being handled in this branch: https://github.com/codeplaysoftware/portFFT/tree/atharva/sg_wg_bluestein). Since the scope of this branch is to handle all prime sizes, I will close this pull request and re-open it once all prime-sized values are handled.

@AD2605 AD2605 closed this Mar 5, 2024