Bluestein Algorithm #140
Conversation
…tors, add transposes for backward factors
This reverts commit 4cf6ca7.
…to atharva/bluestein
This is an incomplete review, since I know this is a draft and you said that some things in the dispatch logic will change.
* @tparam T Scalar Type
* @param ptr Host Pointer containing the load/store modifiers.
* @param committed_size original problem size
* @param dimension_size padded size
If "padded size" is the best description for dimension_size, why isn't the variable called padded_size?
Whichever name is better, this documentation should be improved.
src/portfft/common/host_fft.hpp
Outdated
* @param fft_size fft size
*/
template <typename T>
void naive_dft(std::complex<T>* input, std::complex<T>* output, IdxGlobal fft_size) {
How big a DFT do we expect to compute with this? Can we use portFFT's SYCL implementation here?
FFTs of any size are acceptable; why do you ask? In case you are worried about the time to compute, I can replace this with a CT.
We cannot use our kernels here; it would require jitting a new kernel, at which point a host-side CT (which I assume this will become eventually) would be faster.
With the scaling of naive DFTs, we should be careful that calling this function does not become more expensive than the actual DFT we are trying to compute.
I also don't especially like the notion that we'd need a host implementation of CT - we're trying to write a GPU implementation, not a CPU implementation!
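For reference, the scaling concern is about the O(n^2) inner product a direct DFT performs. A minimal host-side sketch of the same idea (standalone, with std::int64_t standing in for portFFT's IdxGlobal) looks like:

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstdint>

// Minimal O(n^2) direct DFT sketch; accumulates in double for accuracy.
// std::int64_t is a stand-in here for portFFT's IdxGlobal type.
template <typename T>
void naive_dft_sketch(const std::complex<T>* input, std::complex<T>* output, std::int64_t fft_size) {
  const double pi = std::acos(-1.0);
  for (std::int64_t k = 0; k < fft_size; ++k) {
    std::complex<double> acc{0.0, 0.0};
    for (std::int64_t n = 0; n < fft_size; ++n) {
      const double angle = -2.0 * pi * static_cast<double>(k * n) / static_cast<double>(fft_size);
      acc += std::complex<double>(input[n]) * std::complex<double>(std::cos(angle), std::sin(angle));
    }
    output[k] = std::complex<T>(acc);
  }
}
```

The doubly nested loop is where the quadratic cost comes from, which is why the comparison against the size of the actual DFT being planned matters.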
src/portfft/common/host_fft.hpp
Outdated
ctype multiplier = ctype(static_cast<T>(std::cos((-2 * M_PI * i * j) / static_cast<double>(fft_size))),
static_cast<T>(std::sin((-2 * M_PI * i * j) / static_cast<double>(fft_size))));
M_PI is not C++. Consider cospi and sinpi.
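As an illustration of avoiding the POSIX-only M_PI macro on the host (in SYCL device code, sycl::cospi/sycl::sinpi would be the natural fit), a portable twiddle helper might look like this; the function name and signature are illustrative, not portFFT's API:

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Portable twiddle computation without relying on the POSIX M_PI macro.
// i, j, fft_size mirror the loop indices in the snippet under review.
template <typename T>
std::complex<T> twiddle(long long i, long long j, long long fft_size) {
  const double pi = std::acos(-1.0);  // pi computed portably
  const double angle = (-2.0 * pi * static_cast<double>(i * j)) / static_cast<double>(fft_size);
  return {static_cast<T>(std::cos(angle)), static_cast<T>(std::sin(angle))};
}
```

With C++20, std::numbers::pi would be an even cleaner replacement for the acos(-1.0) idiom.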
src/portfft/common/host_fft.hpp
Outdated
*/
template <typename T>
void naive_dft(std::complex<T>* input, std::complex<T>* output, IdxGlobal fft_size) {
using ctype = std::complex<T>;
ctype sounds special in C++. Consider complex_t.
@@ -322,7 +316,7 @@ class committed_descriptor_impl {
* vector of kernel ids, factors
*/
template <Idx SubgroupSize>
std::tuple<detail::level, kernel_ids_and_metadata_t> prepare_implementation(std::size_t kernel_num) {
std::tuple<detail::level, std::size_t, kernel_ids_and_metadata_t> prepare_implementation(std::size_t kernel_num) {
The tuple's new addition is undocumented. I'd still love this to be its own class.
return {detail::level::GLOBAL, param_vec};
if (!detail::factorize_input(fft_size, check_and_select_target_level)) {
param_vec.clear();
fft_size = static_cast<IdxGlobal>(std::pow(2, ceil(log(static_cast<double>(fft_size)) / log(2.0))));
I think using floating-point arithmetic on integer data is a mistake, especially when we have an integer log2 function here that you could adapt.
This operation could also be a named function like round_up_to_pow_2.
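For illustration, an integer-only version of the suggested helper (taking the name round_up_to_pow_2 from the comment above; this is a sketch, not portFFT code) avoids the pow/log/ceil round-trip through floating point entirely:

```cpp
#include <cassert>
#include <cstdint>

// Round n up to the next power of two using integer arithmetic only.
inline std::uint64_t round_up_to_pow_2(std::uint64_t n) {
  std::uint64_t result = 1;
  while (result < n) {
    result <<= 1;  // double until we reach or exceed n
  }
  return result;
}
```

Besides readability, this sidesteps any risk of floating-point rounding returning the wrong power for large inputs.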
detail::factorize_input(fft_size, check_and_select_target_level);
detail::factorize_input(fft_size, check_and_select_target_level);
Why is this called twice?
I have provided the flexibility of having a different number of forward factors (num_forward_factors, the number of factors during the forward FFT phase) and backward factors (num_backward_factors, the number of factors during the backward phase of Bluestein), and the following is why:
1. I will soon be removing the use of local memory for the load/store modifiers and replacing it with single-transaction vector loads from global memory. This will reduce the local memory requirements, and thus in turn reduce the number of factors, which in turn will reduce the trips to global memory and the number of transpositions. This is also the reason why I have not used local memory for load modifiers in the subgroup implementation.
2. However, I have still kept the option of using local memory, because some architectures do not support coalesced vector loads. In that case we would need to use local memory, and thus the number of factors would increase, as the factor size is driven by the local memory usage. In Bluestein, only the first kernel during the forward phase requires a load modifier (which would go into local memory in this case); the backward phase does not. It is therefore of interest to reduce the number of factors in the backward phase (fewer factors means fewer transpositions and fewer trips to global memory, and thus better performance), hence the number of backward factors should be smaller than the number of forward factors.
3. Keeping point 2 in mind, throughout the code I have written with the assumption that num_forward_factors != num_backward_factors, so whenever we make the change to use local memory only on some hardware, nothing would need to change except the factorization. For completeness, have a look at how factorization was being done when the de facto way was to use local memory for load modifiers here, focusing on the boolean at the end.
So to answer your question, it is called twice so as to provide the flexibility of having different numbers of forward and backward factors.
I think point 1 is the sort of big design discussion we should have as a team before much code is written.
I guess there is some state associated with check_and_select_target_level that means it does something different after each call?
This isn't immediately clear. Maybe a good start is explicitly defining the lambda captures?
Where possible, I find pure functions easier to follow.
I guess there is some state associated with check_and_select_target_level that means it does something different after each call?
This isn't immediately clear. Maybe a good start is explicitly defining the lambda captures?
Where possible, I find pure functions easier to follow.
There is no state associated with check_and_select_target_impl. Each call to that function is independent of the previous one. I very strongly prefer lambdas in this case, as this is the only place the function is required.
I think point 1 is the sort of big design discussion we should have as a team before much code is written.
I would say that this does not subtract or change any logic, or permanently change things. It assumes a general case, does not make any strong assumptions, and provides flexibility to go the other way if required, so I suppose it is fine?
I agree that factorization for global is a really confusing piece of code, which should be refactored. The confusing part here is that factorize_input calls check_and_select_target_level, which appends factors to param_vec, so calling it twice here actually makes sense.
But that has not been introduced in this PR, so it makes sense to leave refactoring for a separate task.
I don't like this because
- It really looks like a bug.
- It is hard to follow.
Would the following be accurate?
// Forward DFT within Bluestein implementation (pre convolution).
detail::factorize_input(fft_size, check_and_select_target_level);
// Backward DFT within Bluestein implementation (post convolution).
detail::factorize_input(fft_size, check_and_select_target_level);
| There is no state associated with check_and_select_target_impl

check_and_select_target_impl modifies param_vec. When I mentioned pure functions, I wasn't trying to say there is anything wrong with lambdas, just unexpected side effects.
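To illustrate the side-effect point being made here, a sketch of a factorization helper in the pure style (the name and signature are made up for this example, not portFFT's API) returns its factors instead of appending to a captured vector, so two calls cannot interact through hidden state:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical side-effect-free factorization: returns the prime factors
// of n rather than mutating a captured param_vec, so each call stands alone.
inline std::vector<std::int64_t> prime_factors(std::int64_t n) {
  std::vector<std::int64_t> factors;
  for (std::int64_t f = 2; f * f <= n; ++f) {
    while (n % f == 0) {
      factors.push_back(f);
      n /= f;
    }
  }
  if (n > 1) {
    factors.push_back(n);  // remaining prime cofactor
  }
  return factors;
}
```

With this shape, calling the function twice for the forward and backward phases would make the two result vectors explicit at the call site.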
/**
* Tuple of the level, an input kernel_bundle, and factors pertaining to each factor of the committed size
*/
using input_bundles_and_metadata_t =
std::tuple<detail::level, sycl::kernel_bundle<sycl::bundle_state::input>, std::vector<Idx>>;
When you create a comment describing a tuple's contents, it might be time to create a POD struct.
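As a sketch of that suggestion (field and type names are illustrative; a std::string stands in for the sycl::kernel_bundle so the example is self-contained):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical POD replacing the tuple: each field is named, so call sites
// no longer need std::get<N> or a doc comment to explain the contents.
// `level` and `kernel_bundle` stand in for detail::level and
// sycl::kernel_bundle<sycl::bundle_state::input>.
struct input_bundle_and_metadata {
  int level;
  std::string kernel_bundle;
  std::vector<int> factors;
};
```

Aggregate initialization keeps construction as terse as the tuple version while making reads self-documenting.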
bool is_compatible = true;
for (auto [level, ids, factors] : prepared_vec) {
std::size_t temp = 1;
temp could probably have a better name.
}

Idx num_backward_factors = static_cast<Idx>(prepared_vec.size()) - num_forward_factors;
bool is_prime = dimension_size == params.lengths[dimension_num] ? false : true;
bool is_prime = dimension_size != params.lengths[dimension_num];

What does is_prime mean? Surely we are more concerned with large prime factors than primes in general. Maybe something like "is_padded" would be more appropriate.
detail::complex_conjugate conjugate_on_load;
detail::complex_conjugate conjugate_on_store;
detail::elementwise_multiply multiply_on_load;
detail::elementwise_multiply multiply_on_store;
detail::apply_scale_factor scale_factor_applied;
These could just be declared in the loop, since you redefine them anyway.
try {
PORTFFT_LOG_TRACE("Building kernel bundle with subgroup size", SubgroupSize);
Did you mean to delete this log line?
}
if (is_compatible) {
return result;
if (!is_compatible) {
is_compatible isn't needed; just return nullopt immediately where it is set to false.
std::vector<sycl::event> compute_level(const typename committed_descriptor_impl<Scalar, Domain>::kernel_data_struct&,
const TIn&, Scalar*, const TIn&, Scalar*, const Scalar*, const Scalar*,
const Scalar*, const IdxGlobal*, IdxGlobal, IdxGlobal, Idx, IdxGlobal, IdxGlobal,
Idx, Idx, complex_storage, const std::vector<sycl::event>&, sycl::queue&);

template <typename Scalar, domain Domain, typename TOut>
sycl::event transpose_level(const typename committed_descriptor_impl<Scalar, Domain>::kernel_data_struct& kd_struct,
const Scalar* input, TOut output, const IdxGlobal* factors_triple, IdxGlobal committed_size,
Idx num_batches_in_l2, IdxGlobal n_transforms, IdxGlobal batch_start, Idx total_factors,
IdxGlobal output_offset, sycl::queue& queue, const std::vector<sycl::event>& events,
complex_storage storage);
sycl::event transpose_level(const typename committed_descriptor_impl<Scalar, Domain>::kernel_data_struct&,
const Scalar*, TOut, const IdxGlobal*, IdxGlobal, Idx, IdxGlobal, IdxGlobal, Idx, IdxGlobal,
sycl::queue&, const std::vector<sycl::event>&, complex_storage);

template <Idx, typename Scalar, domain Domain, typename TIn, typename TOut>
sycl::event global_impl_driver(const TIn&, const TIn&, TOut, TOut, committed_descriptor_impl<Scalar, Domain>&,
typename committed_descriptor_impl<Scalar, Domain>::dimension_struct&,
const kernels_vec<Scalar, Domain>&, const kernels_vec<Scalar, Domain>&, Idx, IdxGlobal,
IdxGlobal, std::size_t, std::size_t, IdxGlobal, IdxGlobal, IdxGlobal, complex_storage,
detail::elementwise_multiply, const Scalar*);
I find it more readable to have argument names here as well, especially as each function accepts many arguments of the same type.
friend std::vector<sycl::event> compute_level(
const typename committed_descriptor_impl<Scalar1, Domain1>::kernel_data_struct&, const TIn&, Scalar1*, const TIn&,
Scalar1*, const Scalar1*, const Scalar1*, const Scalar1*, const IdxGlobal*, IdxGlobal, IdxGlobal, Idx, IdxGlobal,
IdxGlobal, Idx, Idx, complex_storage, const std::vector<sycl::event>&, sycl::queue&);

template <typename Scalar1, domain Domain1, typename TOut>
friend sycl::event detail::transpose_level(
const typename committed_descriptor_impl<Scalar1, Domain1>::kernel_data_struct& kd_struct, const Scalar1* input,
TOut output, const IdxGlobal* factors_triple, IdxGlobal committed_size, Idx num_batches_in_l2,
IdxGlobal n_transforms, IdxGlobal batch_start, Idx total_factors, IdxGlobal output_offset, sycl::queue& queue,
const std::vector<sycl::event>& events, complex_storage storage);
const typename committed_descriptor_impl<Scalar1, Domain1>::kernel_data_struct&, const Scalar1*, TOut,
const IdxGlobal*, IdxGlobal, Idx, IdxGlobal, IdxGlobal, Idx, IdxGlobal, sycl::queue&,
const std::vector<sycl::event>&, complex_storage);

template <Idx, typename Scalar1, domain Domain1, typename TIn, typename TOut>
friend sycl::event global_impl_driver(const TIn&, const TIn&, TOut, TOut,
committed_descriptor_impl<Scalar1, Domain1>&,
typename committed_descriptor_impl<Scalar1, Domain1>::dimension_struct&,
const kernels_vec<Scalar1, Domain1>&, const kernels_vec<Scalar1, Domain1>&, Idx,
IdxGlobal, IdxGlobal, std::size_t, std::size_t, IdxGlobal, IdxGlobal, IdxGlobal,
complex_storage, detail::elementwise_multiply, const Scalar1*);
Same here.
@@ -215,17 +229,29 @@ class committed_descriptor_impl {
std::shared_ptr<IdxGlobal> factors_and_scan;
detail::level level;
std::size_t length;
std::size_t committed_length;
What is committed length and how does it differ from just length?
Committed length is the length that was committed for that dimension. The problem size needn't be the size that was committed, hence the two variables. Notice how prepare_implementation is now returning the problem size as well (here).
In which case does it differ from length?
Prime-sized inputs, where we need to pad (Bluestein), and in the future, in the case of Rader FFTs, where this will become committed_length - 1.
In that case I think the variable name could be improved. length is what the user commits, and that is what I think of when I read committed_length. How about impl_ct_length? Also add a comment explaining when this differs from length.
return {detail::level::GLOBAL, param_vec};
if (!detail::factorize_input(fft_size, check_and_select_target_level)) {
param_vec.clear();
fft_size = static_cast<IdxGlobal>(std::pow(2, ceil(log(static_cast<double>(fft_size)) / log(2.0))));
I dislike changing this variable in place: the fft size is not changing, but you are using the same variable for something else. Use a new variable, rounded_up_fft_size maybe?
* @return sycl::event waiting on the last transposes
*/
template <Idx SubgroupSize, typename Scalar, domain Domain, typename TIn, typename TOut>
sycl::event global_impl_driver(const TIn& input, const TIn& input_imag, TOut output, TOut output_imag,
I do not like putting driver in function names. It is such an overloaded term, which makes it less clear. There are GPU drivers, compiler driver ... Can you think of another name?
I thought about this, and I genuinely cannot come up with a better name; I feel the driver suffix is apt, because this function launches the compute and transpose kernels, manages dependencies, and increments pointers per level. It "drives" the global implementation.
How about just global_impl? I know this is a bit of nitpicking, so feel free to ignore if you disagree.
auto in_acc_or_usm = get_access(input, cgh);
auto out_acc_or_usm = get_access(output, cgh);
cgh.host_task([&]() {
for (std::size_t i = 0; i < num_copies; i++) {
events.push_back(queue.copy(&in_acc_or_usm[0] + i * src_stride + input_offset,
&out_acc_or_usm[0] + i * dst_stride + output_offset, num_elements_to_copy));
}
});
This is undefined behavior for buffers. queue.copy is only defined for USM pointers, NOT pointers to buffer memory. However, there are queue.copy overloads that accept accessors.
I did not know that, thanks for letting me know. I will look into it.
}

template <typename Scalar, domain Domain>
template <typename Dummy>
struct committed_descriptor_impl<Scalar, Domain>::calculate_twiddles_struct::inner<detail::level::GLOBAL, Dummy> {
static Scalar* execute(committed_descriptor_impl& desc, dimension_struct& /*dimension_data*/,
static Scalar* execute(committed_descriptor_impl& desc, dimension_struct& dimension_data,
Why did twiddle calculation need to change?
To calculate the twiddles for both forward and backward factors, I have refactored it a bit to avoid code duplication.
if (ptr != nullptr) {
return std::shared_ptr<T>(ptr, [captured_queue = queue](T* ptr) {
if (ptr != nullptr) {
sycl::free(ptr, captured_queue);
}
});
}
throw internal_error("Could not allocate usm memory of size: ", size * sizeof(T), " bytes");
if (ptr == nullptr) {
throw internal_error("Could not allocate usm memory of size: ", size * sizeof(T), " bytes");
}
return std::shared_ptr<T>(ptr, [captured_queue = queue](T* ptr) {
if (ptr != nullptr) {
sycl::free(ptr, captured_queue);
}
});
This would look better with less nesting.
if (conjugate_on_store == detail::complex_conjugate::APPLIED) {
conjugate_inplace(priv, fft_size);
}
Maybe you should always conjugate before the logging or after the logging. I realize that we had 1 before and 1 after already, but both before makes more sense to me.
The current implementation does not handle small prime values.
Adds support for sizes which have prime factors, for both INTERLEAVED_COMPLEX and SPLIT_COMPLEX layouts.
Checklist
Tick if relevant:
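For context on what the PR implements, Bluestein's algorithm rewrites a length-n DFT as a circular convolution that can be evaluated at any padded size m >= 2n - 1 (typically the next power of two). A compact host-side sketch follows; it uses a direct O(m^2) DFT for the convolution purely for illustration, whereas the PR evaluates the convolution with portFFT's own kernels:

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cd = std::complex<double>;

// Direct O(n^2) DFT, used here only to evaluate the convolution below.
std::vector<cd> direct_dft(const std::vector<cd>& x, bool inverse) {
  const std::size_t n = x.size();
  const double pi = std::acos(-1.0);
  const double sign = inverse ? 2.0 : -2.0;
  std::vector<cd> out(n);
  for (std::size_t k = 0; k < n; ++k) {
    cd acc{0.0, 0.0};
    for (std::size_t j = 0; j < n; ++j) {
      const double angle = sign * pi * double(k) * double(j) / double(n);
      acc += x[j] * cd(std::cos(angle), std::sin(angle));
    }
    out[k] = inverse ? acc / double(n) : acc;
  }
  return out;
}

// Bluestein: a length-n DFT via a length-m circular convolution, m >= 2n - 1.
std::vector<cd> bluestein_dft(const std::vector<cd>& x) {
  const std::size_t n = x.size();
  const double pi = std::acos(-1.0);
  std::size_t m = 1;
  while (m < 2 * n - 1) m <<= 1;  // padded power-of-two size
  std::vector<cd> w(n), a(m, cd{0.0, 0.0}), b(m, cd{0.0, 0.0});
  for (std::size_t j = 0; j < n; ++j) {
    const double angle = -pi * double(j) * double(j) / double(n);
    w[j] = cd(std::cos(angle), std::sin(angle));  // chirp w_j = exp(-i*pi*j^2/n)
    a[j] = x[j] * w[j];                           // modulated input, zero-padded to m
    b[j] = std::conj(w[j]);                       // convolution kernel ...
    if (j > 0) b[m - j] = std::conj(w[j]);        // ... with negative indices wrapped
  }
  // Circular convolution: forward DFTs, pointwise multiply, inverse DFT.
  std::vector<cd> fa = direct_dft(a, false);
  std::vector<cd> fb = direct_dft(b, false);
  for (std::size_t j = 0; j < m; ++j) fa[j] *= fb[j];
  std::vector<cd> conv = direct_dft(fa, true);
  std::vector<cd> out(n);
  for (std::size_t k = 0; k < n; ++k) out[k] = w[k] * conv[k];  // demodulate
  return out;
}
```

The padding to a power of two is exactly the fft_size round-up discussed in the review above, and the chirp values w are the load/store modifiers the kernels apply around the forward and backward (convolution) phases.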