Improve basic_string::find_first_of and basic_string::find_last_of vectorization for large needles or very large haystacks #5029
Conversation
https://github.com/AlexGuteniev/STL/tree/ascii-table-experiment is a branch with an altered benchmark program that I used to confirm the thresholds' characteristics and find out their values. It is experimental science, not just plain theory 🔬!
Even though sll is potentially more expensive than sllv, the broadcast is still more expensive
stl/src/vector_algorithms.cpp
if (_Use_sse42()) {
    if (_Use_avx2()) {
        // AVX2 path: use the vector bitmap when the AVX threshold is met and all needle values fit
        if (_Bitmap::_Use_bitmap_avx<_Ty>(_Haystack_length, _Needle_length)
            && _Bitmap::_Can_fit_256_bits_sse(static_cast<const _Ty*>(_Needle), _Needle_length)) {
            return _Bitmap::_Impl_last_avx<_Ty>(_Haystack, _Haystack_length, _Needle, _Needle_length);
        }
    } else {
        // SSE4.2-only path: use the scalar bitmap when the SSE threshold is met and all needle values fit
        if (_Bitmap::_Use_bitmap_sse<_Ty>(_Haystack_length, _Needle_length)
            && _Bitmap::_Can_fit_256_bits_sse(static_cast<const _Ty*>(_Needle), _Needle_length)) {
            _Bitmap::_Scalar_table_t _Table = {};
            _Bitmap::_Build_scalar_table_no_check<_Ty>(_Needle, _Needle_length, _Table);
            return _Bitmap::_Impl_last_scalar<_Ty>(_Haystack, _Haystack_length, _Table);
        }
    }
Question about this control flow: If we can use AVX2, but the _Use_bitmap_avx threshold isn't met, we don't try the _Use_bitmap_sse threshold, and instead just fall through. But if we can only use SSE4.2, we try the SSE4.2 threshold before falling through, indicating that the guarded code is valuable. From a brief glance, I see that there are cases where the AVX2 threshold wouldn't activate, but the SSE4.2 threshold would.
Given that the threshold checks are cheap integer comparisons, should we follow the more common pattern of "if we can use AVX2 and the AVX2 threshold is met, return cool AVX2 thing. If we can use SSE4.2 and the SSE4.2 threshold is met, return slightly less cool SSE4.2 thing. Return fallback thing."?
I wanted to approximate the problem by not looking into these cases and not picking between three algorithms.
The cases pessimized by this approximation are those where the needle greatly exceeds the haystack, and the pessimization is a noticeable percentage, but nothing like 2x.
If we want to do this right, we should handle not only the cases you see here ("AVX bitmap is not preferable over vector nested loop, but scalar bitmap is preferable over vector nested loop"), but also the set of cases "AVX bitmap is preferable over vector nested loop, scalar bitmap is also preferable over vector nested loop, and scalar bitmap is preferable over AVX bitmap". The latter is a larger set.
At some point the AVX threshold function returned an enum; we would need to return to that:
enum class _Strategy { _No_bitmap, _Scalar_bitmap, _Vector_bitmap };
_Strategy _Pick_strategy_avx(const size_t _Count1, const size_t _Count2) noexcept { ... }
So I don't mind handling these cases better, at the expense of complicating the thresholds. Do you want me to do that?
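For illustration only (not the PR's actual code), the dispatch excerpt above could be reshaped around such a strategy function; the helper names are reused from the diff and snippet in this thread, and the exact behavior of _Pick_strategy_avx is an assumption:
switch (_Pick_strategy_avx(_Haystack_length, _Needle_length)) {
case _Strategy::_Vector_bitmap:
    if (_Bitmap::_Can_fit_256_bits_sse(static_cast<const _Ty*>(_Needle), _Needle_length)) {
        return _Bitmap::_Impl_last_avx<_Ty>(_Haystack, _Haystack_length, _Needle, _Needle_length);
    }
    break;
case _Strategy::_Scalar_bitmap:
    if (_Bitmap::_Can_fit_256_bits_sse(static_cast<const _Ty*>(_Needle), _Needle_length)) {
        _Bitmap::_Scalar_table_t _Table = {};
        _Bitmap::_Build_scalar_table_no_check<_Ty>(_Needle, _Needle_length, _Table);
        return _Bitmap::_Impl_last_scalar<_Ty>(_Haystack, _Haystack_length, _Table);
    }
    break;
case _Strategy::_No_bitmap:
    break; // fall out of the switch to the vector nested loop algorithm
}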
@StephanTLavavej mentioned on Discord that it is OK to use a three-way strategy function. As its implementation still needs to be established (there was one, but it matched the previous AVX bitmap implementation), moving to work in progress again.
Follow-up on #4934 (comment):
Looked closer into that case, and made it even faster than it was.
🗺️ Summary of changes
This PR consists of the following changes:
- A new vector bitmap algorithm in the __std_find_first_of_trivial_pos_N family, which is used by strings and string view. The existing __std_find_first_of_trivial_N is still used by the standalone algorithm.
- The constexpr context keeps using the existing header implementation, because why not.
⚙️ Vector bitmap algorithm
It is an AVX2-only algorithm. It processes 8 values at once.
Like the existing scalar bitmap algorithm, it can be used when all needle character values do not exceed 255. Instead of having an array of 256 bool values, it uses an actual bitmap. The whole bitmap fits into an __m256i variable, that is, an AVX2 register.
If another AVX2 register contains 8 32-bit values that act as indices into the 32-bit bitmap parts, _mm256_permutevar8x32_epi32 (vpermd) can look up 8 parts at once. The part indices are the high 3 bits of the 8-bit values. The low 5 bits can then be used to obtain the exact bit within the 32-bit part by a shift. AVX2 has variable 32-bit shifts that take a vector of shift values instead of a single value for all lanes: _mm256_srlv_epi32 and _mm256_sllv_epi32. The resulting mask can be obtained by _mm256_movemask_ps.
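As a minimal sketch of that lookup step (illustrative, not the PR's actual code; the function name and the exact shift formulation are assumptions), testing 8 values that have already been zero-extended into 32-bit lanes could look like this:
#include <immintrin.h>

// _Bitmap_reg: 256-bit bitmap, bit v set when value v (0..255) is in the needle.
// _Values: 8 haystack elements, each zero-extended into a 32-bit lane.
// Returns an 8-bit mask with bit i set when element i is present in the bitmap.
unsigned int _Lookup_8(const __m256i _Bitmap_reg, const __m256i _Values) noexcept {
    // High 3 bits select one of the eight 32-bit bitmap parts (vpermd lookup).
    const __m256i _Parts = _mm256_permutevar8x32_epi32(_Bitmap_reg, _mm256_srli_epi32(_Values, 5));
    // Low 5 bits select the bit within the part; move it into the sign bit
    // with a variable shift so that _mm256_movemask_ps can collect it.
    const __m256i _Shift = _mm256_sub_epi32(_mm256_set1_epi32(31), _mm256_and_si256(_Values, _mm256_set1_epi32(0x1F)));
    const __m256i _Found = _mm256_sllv_epi32(_Parts, _Shift);
    return static_cast<unsigned int>(_mm256_movemask_ps(_mm256_castsi256_ps(_Found)));
}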
Bitmap building
Small needles
Unfortunately, there's no instruction in AVX2 that can combine bits from different elements of the same vector into a single element. This means that the bitmap building has to be fully scalar, or at least partially scalar (doing some processing in parallel, but the final steps in scalar).
The scalar bitmap building loop performs rather poorly, worse than a loop that builds a bool array. So I implemented a loop that uses vector instructions for this: it uses vector registers and no stack, and it seems faster than creating a stack array and loading it afterwards. The key trick in this approach is that a value from one of the shifts is expanded via _mm256_cvtepu8_epi64, so a 32-bit shift becomes a 256-bit shift of a lower granularity; the granularity is added back by another shift.
I managed to get only a slight improvement when trying to partially parallelize it, and the complexity of bitmap building grew significantly, so let's probably not do it. Different variations of non-parallel bitmap building have about the same performance, so I kept almost the same code I tried at first, except that I adjusted it to also work in 32-bit x86.
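For reference, the plain scalar per-element bitmap building that these variants compete against could look like the following minimal sketch (illustrative names, not the PR's code):
#include <cstddef>
#include <cstdint>

// Eight 32-bit parts cover values 0..255; high 3 bits pick the part, low 5 bits pick the bit.
void _Build_bitmap_scalar(const uint8_t* const _Needle, const size_t _Needle_length, uint32_t (&_Bitmap)[8]) noexcept {
    for (size_t _Ix = 0; _Ix != _Needle_length; ++_Ix) {
        const uint8_t _Val = _Needle[_Ix];
        _Bitmap[_Val >> 5] |= 1u << (_Val & 0x1F); // read-modify-write per needle element
    }
}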
Large needles
The vector instructions loop that builds the bitmap performs poorly relative to the bool array building loop. At some point it makes sense to build a bool array and compress it to a bitmap. As the size of the array/bitmap is constant, the compression is a constant instruction sequence without a loop, and it takes constant time.
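A hedged sketch of that idea (assumed names; the compression over the eight parts is shown as a tiny counted loop for brevity, whereas a fully unrolled, loop-free sequence takes constant time):
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Mark present values in a byte table, then compress 32 table bytes at a time
// into one 32-bit bitmap part via _mm256_movemask_epi8.
void _Build_bitmap_via_table(const uint8_t* const _Needle, const size_t _Needle_length, uint32_t (&_Bitmap)[8]) noexcept {
    alignas(32) uint8_t _Table[256] = {};
    for (size_t _Ix = 0; _Ix != _Needle_length; ++_Ix) {
        _Table[_Needle[_Ix]] = 0xFF; // one cheap store per needle element, no read-modify-write
    }
    for (int _Part = 0; _Part != 8; ++_Part) {
        const __m256i _Chunk = _mm256_load_si256(reinterpret_cast<const __m256i*>(_Table + _Part * 32));
        _Bitmap[_Part] = static_cast<uint32_t>(_mm256_movemask_epi8(_Chunk));
    }
}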
Test for suitable element values
This check is done separately, before creating the bitmap. It is vectorized and allows bailing out quickly, without building the bitmap, if the values aren't suitable. There isn't a specific benchmark for that currently, but I think this would work.
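A minimal sketch of such a check for 16-bit elements (assumed shape and names, not the PR's code; runtime CPU dispatch is omitted):
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Returns true when every needle value fits into the 256-bit bitmap (<= 255).
bool _All_fit_256_bits(const uint16_t* const _Needle, const size_t _Count) noexcept {
    const __m256i _Limit = _mm256_set1_epi16(0x00FF);
    size_t _Ix = 0;
    for (; _Ix + 16 <= _Count; _Ix += 16) {
        const __m256i _Vals = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(_Needle + _Ix));
        // Unsigned "value > 255" test: max(value, 255) stays 255 only when value <= 255.
        const __m256i _Eq = _mm256_cmpeq_epi16(_mm256_max_epu16(_Vals, _Limit), _Limit);
        if (_mm256_movemask_epi8(_Eq) != -1) {
            return false; // some value exceeds 255; bail out before building the bitmap
        }
    }
    for (; _Ix != _Count; ++_Ix) { // scalar tail
        if (_Needle[_Ix] > 0xFF) {
            return false;
        }
    }
    return true;
}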
Advantage over existing
The advantageous cases are those where the needle and haystack product is big enough to make the existing vector algorithms bad, but the haystack is still way bigger than the needle, so the scalar bitmap lookup is also bad. Added some of them to the benchmark.
Surprisingly, this extends to the case of very small needles: with a haystack of over roughly 1000 elements, the vector bitmap wins over SSE4.2 even for just a few needle elements.
Can we have this in SSE?
No. There's _mm256_shuffle_epi8 to do the bitmap parts extraction. But there's no variable vector shift. There isn't even a variable vector shift in AVX2 with a vector element width smaller than 32. So there is probably nothing better than using an 8-element AVX2 vector.
⚖️ Selecting algorithm
The problem with estimating the run time in advance is that we don't know how long the algorithm will run: it doesn't traverse the full haystack if the position is found earlier. But when selecting the algorithm we only know the full lengths. Knowing the full lengths, we can at least estimate the worst case.
Let's start with the worst case and get back to the early return possibility later on.
Run time evaluation
The nested loop algorithms, both scalar and vector, are O(m*n), and the vector algorithm is definitely preferred for any noticeably high values of m and n. Also, any bitmap algorithm is faster than the scalar nested loop, unless the element is found in the very first position. So we can safely exclude the scalar nested loop from consideration.
Both the scalar and the vector bitmap algorithms are some sort of O(n + m), but they weight m and n quite differently. Specifically, the vector bitmap algorithm treats needle length much worse than haystack length, because that part is not parallel, whereas the scalar bitmap algorithm treats them almost equally (surprisingly, the needle has slightly less weight).
The vector nested loop algorithm clearly outperforms when both n and m are small, so their product is also small. In specific cases the vector algorithm is linear: if either n or m fits within a single vectorization unit, it doesn't even have a nested loop (for a short needle this is a deliberate optimization, for a small haystack it is the result of the separate haystack tail processing).
After benchmarking these edge cases, it can be seen that the vector nested loop outperforms everything for a long needle and a small haystack, but it doesn't always outperform the vector bitmap for a short needle and a large haystack. The former allows excluding the scalar bitmap algorithm from consideration: with any haystack that is not very small, the vector bitmap algorithm's advantage is noticeable. The very small set of cases where the scalar bitmap can win (a small but not very small haystack and a long needle) still doesn't give it a solid win; these cases are ultimately bound by the same scalar bitmap building loop for both algorithms. The benchmark may still show a noticeable difference there, but only because these are different instances of that loop, and codegen or other random factors might affect them.
So we need to pick: the vector bitmap vs. the vector nested loop when AVX2 is available, and the scalar bitmap vs. the vector nested loop when only SSE4.2 is available. It is hard to reason about the threshold functions, so the thresholds were obtained by aggressive benchmarking.
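To make the shape of the trade-off concrete, an illustrative cost comparison could look like the sketch below; the coefficients are made up for illustration and are not the tuned thresholds from this PR:
#include <cstddef>

// n = haystack length, m = needle length. The nested loop is roughly O(n * m);
// the bitmaps are roughly O(n + m), with the vector bitmap paying more per
// needle element (scalar construction) and much less per haystack element.
constexpr bool _Prefer_vector_bitmap(const size_t _Haystack_length, const size_t _Needle_length) noexcept {
    const size_t _Nested_loop_cost   = _Haystack_length * _Needle_length; // may overflow; real code must guard this
    const size_t _Vector_bitmap_cost = 8 * _Needle_length + _Haystack_length; // coefficients are illustrative only
    return _Vector_bitmap_cost < _Nested_loop_cost;
}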
Considering early return
There is an early return possibility. If we don't consider it, we may pick a bitmap algorithm where the vector nested loop is better. If we expect it but it doesn't happen, we may pick the vector nested loop where a bitmap algorithm is better. It looks like the latter gives the worse error.
Generally, the price of a wrong pick is small for short needles. Long needles are the gambling cases, but even for long needles the price of not picking the vector nested loop when it is better is no more than 2x.
Why is this dispatch not in headers?
No big reason. There's an overflow multiply intrinsic used from <intrin.h>, but that one is not essential. Maybe this also makes maintenance easier, by having fewer functions exposed from vector_algorithms.cpp. Otherwise I guess I'm just hiding the complexity under the carpet.
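For context, the kind of guarded multiply such a threshold would need (presumably for the needle-by-haystack product) can also be written portably; this is a hedged sketch with assumed names, not the intrinsic-based version from the PR:
#include <cstddef>
#include <limits>

// True when m * n would exceed the limit, treating overflow as "exceeds".
inline bool _Product_exceeds(const size_t _M, const size_t _N, const size_t _Limit) noexcept {
    if (_N != 0 && _M > std::numeric_limits<size_t>::max() / _N) {
        return true; // the product overflows size_t, so it certainly exceeds the limit
    }
    return _M * _N > _Limit;
}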
🛑 Risks
This time I don't see anything that seems incorrect; still, it is a complex change with some risks to consider:
- Changed __std_find_last_of_trivial_pos_N usage, see below
Changed __std_find_last_of_trivial_pos_N usage
__std_find_last_of_trivial_pos_N has been shipped in #4934. Now it does the bitmap, which is not what the old code expects. The worst that would happen is this: when the old header implementation rejects the scalar bitmap due to unsuitable values, the call would unnecessarily try the bitmap again. This time the attempt would be even faster due to the vectorized value checking, unless the user does not have SSE4.2. I just don't want to add more functions with more names just for this reason.
Not wanting to have this situation for another function is the reason I made this PR before the _not_ vectorization (remaining for the find 🐱 family).
⏱️ Benchmark results