Add bounds_t with pre-aligned mins and maxs #1482
base: master
Conversation
For now there is a bug somewhere that I have not found yet. The bug can be seen by loading the […] map. In my first iteration of the branch I made a stupid mistake by passing the […].
Note: the percentage of time spent is not reliable in those screenshots because I use a […]. Before (C), a lot of time is spent in […]. After (C), the first loads are just completely skipped. Before (Asm), we see three […]. After (Asm), we see only two […].
So, as I said, this is not urgent. But I would appreciate people re-reading what I did, in the hope that someone finds the stupid mistake I may have made. The code change should be straightforward because it is basically a rewrite with a different data layout; no algorithm change is expected. Unfortunately this data struct is used in many places, so the patch is massive. There are many functions in […].
I also added some constness to some function inputs, which may help the compiler optimize some other places of the code a bit more. I doubt this extra constness is the cause of the bug I'm facing, because the compiler doesn't complain that I'm writing to some […].
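As an illustration of the kind of constness change meant here (the helper below is hypothetical, not a function from the patch), marking read-only inputs const documents the intent and lets the compiler assume nothing is written through those parameters:

typedef float vec3_t[ 3 ]; // vec3_t as commonly defined in idTech-derived engines

// Hypothetical helper: the bounds and the point are only read,
// so all three parameters can be declared const.
static bool BoundsContainPoint( const vec3_t mins, const vec3_t maxs, const vec3_t point )
{
	return point[ 0 ] >= mins[ 0 ] && point[ 0 ] <= maxs[ 0 ] &&
	       point[ 1 ] >= mins[ 1 ] && point[ 1 ] <= maxs[ 1 ] &&
	       point[ 2 ] >= mins[ 2 ] && point[ 2 ] <= maxs[ 2 ];
}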
Force-pushed from da96dc5 to b9fbbab
out->bounds.mins[ 1 ] = LittleLong( in->mins[ 1 ] );
out->bounds.mins[ 2 ] = LittleLong( in->mins[ 2 ] );
out->bounds.maxs[ 1 ] = LittleLong( in->maxs[ 1 ] );
out->bounds.maxs[ 2 ] = LittleLong( in->maxs[ 2 ] );
Here I forgot:
out->bounds.maxs[ 0 ] = LittleLong( in->maxs[ 0 ] );
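For reference, a minimal sketch of the same conversion with that line restored, written as a loop over the three components so that no index can be dropped (assuming the same out, in and LittleLong as in the diff above; the loop form itself is illustrative, not what the patch does):

// Endian-convert all three components of mins and maxs,
// including the maxs[ 0 ] assignment that was missing above.
for ( int i = 0; i < 3; i++ )
{
	out->bounds.mins[ i ] = LittleLong( in->mins[ i ] );
	out->bounds.maxs[ i ] = LittleLong( in->maxs[ i ] );
}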
That was the bug I was looking for.
Force-pushed from b9fbbab to 3176224
So I reread everything and found the bug I was looking for. I missed the conversion of one line (I simply did not re-add the new one, which is why there was no compile-time error). It works.
Force-pushed from 4bea319 to e99d1aa
I found another bug affecting BSP movers like doors. One can test it in the Vega map: the doors disappear depending on the viewing angle / point of view.
Force-pushed from 92be21c to d7d9d1e
Any fps difference?
Force-pushed from d7d9d1e to 3239c6e
This is not urgent at all. This is complementary to:

- BoxOnPlaneSide function #1142

It also improves over:

- plane_t struct with normal and dist members #1043

The BoxOnPlaneSide() function is known to be a hot spot, being called by multiple recursive functions. Right now we spend a lot of time in _mm_loadu_ps: we have to call sseLoadVec3Unsafe() explicitly because we can't guess whether the data is aligned or not. This comes with multiple downsides:

- _mm_loadu_ps is said to be slower than _mm_load_ps, and that fits my experience¹.
- The compiler will always call _mm_loadu_ps if we ask for it explicitly, even if the data is already aligned and, in my experience, even if no copy is needed.

So the idea is to introduce a nice bounds_t struct that uses aligned mins and maxs, and to do the same for cplane_t with an aligned normal. When doing that, we can write an explicit _mm_load_ps that is faster than _mm_loadu_ps, and in most cases, because the compiler knows the data is const and already aligned, it just removes the _mm_load_ps and processes the data without any copy.

See also:

- bounds_t with pre-aligned mins and maxs Unvanquished/Unvanquished#3265

¹ Some time ago I tried to write optimized SSE code for some other functions, but the code was slower because of the explicit _mm_loadu_ps call. I even noticed that copying an unaligned vec3 to an aligned vec3 before calling _mm_load_ps could make the code faster, when the compiler notices the data is already aligned, skips the copy, and calls _mm_load_ps (or even doesn't need to call it at all).
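To make the idea concrete, here is a minimal sketch (not the actual patch) of what an aligned bounds struct and an aligned load could look like. The padding to four floats and the LoadBoundsMins() helper are illustrative assumptions; bounds_t, mins and maxs are the names used by this pull request:

#include <xmmintrin.h> // _mm_load_ps, __m128

typedef float vec_t; // scalar type, as in idTech-derived engines

// Sketch of a bounds struct whose members are 16-byte aligned. Each member
// is padded to 4 floats so a full 128-bit aligned load never reads past it.
struct alignas( 16 ) bounds_t
{
	alignas( 16 ) vec_t mins[ 4 ];
	alignas( 16 ) vec_t maxs[ 4 ];
};

// Illustrative helper: because the alignment is part of the type, the faster
// _mm_load_ps can be used instead of _mm_loadu_ps or an explicit
// sseLoadVec3Unsafe() copy, and the compiler may fold the load away entirely.
static inline __m128 LoadBoundsMins( const bounds_t& bounds )
{
	return _mm_load_ps( bounds.mins );
}

The exact layout in the patch may differ, but the key property is the same: the alignment is guaranteed by the type, so call sites no longer need the unaligned-load fallback.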