
hybrid between old and LUT implicit triangulation #971

Draft
wants to merge 1 commit into dev

Conversation

@JonasLukasczyk (Contributor)

Hi Julien,
this PR is the unholy union between the old and the LUT-based approach for implicit triangulations. For now, this seems to result in a 4x speedup. We will work in the background on getting the 8x speedup of the pure LUT-based approach into TTK, but, as you said, grabbing this speedup essentially for free right now makes sense. I tested this, but since touching the triangulation is tricky work, I recommend that we test it as much as possible.

Best
Jonas

@JonasLukasczyk (Contributor, Author)

Hi Julien, I ran clang-format, so I don't know why the CI is complaining.

@julien-tierny (Collaborator)

you may want to use a version of clang-format in line with the CI's.
fyi, I'm using clang-format 16.0.6 here.

@julien-tierny (Collaborator)

I've just pushed the diff with clang-format 16. let's see :)

@julien-tierny (Collaborator)

it looks like the CI is using clang-format 14.0-55.
you may want to use that one.

@julien-tierny (Collaborator)

I've just tried with clang-format 14.
Let's see if it's any better.

@JonasLukasczyk (Contributor, Author)

If this runs through I will squash the commits.

@julien-tierny (Collaborator) left a comment

hi jonas,
thanks a lot!
that looks awesome.

I haven't run it yet (I'll double-check the performance over the weekend).
Could you address the two tiny remarks I mentioned here?

Also, could you please port this query over to the PeriodicImplicitTriangulation as well?
(in principle, there are fewer cases there).

Thanks a lot!

Two inline review comments on core/base/implicitTriangulation/ImplicitTriangulation.h (outdated; resolved)
@julien-tierny (Collaborator) left a comment

here's another batch of small-scale remarks.
thanks!

Four inline review comments on core/base/implicitTriangulation/ImplicitTriangulation.h (×3) and core/base/implicitTriangulation/ImplicitTriangulation.cpp (outdated; resolved)
@julien-tierny (Collaborator) commented Sep 16, 2023

perf notes (to be formatted later)

ScalarFieldSmoother(ctBones)

vger (laptop)

ttk:dev

  • 6 cores, 100 it: 36s
  • 1 core, 25 it: 34s

LUT

  • 6 cores, 100 it: 99s
  • 1 core, 25 it: 97s

hal9000 (desktop)

ttk:dev

  • 24 cores, 100 it: 13.782s
  • 1 core, 25 it: 27.455s

LUT

  • 24 cores, 100 it: 18.048s
  • 1 core, 25 it: 69.135s

@julien-tierny (Collaborator)

Hi Jonas,

I've been trying to benchmark the LUT implementation a little by simply applying the scalar field smoother (which only uses getVertexNeighbor) to ctBones.vti. I built the LUT implementation with the same options as the current dev branch of TTK.

For some reason, I can't reproduce speedups on my systems (see the above quick benchmark).

Could you try to run this little benchmark on your end and report your numbers?
Here is the corresponding Python script.

# state file generated using paraview version 5.11.1
import paraview
paraview.compatibility.major = 5
paraview.compatibility.minor = 11

#### import the simple module from the paraview
from paraview.simple import *
#### disable automatic camera reset on 'Show'
paraview.simple._DisableFirstRenderCameraReset()

# ----------------------------------------------------------------
# setup the data processing pipelines
# ----------------------------------------------------------------

# create a new 'XML Image Data Reader'
ctBonesvti = XMLImageDataReader(registrationName='ctBones.vti', FileName=['ctBones.vti'])
ctBonesvti.PointArrayStatus = ['Scalars_']
ctBonesvti.TimeArray = 'None'

# create a new 'TTK ScalarFieldSmoother'
tTKScalarFieldSmoother1 = TTKScalarFieldSmoother(registrationName='TTKScalarFieldSmoother1', Input=ctBonesvti)
tTKScalarFieldSmoother1.ScalarField = ['POINTS', 'Scalars_']
tTKScalarFieldSmoother1.IterationNumber = 25
tTKScalarFieldSmoother1.MaskField = ['POINTS', 'Scalars_']
tTKScalarFieldSmoother1.UseAllCores = 0

UpdatePipeline()

# create a new 'TTK ScalarFieldSmoother'
tTKScalarFieldSmoother2 = TTKScalarFieldSmoother(registrationName='TTKScalarFieldSmoother2', Input=ctBonesvti)
tTKScalarFieldSmoother2.ScalarField = ['POINTS', 'Scalars_']
tTKScalarFieldSmoother2.IterationNumber = 100
tTKScalarFieldSmoother2.MaskField = ['POINTS', 'Scalars_']

UpdatePipeline()

@JonasLukasczyk (Contributor, Author)

Hi Julien,
I cannot reproduce your numbers. Can you run HelloWorld on a 1024^3 Wavelet? That dataset is much bigger, and here I get a 3x speedup:
DEV

[HelloWorld] Computing Averages .............................. [6.587s|32T|100%]

LUT

[HelloWorld] Computing Averages .............................. [2.228s|32T|100%]

I suspect that your compiler decided not to inline the methods, and as far as I know there is no way to force this. What are your thoughts, @petersteneteg?

Another thought: is it possible that your code uses the implicit triangulation with preconditioning?

@julien-tierny (Collaborator)

hi jonas,
thanks for your feedback.
I'll run the test you suggested.
In the meantime, can you please run the Python script I suggested and report the numbers?
Thanks.
julien

@julien-tierny (Collaborator) commented Sep 20, 2023

perf notes (to be formatted later)

HelloWorldCmd -i ctBones512.vti (512 is sufficient to trigger the full implicit triangulation)

hal9000 (desktop)

ttk:dev

  • 1 core: 13.938s
  • 24 cores: 0.934s

ttk:dev with inline fully implicit getVertexNeighbors (https://github.com/julien-tierny/ttk/tree/implicitInlining)

  • 1 core: 7.926s
  • 24 cores: 0.522s
  • 24 cores: 0.436s (2209754)

lut

  • 1 core: 10.193s
  • 24 cores: 0.557s

vger (laptop)

ttk:dev

  • 1 core:
  • 6 cores:

lut

  • 1 core: 22.847s
  • 6 cores: 6.405s

perf smoother

ttkScalarFieldSmootherCmd -i ctBones512.vti -I

hal9000 (desktop)

ttk:dev

  • 1 core (5 it): 90.695s
  • 24 cores (50 it): 56.186s

ttk:dev with inline fully implicit getVertexNeighbors (https://github.com/julien-tierny/ttk/tree/implicitInlining)

  • 1 core (5 it): 39.800s
  • 24 cores (50 it): 40.502s

lut

  • 1 core (5 it): 57.788s
  • 24 cores (50 it): 43.719s

perf persistence diagram

ttkPersistenceDiagramCmd -i ctBones512.vti

hal9000 (desktop)

ttk:dev

  • 24 cores: 90.434s (44.357s), 91.534s (44.000s), 91.618s (43.766s), 94.023s (44.365s), 91.203s (44.647s) AVG: 91.7624 (44.227)

ttk:dev with inline fully implicit getVertexNeighbors (https://github.com/julien-tierny/ttk/tree/implicitInlining)

  • 24 cores: 82.889s (36.938s), 83.455s (37.102s), 86.247s (38.791s), 85.258s (37.642s), 83.753s (37.097s) (53e2e58) AVG: 84.3204 (37.514)
  • 24 cores: 86.703s (a53067d)
  • 24 cores: 87.456s (0e4eee9)
  • 24 cores: 84.082s (39.638s), 87.711s (39.942s), 86.249s (39.713s), 88.103s (41.261s), 85.699s (39.606s) (57d7181) AVG: 86.3688 (40.032)

lut

  • 24 cores: 93.594s

@julien-tierny (Collaborator)

alright, so I believe I understand what's going on here.

in short:

  • preconditioned implicit is ~3x faster than LUT, which is ~2x faster than full implicit (especially in parallel)

so then, I believe this PR should be modified: the LUT implementation should replace the full implicit one, not the preconditioned one (I have the impression you replaced both).

@julien-tierny (Collaborator)

so in short, I believe you should leave the old implementation (i.e. `template <typename Derived> int ImplicitTriangulationCRTP<Derived>::TTK_TRIANGULATION_INTERNAL(getVertexNeighbor)(const SimplexId &vertexId, const int &localNeighborId, SimplexId &neighborId) const`) as is in ImplicitTriangulation.cpp, but re-implement the function for the fully implicit triangulation.

for that, you can follow the example of the function getVertexPosition, which has one specialized fully implicit implementation (in ImplicitTriangulation.cpp) and a preconditioned (default) one (in ImplicitTriangulation.h), as sketched below.
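For illustration, a minimal sketch of that split (the class layout is modeled on the getVertexPosition example above, but the names and bodies here are hypothetical; the actual TTK signatures, macros, and LUT logic differ):

```cpp
#include <vector>

using SimplexId = long long;

// CRTP base: keeps the preconditioned (default) implementation.
template <typename Derived>
class ImplicitTriangulationCRTP {
public:
  // Default (preconditioned) query: a plain lookup into the neighbor
  // lists computed by the preconditioning pass.
  int getVertexNeighbor(const SimplexId &vertexId,
                        const int &localNeighborId,
                        SimplexId &neighborId) const {
    neighborId = vertexNeighborList_[vertexId][localNeighborId];
    return 0;
  }

protected:
  std::vector<std::vector<SimplexId>> vertexNeighborList_{};
};

// Fully implicit case: shadows the default with a specialized query that
// computes the neighbor on the fly, with no preconditioned storage.
class FullyImplicitTriangulation
  : public ImplicitTriangulationCRTP<FullyImplicitTriangulation> {
public:
  int getVertexNeighbor(const SimplexId &vertexId,
                        const int &localNeighborId,
                        SimplexId &neighborId) const {
    // Hypothetical stand-in for the actual LUT-based computation.
    neighborId = vertexId + neighborOffsetLUT_[localNeighborId];
    return 0;
  }

private:
  std::vector<SimplexId> neighborOffsetLUT_{};
};
```

This way, callers going through the fully implicit type pick up the specialized query, while the preconditioned (default) path is left untouched.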

we can discuss this on Friday.
cheers.

@JonasLukasczyk (Contributor, Author)

Hi everyone,
I'm sorry, but I cannot spend any more time on this. @pierre-guillou, can you add this fix?

Best
Jonas

@julien-tierny (Collaborator)

hi jonas,
I have updated the perf numbers above by taking just the inlining speedup (https://github.com/julien-tierny/ttk/tree/implicitInlining). I'm currently populating these numbers for my laptop.
Something to talk about on Friday.

@JonasLukasczyk (Contributor, Author)

I don't understand why you don't see any speedup from using the hybrid over the inlined dev version, but I think we have already reached the point where investing more time into the hybrid defeats the purpose. For now, it seems much easier to just inline all methods of the fully implicit triangulation and call it a day. Afterwards, we can see whether a purely LUT-based approach brings additional speedup.

@julien-tierny (Collaborator) commented Sep 23, 2023

alright everyone.
so I crunched some numbers, and it turns out that inlining (final) all the functions of the implicit triangulation (https://github.com/julien-tierny/ttk/tree/implicitInlining) does improve performance overall.
for the discrete gradient computation (a sub-part of the persistence diagram computation and a fairly good benchmark, since it uses most of the queries of the triangulation), I observe an average acceleration of about 10% against the current dev branch of ttk (see the performance numbers above; the numbers in parentheses correspond to the discrete gradient). that's good.

what bugs me is that, if we don't inline all the functions, but only getVertexNeighbor and getVertexNeighborNumber (only these two functions, julien-tierny@53e2e58), the gain is actually better (15%).
I don't have a clear explanation here. I thought that inlining could only increase performance (or, in the worst case, leave it unchanged); I was not expecting performance to degrade as more functions were inlined.

do you guys have any explanations of what is going on?
@petersteneteg @pierre-guillou

thanks a lot for your feedback.
best,

@petersteneteg (Contributor)

@julien-tierny

So, to a first approximation, inlining more will make the code faster, since it avoids the function call overhead and often enables a bunch of other optimizations.

The problem arises when a lot of code gets inlined: the code size increases, since the code basically gets copy-pasted many times. And when the code size increases, that puts more pressure on the instruction cache in the CPU; the processor is starved for instructions rather than for data.

So ideally you need to balance the amount of inlining that you are doing. The compiler has a lot of built-in heuristics to strike a good balance between inlining and code size, but it will not always find the optimal one.

One can use tools like cachegrind to look at the number of instruction cache misses vs. data cache misses to potentially optimize the performance.
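For what it's worth, here is a tiny sketch of what that balance can look like in code (the attribute below is GCC/Clang-specific, and the struct and method names are made up for illustration; this is not how TTK actually annotates its queries):

```cpp
// Inline only the hottest queries; keep rarely used ones out of line so
// their code exists once instead of being duplicated at every call site,
// which keeps pressure off the instruction cache.
struct GridTriangulation {
  // Hot query, called in tight loops: defined in the header and forced
  // inline so the call overhead disappears. (MSVC: __forceinline.)
  __attribute__((always_inline)) inline long long
  getVertexNeighborNumber(const long long /*vertexId*/) const {
    return 14; // interior vertex of a Freudenthal-triangulated 3D grid
  }

  // Cold query: declaration only; its definition stays in a .cpp file.
  long long getVertexTriangleNumber(const long long vertexId) const;
};
```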

@julien-tierny (Collaborator)

hi @petersteneteg
thanks a lot for your explanations. all of this is making sense.

I think I'll go ahead and PR the commit that optimizes the performance, then (with only the getVertexNeighbor queries inlined). Thanks a lot!

@JonasLukasczyk (Contributor, Author)

I think just inlining the neighbor methods is sufficient for now. Those are the most relevant queries anyway. Maybe in the future the LUT-based procedures will also be much smaller and therefore fit into the instruction cache. We will have to see.

Labels: WIP (work in progress)