Ep2 w4 #80

Closed
wants to merge 7 commits into from

Conversation

oliviermattelaer
Member

I am opening this PR so that we have a place, attached to the code, to report findings on this idea.

This branch implements the idea of replacing the W[6] of the wavefunctions with a W[4] and retrieving the removed information from global memory (hopefully from the L1 cache).

Current points of concern with this implementation:
[ ] Need to investigate whether the convention for the four-momenta is general enough
[ ] Is this a smart idea?
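
For readers new to the idea, here is a minimal host-side C++ sketch of what the W[6] to W[4] change could look like for a single external particle. It is not the code in this branch. The `Momenta` buffer, the function names, the `sign` parameter and the slot convention (slot 0 holds (E, pz), slot 1 holds (px, py)) are all assumptions made for illustration, and a plain `std::vector` stands in for GPU global memory.

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cx = std::complex<double>;
using W6 = std::array<cx, 6>;  // two momentum slots + four wavefunction components
using W4 = std::array<cx, 4>;  // four wavefunction components only

// Hypothetical momenta buffer standing in for GPU global memory.
// Component c (0=E, 1=px, 2=py, 3=pz) of particle ipar in event ievt.
struct Momenta {
  std::size_t nevt, npar;
  std::vector<double> data;  // size 4 * npar * nevt, event index runs fastest
  Momenta(std::size_t nevt_, std::size_t npar_)
      : nevt(nevt_), npar(npar_), data(4 * npar_ * nevt_, 0.0) {}
  double& at(std::size_t ievt, std::size_t ipar, std::size_t c) {
    assert(ievt < nevt && ipar < npar && c < 4);
    return data[(ipar * 4 + c) * nevt + ievt];
  }
  double at(std::size_t ievt, std::size_t ipar, std::size_t c) const {
    assert(ievt < nevt && ipar < npar && c < 4);
    return data[(ipar * 4 + c) * nevt + ievt];
  }
};

// Rebuild the two momentum slots on demand instead of storing them.
// Assumed convention: slot 0 = (E, pz), slot 1 = (px, py), times a sign
// that distinguishes incoming from outgoing particles.
std::array<cx, 2> momentumSlots(const Momenta& p, std::size_t ievt,
                                std::size_t ipar, double sign) {
  return {cx(sign * p.at(ievt, ipar, 0), sign * p.at(ievt, ipar, 3)),
          cx(sign * p.at(ievt, ipar, 1), sign * p.at(ievt, ipar, 2))};
}

// Expand a W4 back into the W6 form by re-reading the momenta buffer.
W6 expand(const W4& w, const Momenta& p, std::size_t ievt,
          std::size_t ipar, double sign) {
  const std::array<cx, 2> m = momentumSlots(p, ievt, ipar, sign);
  return {m[0], m[1], w[0], w[1], w[2], w[3]};
}
```

The trade-off in the sketch is the one discussed in this PR: each wavefunction carries two fewer complex values, and every use of the momentum costs an extra read of the momenta buffer.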

@oliviermattelaer
Member Author

Here is some comparison for the e+ e- > mu+ mu- process:
Screenshot 2020-12-01 at 22 10 09

Blue is the "W4" branch and orange is "W6".
So we can see that this is going in the wrong direction (compute decreases and memory increases).

Screenshot 2020-12-01 at 22 17 48

We can see the increase in global memory calls in this graph.

Screenshot 2020-12-01 at 22 19 03

And the absence of any register gain in this plot.

So, as expected, this does not seem to be a good idea for ee_mumu.

@oliviermattelaer
Member Author

oliviermattelaer commented Dec 1, 2020

Here is the same comparison for g g > t t~ g g
(blue is W4 and red is W6):
Screenshot 2020-12-01 at 23 22 52

No big difference here, and it seems to go in the right direction.

Screenshot 2020-12-01 at 23 23 35

Here the L1 hit rate is much higher (as is the amount of data transferred).

Screenshot 2020-12-01 at 23 26 45

And the number of registers is still at the maximum possible...

So it is not clear whether this is worth it or not.
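
As a rough back-of-the-envelope check on why the register count might stay at the maximum, the arithmetic can be sketched as below. This is only an illustration under assumptions not stated in this thread: double-precision complex components, 32-bit registers, and a cap of 255 registers per thread. The compiler decides the real allocation, so these numbers are upper bounds, not measurements.

```cpp
#include <cstddef>

// Assumptions (not from the thread): double precision, 32-bit registers,
// and a cap of 255 registers per thread.
constexpr std::size_t kRegistersPerDouble = 2;  // one 64-bit double = two 32-bit registers
constexpr std::size_t kMaxRegistersPerThread = 255;

// Registers needed to keep one wavefunction of ncomp complex components live.
constexpr std::size_t wavefunctionRegisters(std::size_t ncomp) {
  return ncomp * 2 * kRegistersPerDouble;  // real + imaginary part per component
}

// Upper bound on how many wavefunctions fit in registers at once,
// ignoring every other value the kernel needs.
constexpr std::size_t maxLiveWavefunctions(std::size_t ncomp) {
  return kMaxRegistersPerThread / wavefunctionRegisters(ncomp);
}

static_assert(wavefunctionRegisters(6) == 24, "W6: 6 complex doubles");
static_assert(wavefunctionRegisters(4) == 16, "W4: 4 complex doubles");
```

Under these assumptions W4 saves 8 registers per live wavefunction. A kernel that keeps many wavefunctions live at once could still sit at the cap with either layout, which would be one possible reading of the plot above.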

@oliviermattelaer
Member Author

One nice comparison showing that this significantly impacts the number of registers is the number of registers used during kernel execution (yellow means the maximum number of registers). Here is the plot for W6:

Screenshot 2020-12-01 at 23 51 09

And here is the new plot for W4:
Screenshot 2020-12-01 at 23 53 57

So we have a couple of bottlenecks to look at, but this might be worth pursuing.

@valassi
Member

valassi commented Dec 3, 2020

Hi Olivier, thanks a lot, very interesting and very good idea to open an issue to document it!

Not sure, maybe keep it in progress for the time being while we do more tests to try to understand this better? It seems that it is not a game changer. But if we do find other game changers (CUDA streams/graphs...?), maybe we should reevaluate this and we will see larger effects. Just brainstorming.

How do you produce the plots of registers actually used, by the way? Interesting plots.

Thanks Andrea
