Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(into master_june24) merge of master into master_june24 and channelid fixes/reimplementation #882

Open
wants to merge 682 commits into
base: master_june24
Choose a base branch
from

Conversation

valassi
Copy link
Member

@valassi valassi commented Jul 4, 2024

Hi @oliviermattelaer this is a WIP PR to start working towards the resync of master and master_june24. From what I understand this is one of the things you want to push with high priority.

(PS IMPORTANT a summary for review is in #882 (comment) below)

This is constructed as a merge into master_june24.
That is to say, I start from what you and Stefan had in master_june24 (as a result of Stefan's channelid PR #830, related to the warp issue #765), and I start porting a few of the master stuff, rather than going the opposite way. This allows me to go in steps with things I know (the various steps in master).

For the moment, here I am just merging the latest master CI (with tmad tests) into master_june24. Since the CI is enabled also for master_june24, I expect that the new tests should run, and the results may be interesting.

Speaking of which, @roiser @oliviermattelaer, how did you test the code in master_june24?

  • am I supposed to use a different input.txt file to pipe to madevent to specify a range of iconfig's, or will the current one with a single iconfig value be enough?
  • if I am supposed to use the same input.txt with a single iconfig (by looking at driver.f which has not changed I would guess this is the case), can you confirm that the code will still test the new functionality you have created and have a channelid array with different values, or will this result in a channelid array which all have the same value?
  • (@oliviermattelaer for my information, not directly or immediately relevant for tests: is the madevent fortran/python/bash infrastructure to orchestrate fewer G* jobs with many channels per job complete, or is this still under development?)
  • (and also for my information if I should have issues in the code: do I remember correctly that a channelID array eg of 32k channels will be segmented such that inside each 32-channel warp the channelid is the same, but different warps can have different channelids? or did you eventually modify the logic of this?)

Thanks,
Andrea

PS For the context: master_june24 mainly differs from master because of the addition of Stefan's channelid MR #830 which is connected to Olivier's warp work in #765

@valassi
Copy link
Member Author

valassi commented Jul 4, 2024

There are 49 errors in the CI. I opened #883 #884 #885

image

@valassi
Copy link
Member Author

valassi commented Jul 4, 2024

I am trying to fix issues in MG5AMC. Will do a force push and file an issue

@valassi
Copy link
Member Author

valassi commented Jul 4, 2024

I have tried to upgrade MG5AMC from the current eef200f94 to gpucpp_june24, but this fails codegen #886. I have reverted.

I will instead create a branch where I merge gpucpp on top of the eef200f94 which is currently in master_june24.

@valassi
Copy link
Member Author

valassi commented Jul 4, 2024

This is annoying. I upgraded MG5AMC including the rotxxx fix #857 that I used for the crash #855. This has NOT fixed the CI crash of madevent in all CI tests #885. I will need to use a debug build with gdb.

There are still 49 failing tests.
image

@valassi
Copy link
Member Author

valassi commented Jul 4, 2024

I have fixed a minor typo in unit_v for MAC #883. Marking it as fixed.

Now there are only 45 CI errors instead of 49

image

@valassi valassi changed the title WIP: merge of master into master_june24 (for the moment: add master CI to master_june24) WIP: merge of master into master_june24 (for the moment: add master CI to master_june24, identify/fix some issues) Jul 4, 2024
@valassi
Copy link
Member Author

valassi commented Jul 4, 2024

I have fixed another minor issue #884 failing tghe builds for FPTYPE=m. There was one line forgotten from a previous implementation, it should have been removed for FPTYPE=m and was not.

Now down from 45 to 39 errors, all related to #885 crashes in the new CI tests I think. The old CI tests are now all succeeding.

image

@valassi
Copy link
Member Author

valassi commented Jul 5, 2024

I investigated #885 and found that the crash only happens when setting VECSIZE_USED different from VECSIZE_MEMMAX. In the CI in my initial tests VECSIZE_MEMMAX was 16384 and VECSIZE_USED was 32, so this crashed.

Looking more into that, I realised that I was not sure what parameters I should use for NB_WARP and WARP_SIZE. This is discussed in #887. I gave it a try to use NB_WARP=512 with WARP_SIZE=32 ie VECSIZE_MEMAMX=16384. With VECSIZE_USED=32, this still crashes in #885. But in addition I also get a Fortran runtime error in symconf #888.

@valassi
Copy link
Member Author

valassi commented Jul 5, 2024

I added a workaround (NOT a fix) for crash #885 just to allow the CI tests to proceed further. Essentially I put down NB_WARP=1 and WARP_SIZE=32 so that VECSIZE_MEMAMX=32 is the same as VECSIZE_USED=32. This avoids the crash (but avoids testing anything interesting in the new warp infrastructure, making it pointless). A proper fix for #885 (and for #888) is needed.

Apart from other 'expected' failures, there is a xsec mismatch for ggttggg #889. This can be fixed by increasing tolerance

There are ten failures overall. The other 9 are the usual #826 and #872 (pensing in master) and #856 (fixed in master, to be merged here).

image

@valassi
Copy link
Member Author

valassi commented Jul 5, 2024

Ok I added a workaround for the tolerances #889, now dowsn to 9 errors

image

The crash #885 becomes high priority here, otherwise we are not testing anything intresting, only NB_WARP=1...

1 similar comment
@valassi
Copy link
Member Author

valassi commented Jul 5, 2024

Ok I added a workaround for the tolerances #889, now dowsn to 9 errors

image

The crash #885 becomes high priority here, otherwise we are not testing anything intresting, only NB_WARP=1...

@valassi
Copy link
Member Author

valassi commented Jul 5, 2024

Hi @oliviermattelaer as you see I made some progress here, but I am putting this work on hold.

I need some answers on #887 (and it is possible that without VECSIZE_USED I go nowhere and I need to wait for that).

(Or at least: probably I will continue merging bits of master into master_june24 so that we avoid a complete divergence, but I would argue against merging back master_june24 into master until many of these issues are fixed... there are just too many things that seem to not have been tested)

valassi and others added 14 commits July 29, 2024 17:47
Olivier's suggestion for better pep08 formatting in python

Co-authored-by: oliviermattelaer <[email protected]>
Olivier's suggestion for better pep08 formatting in python

Co-authored-by: oliviermattelaer <[email protected]>
…er merging

git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…rder madgraph5#918, and smaller patch madgraph5#939 and madgraph5#849) into susy2

Note: there is no change with respect to the previous origin/susy2 after this merge (no need to rerun the CI)
Regenerate code (and susyggt1t1 reference logs) with couplings fix from PR 918
…ier merging

git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…aster for easier merging

git checkout upstream/master $(git ls-tree --name-only HEAD tmad/logs* tput/logs* | \grep susyggt1t1)
…eam/master for easier merging

git checkout upstream/master $(git ls-tree --name-only HEAD CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/test/ref/dump_CPUTest.Sigma_MSSM_SLHA2_gg_t1t1x.txt susy_gg_t1t1.*/test/ref/dump_CPUTest.Sigma_MSSM_SLHA2_gg_t1t1x.txt)
Fix conflicts: epochX/cudacpp/tput/throughputX.sh
…ference log files for susy_gg_t1t1.mad after merging upstream/master

CUDACPP_RUNTEST_DUMPEVENTS=1 ./build.cuda_d_inl0_hrd0/runTest_cuda.exe
\cp ../../test/ref/dump* ../../../CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/test/ref/
\cp ../../test/ref/dump* ../../../susy_gg_t1t1.sa/test/ref/
@valassi
Copy link
Member Author

valassi commented Jul 29, 2024

In the meantime I merged upstream/master with #918 and #849

valassi and others added 14 commits August 2, 2024 17:11
…eleting GpuAbstraction.h and GpuRuntime.h madgraph5#947

./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/Source/makefile gg_tt.mad/Source/dsample.f gg_tt.mad/Source/genps.inc gg_tt.mad/SubProcesses/makefile > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/bin/internal/banner.py gg_tt.mad/bin/internal/gen_ximprove.py gg_tt.mad/bin/internal/madevent_interface.py >> CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad
…aph5#947 fix: do not delete G* if it is not a directory (e.g. GpuAbstraction.h)
…adgraph5#947 fix: do not delete G* if it is not a directory (e.g. GpuAbstraction.h)
…madgraph5#947 in madevent_interface.py (do not delete GpuAbstraction.h) via mg5amcnlo#125
…ing any python file* as suggested by Olivier, after moving the fix for madgraph5#947 to mg5amcnlo#125

Also remove Source/dsample.f from patch.common as this is no longer modified.

The only files that still need to be patched are
- 3 in patch.common: Source/makefile, Source/genps.inc, SubProcesses/makefile
- 3 in patch.P1: auto_dsig1.f, driver.f, matrix1.f

./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/Source/makefile gg_tt.mad/Source/genps.inc gg_tt.mad/SubProcesses/makefile > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad
…gpucpp after merging/squashing mg5amcnlo#125 fixing madgraph5#947, with an additional fix by Olivier)
…ne24) to f0b429915 (current valassi_gpucpp_june24 including merge of mg5amcnlo#125 to fix madgraph5#947)
upgrade MG5AMC to latest gpucpp and regenerate processes
…dgraph5#949 fixing madgraph5#947 via mg5amcnlo#125) into june24

Fix conflicts: epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common (keep the upstream/master version)
… upstream/master

The patches do not change substantially (only some line/commit numbers change)
NB1: with respect to the previous june24, this clarifies that no python files are patches any more
NB2: with respect to upstream/master, this also removes auto_dsig1.f

The only files that still need to be patched are
- 3 in patch.common: Source/makefile, Source/genps.inc, SubProcesses/makefile
- 2 in patch.P1: driver.f, matrix1.f

./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/Source/makefile gg_tt.mad/Source/genps.inc gg_tt.mad/SubProcesses/makefile > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad
@valassi
Copy link
Member Author

valassi commented Aug 2, 2024

In the meantime I merged upstream/master with #949

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment