Frontier Benchmarking (#453) #881

Open: wants to merge 37 commits into base: master

@Malmahrouqi3
Collaborator

Malmahrouqi3 commented Jun 11, 2025

Description

Adds one GPU benchmarking case that submits SLURM jobs on Frontier, duplicating the existing Phoenix implementation. (#453)

Manual benchmarking:

Cloning

git clone --depth 1 https://github.com/MFlowCode/MFC.git master
git clone https://github.com/Malmahrouqi3/MFC-mo2.git pr --branch frontier-CI2

Copying Bash Scripts into master

rm -rf master/.github/workflows/*
cp -r pr/.github/workflows/* master/.github/workflows/

Submit Benchmark Jobs

bash pr/.github/workflows/frontier/submit-bench.sh pr/.github/workflows/frontier/bench.sh gpu
bash master/.github/workflows/frontier/submit-bench.sh master/.github/workflows/frontier/bench.sh gpu

Process Benchmark Results (once the SLURM jobs are done)

cd pr && . ./mfc.sh load -c f -m g
./mfc.sh bench_diff ../master/bench-gpu.yaml ../pr/bench-gpu.yaml
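
The last step only makes sense after both jobs have written their results. A minimal sketch of that wait is below. It assumes each job produces a non-empty `bench-gpu.yaml` in its checkout only when it finishes, which the steps above imply but do not state. `wait_for_files` and `POLL_INTERVAL` are hypothetical names, not part of MFC.

```shell
#!/usr/bin/env bash
# Sketch: block until every benchmark result file exists, or give up.
# Assumption: a job writes its bench-gpu.yaml only once it has finished.

# wait_for_files TIMEOUT_SECONDS FILE...
# Returns 0 once every FILE exists and is non-empty,
# 1 if TIMEOUT_SECONDS elapse first.
wait_for_files() {
    local timeout=$1; shift
    local interval=${POLL_INTERVAL:-30}   # seconds between checks
    local waited=0 f missing
    while :; do
        missing=0
        for f in "$@"; do
            [ -s "$f" ] || missing=1
        done
        [ "$missing" -eq 0 ] && return 0
        [ "$waited" -ge "$timeout" ] && return 1
        sleep "$interval"
        waited=$((waited + interval))
    done
}

# Usage, from the directory that contains master/ and pr/:
#   wait_for_files 10800 master/bench-gpu.yaml pr/bench-gpu.yaml &&
#       (cd pr && . ./mfc.sh load -c f -m g &&
#        ./mfc.sh bench_diff ../master/bench-gpu.yaml ../pr/bench-gpu.yaml)
```

The 10800-second timeout in the usage comment is only an illustrative value. A file that is still being written would also pass the non-empty check, so this is a convenience and not a guarantee that the job has completed.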

codecov bot commented Jun 12, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 45.98%. Comparing base (4864d36) to head (32a8292).

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #881   +/-   ##
=======================================
  Coverage   45.98%   45.98%           
=======================================
  Files          68       68           
  Lines       18629    18629           
  Branches     2239     2239           
=======================================
  Hits         8566     8566           
  Misses       8711     8711           
  Partials     1352     1352           

@Malmahrouqi3
Collaborator Author

Reduced the job duration to 3 hrs to see whether it would yield the same error regardless of duration.

@sbryngelson sbryngelson requested a review from Copilot June 21, 2025 16:44
@sbryngelson
Member

Most Recent Failed Frontier-Benchmark Commit: MFlowCode/MFC/actions/runs/15743760213

Run (cd pr     && bash .github/workflows/frontier/submit-bench.sh .github/workflows/frontier/bench.sh gpu) &
  (cd pr     && bash .github/workflows/frontier/submit-bench.sh .github/workflows/frontier/bench.sh gpu) &
  (cd master && bash .github/workflows/frontier/submit-bench.sh .github/workflows/frontier/bench.sh gpu) &
  wait %1 && wait %2
  shell: /usr/bin/bash -e {0}
  env:
    ACTIONS_RUNNER_FORCE_ACTIONS_NODE_VERSION: node16
    ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION: true
bash: .github/workflows/frontier/submit-bench.sh: No such file or directory
sbatch: error: Batch script contains DOS line breaks (\r\n)
sbatch: error: instead of expected UNIX line breaks (\n).
Error: Process completed with exit code 1.
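
In the step above, `wait %1 && wait %2` never runs the second `wait` if the first one fails, so only one exit status is reported. A hedged sketch of an alternative that waits on both background commands and prints both exit codes follows. `run_both` is a hypothetical helper, not something in the MFC workflows.

```shell
#!/usr/bin/env bash
# Sketch: run two commands in parallel and report both exit codes.

# run_both CMD_A CMD_B
# Each argument is a command string executed with `bash -c`.
# Returns 0 only if both commands succeed.
run_both() {
    local cmd_a=$1 cmd_b=$2 pid_a pid_b rc_a=0 rc_b=0
    bash -c "$cmd_a" & pid_a=$!
    bash -c "$cmd_b" & pid_b=$!
    # Wait on each PID separately so neither status is skipped.
    wait "$pid_a" || rc_a=$?
    wait "$pid_b" || rc_b=$?
    echo "first exit code: $rc_a"
    echo "second exit code: $rc_b"
    [ "$rc_a" -eq 0 ] && [ "$rc_b" -eq 0 ]
}

# Usage, mirroring the two submissions in the log:
#   run_both \
#     'cd pr     && bash .github/workflows/frontier/submit-bench.sh .github/workflows/frontier/bench.sh gpu' \
#     'cd master && bash .github/workflows/frontier/submit-bench.sh .github/workflows/frontier/bench.sh gpu'
```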

This PR has taken needlessly long, so to bypass the wait time of the entire CI, I will remove all .github/workflows content/files except the Frontier Benchmark test. I just wanted to confirm that the existing bash scripts succeed specifically for [Oak Ridge | Frontier (CCE) (gpu)]. Afterwards, the removed files will be restored.

Edit: I tried dos2unix, which did absolutely nothing, and when I ran git add on the bash files, they showed no change at all.

I fixed it. You dos2unixed the wrong file.
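
Since the fix came down to converting the right file, a check that lists which files still contain carriage returns can help. The sketch below is illustrative only. `has_crlf`, `strip_crlf`, and `list_crlf` are hypothetical helper names, and `strip_crlf` is a plain `sed` stand-in for `dos2unix`, not the tool itself.

```shell
#!/usr/bin/env bash
# Sketch: find and fix DOS line endings in a directory of scripts.

# has_crlf FILE: succeed if FILE contains at least one carriage return.
has_crlf() {
    grep -q $'\r' "$1"
}

# strip_crlf FILE: remove a trailing carriage return from every line.
# Goes through a temp file and writes back with `cat`, so the file keeps
# its permissions and the same form works with GNU and BSD sed.
strip_crlf() {
    local tmp status
    tmp=$(mktemp) || return 1
    sed $'s/\r$//' "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# list_crlf DIR: print every regular file under DIR that contains a
# carriage return. grep exits non-zero when nothing matches, which is
# not an error here, hence the `|| true`.
list_crlf() {
    find "$1" -type f -exec grep -l $'\r' {} + || true
}

# Usage:
#   list_crlf .github/workflows/frontier
```

`git ls-files --eol <path>` reports the line endings recorded in the index and in the working tree separately, which may help explain a case where a conversion appears to produce no change for `git add`.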

@Malmahrouqi3
Collaborator Author

I ran dos2unix on all the files in the frontier directory. Anyway, thanks. I will wait to see whether it passes the test now.

@sbryngelson
Member

This benchmark test will never pass in its current state because the Frontier benchmarking files do not exist on the master branch, hence this error:

(cd pr     && bash .github/workflows/frontier/submit-bench.sh .github/workflows/frontier/bench.sh gpu) &
  (cd pr     && bash .github/workflows/frontier/submit-bench.sh .github/workflows/frontier/bench.sh gpu) &
  (cd master && bash .github/workflows/frontier/submit-bench.sh .github/workflows/frontier/bench.sh gpu) &
  wait %1 && wait %2
  shell: /usr/bin/bash -e {0}
  env:
    ACTIONS_RUNNER_FORCE_ACTIONS_NODE_VERSION: node16
    ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION: true
bash: .github/workflows/frontier/submit-bench.sh: No such file or directory
Submitted batch job 3531713

once it looks like everything is working as well as one can expect, we can merge in the minimal files (.github/workflows/*) and then create a new PR that tests it properly.
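
Until those files land on master, a pre-flight check could surface the missing scripts before anything is submitted. This is a sketch only. `missing_scripts` is a hypothetical helper, and the two file names are taken from the commands in the log above.

```shell
#!/usr/bin/env bash
# Sketch: verify both checkouts contain the Frontier benchmark scripts.

# missing_scripts ROOT...
# Prints one line per missing script and returns 1 if any is missing.
missing_scripts() {
    local root script status=0
    for root in "$@"; do
        for script in submit-bench.sh bench.sh; do
            if [ ! -f "$root/.github/workflows/frontier/$script" ]; then
                echo "$root: missing .github/workflows/frontier/$script"
                status=1
            fi
        done
    done
    return "$status"
}

# Usage: fail fast before submitting any job.
#   missing_scripts pr master || exit 1
```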

@Malmahrouqi3
Collaborator Author

aight, either I or someone else has to test it manually by cloning master & pr, adding the bash files to each, and then benchmarking on Frontier as a SLURM/interactive job to make sure nothing gets corrupted in the process.

@wilfonba
Contributor

I verified that this works on my end. The IBM case still gives NaNs though...

@Malmahrouqi3
Collaborator Author

Malmahrouqi3 commented Jun 25, 2025

I verified that this works on my end. The IBM case still gives NaNs though...

Thanks much, and I wonder what the deal is with the IBM case ngl. Any specific error messages or such? If the issue persists, we can just exclude that case somehow. Also, I guess NaNs won't fail the test, as can be seen in my recent PR, where I assigned null to the IBM grind/exec values: #895 (comment)

Edit: lmk, if you suspect anything that might have caused that.
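
If NaNs do not fail the test on their own, a separate scan of the result files could at least flag them. The sketch below assumes the results are plain text in which a NaN appears as the token `nan` in some letter case (for example `NaN` or YAML's `.nan`). The actual layout of `bench-gpu.yaml` is not shown in this thread, and `has_nan` and `report_nan` are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: flag NaN values in benchmark result files.
# Assumption: a NaN shows up as the standalone token "nan" (any case).

# has_nan FILE: succeed if FILE contains the token.
has_nan() {
    grep -qiw 'nan' "$1"
}

# report_nan FILE...: print matching lines with file name and line number.
report_nan() {
    grep -Hniw 'nan' "$@"
}

# Usage:
#   if has_nan pr/bench-gpu.yaml; then report_nan pr/bench-gpu.yaml; fi
```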

@wilfonba
Contributor

Well, the NaN issue was supposed to be fixed by #892 but it appears that that's not the case

@sbryngelson
Member

status?

@Malmahrouqi3
Collaborator Author

@sbryngelson done on my end tbh and nothing to add

@sbryngelson
Member

Well, the NaN issue was supposed to be fixed by #892 but it appears that that's not the case

what's going on here?

@wilfonba
Contributor

Well, the NaN issue was supposed to be fixed by #892 but it appears that that's not the case

what's going on here?

Any ideas @anandrdbz ?
