WeeklyTelcon_20200623
        Geoffrey Paulsen edited this page Jun 29, 2020 
        ·
        1 revision
      
    - Dialup Info: (Do not post to public mailing list or public wiki)
- Aurelien Bouteiller (UTK)
- Austen Lauria (IBM)
- Barrett, Brian (AWS)
- Brendan Cunningham (Intel)
- Christoph Niethammer (HL
- Edgar Gabriel (UH)
- Geoffrey Paulsen (IBM)
- Harumi Kuno (HPE)
- Howard Pritchard (LANL)
- Jeff Squyres (Cisco)
- Joseph Schuchart
- Matthew Dosanjh (Sandia)
- Nathan Hjelm (Google)
- Naughton III, Thomas (ORNL)
- Ralph Castain (Intel)
- Todd Kordenbrock (Sandia)
- William Zhang
- Akshay Venkatesh (NVIDIA)
- Artem Polyakov (nVidia/Mellanox)
- Brandon Yates (Intel)
- Charles Shereda (LLNL)
- David Bernhold (ORNL)
- Erik Zeiske
- Geoffroy Vallee (ARM)
- George Bosilca (UTK)
- Josh Hursey (IBM)
- Joshua Ladd (nVidia/Mellanox)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Michael Heinz (Intel)
- Noah Evans (Sandia)
- Scott Breyer (Sandia?)
- Shintaro Iwasaki
- William Zhang (AWS)
- Xin Zhao (nVidia/Mellanox)
- mohan (AWS)
Blockers All Open Blockers
Review v4.1.x Milestones v4.1.0
- Schedule:  Want to release mid-July
- RC1 probably can't happen by the end of this week; a lot of big PRs are outstanding.
 
- Release Engineers: Brian (AWS) Jeff Squyres (Cisco)
- We've come to consensus for a v4.1.0 release
- Need include/exclude selection, worried about consistent selection.
- A lot of PRs outstanding, but can't merge until
- Patch for OFI stuff messed up v4.1.x branch.
- Howard has a fix PR, Jeff is looking at.
 
- Howard changed new OFI BTL parameters to be consistent with MTL
- Not breaking ABI or backwards compatibility.
- v4.1.x branch, branched from v4.0.4 tag.
- NOT touching runtime!!!
- Not going to be pulling in a new PMIx version.
 
- All MTT is online on v4.1.x branch
- Not compiling under SLURM EFA test. (OFI BTL issue)
Review v4.0.x Milestones v4.0.4
- v4.0.4 Released
- v4.0.5 - No schedule yet.
- Two potential drivers for a quick v4.0.5 turn-around.
- OSC RDMA Bug - May drive a v4.0.5 release.
- Program Aborts on detach.
 
- OSC pt2pt we have on v4.0.x
- Fragmented Puts, the counting is not correct for a particular user request
- Non-contiguous rPuts.
- Also needed in a v4.0.5
 
- How urgent is ROMIO fix?
- Good to have in v4.0.5, but hard to make testcase to hit.
 
- usNic failing almost all multi-node tests on v4.0.x
- Jeff started to look at this last week, but didn't get to finish.
- v4.0.x WAS working, and seeing Master failing.
- ACTION - check back next week.
 
- iWarp support Issue 7861.
- How are we supposed to run iWarp in Open MPI v4.0.x?
- How much do we care about iWarp?
- At a minimum need to update FAQ.
 
Review v5.0.0 Milestones v5.0.0
- Need to put OSC pt2pt
- OSC RDMA requires a single BTL that can contact every single process.
- This didn't use to be the case. (Comment in the code)
 
- We can't use the OSC pt2pt.
- It is not thread safe and doesn't conform to the MPI 4 standard.
- This is just a testing fallacy. Could add tests to show this, but we'd still be in the same boat.
- Either product A or B is broken and we need to fix it.
 
- RDMA one-sided should fall back to "my atomics" because TCP will never have RDMA atomics.
- The idea was to put the atomics into the BTL base, which could do all of the one-sided atomics under the covers.
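
The fallback idea above can be illustrated with a small sketch. This is not Open MPI code: the class `SoftwareAtomics` and the helper `pick_atomics` are hypothetical names, and a lock-protected Python list stands in for the target memory region. It only shows the general shape of "use transport atomics when they exist, otherwise emulate them in a common base layer".

```python
import threading


class SoftwareAtomics:
    """Hypothetical lock-based emulation of atomic operations on a memory region."""

    def __init__(self, size):
        self._mem = [0] * size          # stand-in for the exposed window memory
        self._lock = threading.Lock()   # serializes every emulated atomic

    def get(self, index):
        with self._lock:
            return self._mem[index]

    def fetch_add(self, index, value):
        """Add value at index and return the previous contents."""
        with self._lock:
            old = self._mem[index]
            self._mem[index] = old + value
            return old

    def compare_and_swap(self, index, expected, desired):
        """Store desired only if the current value equals expected; return the old value."""
        with self._lock:
            old = self._mem[index]
            if old == expected:
                self._mem[index] = desired
            return old


def pick_atomics(transport_atomics, size):
    """Use the transport's own atomics if it has any (assumption: None means it has none)."""
    if transport_atomics is not None:
        return transport_atomics
    return SoftwareAtomics(size)


# A TCP-like transport offers no RDMA atomics, so the software path is chosen.
atomics = pick_atomics(None, size=4)
atomics.fetch_add(0, 5)
atomics.compare_and_swap(0, 5, 9)
```

Keeping the emulation in one shared place means the one-sided layer can call the same three operations regardless of which transport sits underneath.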
 
- Jeff will close the PR, and
- Jeff will Nathan will fetching, get, compare and swap.
- Two new PRs for MPI4.0 Error handling - new PRs from Aurelien Bouteiller.
- Does UCX support iWarp?
- Does libFabric support iWarp via the verbs provider?
- https://github.com/openucx/ucx/issues/2507 suggests UCX doesn't.
- Brian thinks that libFabric
- OFI can support iWarp, just need to specify the provider in the include list.
- The person who's asking is a partner, not a customer.
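
As a rough illustration of how an include list narrows provider selection, here is a minimal sketch. It is not the OFI component's actual logic: `select_providers` is a hypothetical helper, and treating the include and exclude lists as mutually exclusive is an assumption made for this example.

```python
def select_providers(available, include=None, exclude=None):
    """Filter providers, keeping the priority order of `available`.

    Assumption for this sketch: an include list and an exclude list
    may not be given at the same time.
    """
    if include and exclude:
        raise ValueError("specify an include list or an exclude list, not both")
    if include:
        wanted = set(include)
        return [p for p in available if p in wanted]
    if exclude:
        unwanted = set(exclude)
        return [p for p in available if p not in unwanted]
    return list(available)


# Example provider names; which ones are present depends on the system.
available = ["efa", "verbs", "tcp"]
iwarp_choice = select_providers(available, include=["verbs"])
no_tcp = select_providers(available, exclude=["tcp"])
```

If two components apply the same function to the same lists, they end up with the same provider set, which is the consistency concern raised for v4.1.0.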
 
- PMIx - Working on PMIx v4.0.0, which is what Open MPI v5.0 will use.
- Sessions needs something from PMIx v4
- ULFM - not sure if it needs PMIx, think it needs PRRTE changes.
- PPN scaling issue - simple algorithmic issue in this function
- The PMIx community talked about it. Artem might know someone who might be interested in working on it.
- Algorithm behind one of the interfaces doesn't scale well.
- Not a regression. Above ~ 4K nodes, becomes quadratic.
 
 
- PRRTE - Nothing's happening there.
 
- Austen went through master
- UCX is failing in certain test cases, SEGV
- Austen will open an issue.
 
- PRRTE is hitting an assert in some cases.
- Austen will open an issue.
 
- Remaining CISCO failures look like connectivity issues.
- Jeff hasn't got to look deeper to see
- Looks like USNIC is either not being picked or disqualifying itself internic.
 
- CLANG - added float16
- Need to add a special compiler flag for software emulation of float16.
- We will not magically add that flag.
 
- Many companies are not allowing face-to-face travel until 2021 due to COVID-19.
- Instead, let's do a series of virtual face-to-face meetings?
 
- Yes this summer to discuss for v5.0
- Maybe we can do it by topic?
- Maybe not 4 or 8 hour things.
 
- Different topics on different days.
- Do a Doodle poll of least-worst days in late July/August.
- Start a list of topics.
- George and Jeff will help plan and come to community.
- May not have the Supercomputing conference at ALL this year.
- Many other projects are doing a virtual "state of the union" type meeting to try to cover what they'd usually do in a Birds of a Feather meeting.
- If this works pretty well, do this a couple of times a year.
- Not constrained to Supercomputing
- Scale testing: PRs have to opt into it.