Replace Travis CI with GitHub Actions #85
Remove .travis.yml and add .github/workflows/ci.yml. Tests Julia 1 on Linux (x64) and macOS (Apple Silicon), and Julia pre on Linux. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevent the default Grid from being garbage collected while distributed matrices or vectors are still alive, matching the pattern already used by DistMatrix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
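The lifetime pattern described in this commit can be sketched in plain Julia without Elemental. `FakeGrid`, `FakeDistVector`, `DEFAULT_GRID`, `default_grid`, and `make_vector` are hypothetical stand-ins, not names from Elemental.jl. The sketch assumes only that the wrapper stores the grid in a field, which keeps the grid reachable (and its finalizer from running) for as long as the wrapper is alive.

```julia
# Hypothetical stand-ins for illustration; these are not Elemental.jl types.
mutable struct FakeGrid
    handle::Int
    function FakeGrid(handle::Int)
        g = new(handle)
        # Mark the handle as released once the grid is collected.
        finalizer(x -> (x.handle = -1), g)
        return g
    end
end

struct FakeDistVector
    data::Vector{Float64}
    grid::FakeGrid   # holding the reference keeps the grid reachable
end

# Lazily created default grid, held in a global slot.
const DEFAULT_GRID = Ref{Union{Nothing,FakeGrid}}(nothing)

function default_grid()
    if DEFAULT_GRID[] === nothing
        DEFAULT_GRID[] = FakeGrid(1)
    end
    return DEFAULT_GRID[]
end

# The vector captures the grid it was built on.
make_vector(n::Integer) = FakeDistVector(zeros(n), default_grid())

v = make_vector(3)
DEFAULT_GRID[] = nothing   # drop the global reference to the grid
GC.gc()                    # the grid survives because v.grid still points to it
println(v.grid.handle)
```

Without the `grid` field, nothing would reference the grid after the global slot is cleared, and a later collection could finalize it while the vector is still in use.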
MPICH 5.0 adopted the standard MPI ABI where MPI_Comm changed from int to a pointer type. The pre-compiled Elemental binary is incompatible with this new ABI, causing segfaults in Grid::VCSize(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Elemental binary was linked against libmpi.so.12 (MPICH 3.x). MPICH 4.x may have internal ABI differences despite the same soname. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Elemental binary segfaults on Julia 1.12 in Grid::VCSize() regardless of MPICH version. Adding 1.10 to determine if this is a Julia version compatibility issue. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The C interface functions ElDistSparseMatrixCreate and ElDistMultiVecCreate take an ElConstGrid (Grid pointer), not an MPI_Comm. The Julia wrappers were incorrectly passing an ElComm integer which got reinterpret_cast'd as a Grid pointer, causing a null pointer dereference in Grid::VCSize(). Also revert the MPICH_jll version constraint and Julia 1.10 CI entry since those were not the root cause. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ElDistSparseMatrixComm and ElDistMultiVecComm don't exist in the C API. Replace comm(A) calls with A.grid and remove the dead comm functions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@andreasnoack tests are passing now! |
|
I'm surprised that the last commit causes |
|
I am not sure, currently trying to figure it out |
|
@andreasnoack Is it okay to reduce n0 and n1 in lav.jl? The tests pass with n0 = 12 and n1 = 12. |
|
pushed it to see if it works on CI |
|
@andreasnoack I switched lav.jl to El.lav(A, b) (default control path) to avoid intermittent CI timeouts/hangs while keeping LAV coverage in the standard test run. We can move the control-heavy variant to a separate slow/nightly test if preferred. |
|
How fast is it with the default settings? Is it a slowdown or a stall when using the custom control? I wonder whether it is caused by the printing. |
|
@andreasnoack I checked this in our current test setup: with the custom LPAffineCtrl path it behaves like a stall, while El.lav(A, b) completes reliably. I suspect the progress/print/time settings are a major factor (multi-rank logging overhead), so I switched this test to default settings to keep CI stable. |
|
Did you try to run it locally with the custom control? Is it only on CI that it stalls or are you able to reproduce locally? |
|
I tried it locally and it's stalling locally too. |
andreasnoack
left a comment
The stall must be a regression, but it is hard to know when it happened. I'll approve the current version, but please file an issue with the details about the stall so that it is tracked.
|
@andreasnoack Sure, I’ll do that. How long will it take for the new release to be available? |
Remove .travis.yml and add .github/workflows/ci.yml. Tests Julia 1 on Linux (x64) and macOS (Apple Silicon), and Julia pre on Linux.
Continuing work from #84.