Skip to content

Replace Travis CI with GitHub Actions#85

Merged
andreasnoack merged 14 commits intoJuliaParallel:masterfrom
AJ0070:pr-84
Apr 1, 2026
Merged

Replace Travis CI with GitHub Actions#85
andreasnoack merged 14 commits intoJuliaParallel:masterfrom
AJ0070:pr-84

Conversation

@AJ0070
Copy link
Copy Markdown
Contributor

@AJ0070 AJ0070 commented Mar 9, 2026

Remove .travis.yml and add .github/workflows/ci.yml. Tests Julia 1 on Linux (x64) and macOS (Apple Silicon), and Julia pre on Linux.

continuing work from #84

andreasnoack and others added 10 commits March 3, 2026 21:51
Remove .travis.yml and add .github/workflows/ci.yml. Tests Julia 1
on Linux (x64) and macOS (Apple Silicon), and Julia pre on Linux.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevent the default Grid from being garbage collected while
distributed matrices or vectors are still alive, matching the
pattern already used by DistMatrix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MPICH 5.0 adopted the standard MPI ABI where MPI_Comm changed from
int to a pointer type. The pre-compiled Elemental binary is
incompatible with this new ABI, causing segfaults in Grid::VCSize().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Elemental binary was linked against libmpi.so.12 (MPICH 3.x).
MPICH 4.x may have internal ABI differences despite the same soname.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Elemental binary segfaults on Julia 1.12 in Grid::VCSize()
regardless of MPICH version. Adding 1.10 to determine if this is
a Julia version compatibility issue.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The C interface functions ElDistSparseMatrixCreate and
ElDistMultiVecCreate take an ElConstGrid (Grid pointer), not an
MPI_Comm. The Julia wrappers were incorrectly passing an ElComm
integer which got reinterpret_cast'd as a Grid pointer, causing
a null pointer dereference in Grid::VCSize().

Also revert the MPICH_jll version constraint and Julia 1.10 CI
entry since those were not the root cause.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ElDistSparseMatrixComm and ElDistMultiVecComm don't exist in the
C API. Replace comm(A) calls with A.grid and remove the dead comm
functions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@AJ0070
Copy link
Copy Markdown
Contributor Author

AJ0070 commented Mar 9, 2026

@andreasnoack tests are passing now!

@andreasnoack
Copy link
Copy Markdown
Member

I'm surprised that the last commit causes lav.jl to time out. Any idea why that could be?

@AJ0070
Copy link
Copy Markdown
Contributor Author

AJ0070 commented Mar 11, 2026

I am not sure, currently trying to figure it out

@AJ0070
Copy link
Copy Markdown
Contributor Author

AJ0070 commented Mar 31, 2026

@andreasnoack Is it okay to reduce the n0 and n1 in lav.jl? tests are passing on n0 = 12 and n1=12

@AJ0070
Copy link
Copy Markdown
Contributor Author

AJ0070 commented Mar 31, 2026

pushed it to see if it works on CI

@AJ0070
Copy link
Copy Markdown
Contributor Author

AJ0070 commented Mar 31, 2026

@andreasnoack I switched lav.jl to El.lav(A, b) (default control path) to avoid intermittent CI timeouts/hangs while keeping LAV coverage in the standard test run. We can move the control-heavy variant to a separate slow/nightly test if preferred.

@andreasnoack
Copy link
Copy Markdown
Member

How fast is it with the default settings? Is it a slowdown or a stall when using the custom control? I'm speculating if it caused by the printing.

@AJ0070
Copy link
Copy Markdown
Contributor Author

AJ0070 commented Mar 31, 2026

@andreasnoack I checked this in our current test setup: with the custom LPAffineCtrl path it behaves like a stall, while El.lav(A, b) completes reliably. I suspect the progress/print/time settings are a major factor (multi-rank logging overhead), so I switched this test to default settings to keep CI stable.

@andreasnoack
Copy link
Copy Markdown
Member

Did you try to run it locally with the custom control? Is it only on CI that it stalls or are you able to reproduce locally?

@AJ0070
Copy link
Copy Markdown
Contributor Author

AJ0070 commented Apr 1, 2026

I tried it locally and it's stalling locally too.

Copy link
Copy Markdown
Member

@andreasnoack andreasnoack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stall must be a regression but it is hard to know when it happened. I'll approve the current version but please file an issue with the details about the stall, such that it is tracked.

@andreasnoack andreasnoack merged commit 19426bd into JuliaParallel:master Apr 1, 2026
3 checks passed
@AJ0070
Copy link
Copy Markdown
Contributor Author

AJ0070 commented Apr 1, 2026

@andreasnoack Sure, I’ll do that. How long will it take for the new release to be available?

@andreasnoack
Copy link
Copy Markdown
Member

JuliaRegistries/General#151853

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants