boegel
Contributor

@boegel boegel commented Aug 21, 2025

Will need merge conflict fix after merging of:

@boegel boegel added the 2025.06-software.eessi.io 2025.06 version of software.eessi.io label Aug 21, 2025
@boegel
Contributor Author

boegel commented Aug 21, 2025

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws arch:x86_64/intel/sapphirerapids
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc architecture:aarch64/nvidia/grace

@eessi-bot-aws

eessi-bot-aws bot commented Aug 21, 2025

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-intel-sapphirerapids for repository eessi.io-2025.06-software in job dir /project/def-users/SHARED/jobs/2025.08/pr_1159/84506

date | job status | comment
Aug 21 21:27:08 UTC 2025 | submitted | job id 84506 awaits release by job manager
Aug 21 21:27:32 UTC 2025 | released | job awaits launch by Slurm scheduler
Aug 21 21:32:36 UTC 2025 | running | job 84506 is running
Aug 22 00:17:13 UTC 2025 | finished |
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-84506.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2025.06-software-linux-x86_64-intel-sapphirerapids-17558215770.tar.gz
size: 1534 MiB (1608988725 bytes)
entries: 11733
modules under 2025.06/software/linux/x86_64/intel/sapphirerapids/modules/all
FFTW/3.3.10-GCC-14.2.0.lua
GCC/14.2.0.lua
GCCcore/14.2.0.lua
libevent/2.1.12-GCCcore-14.2.0.lua
libfabric/2.0.0-GCCcore-14.2.0.lua
libffi/3.4.5-GCCcore-14.2.0.lua
libidn2/2.3.7-GCCcore-14.2.0.lua
libxml2/2.13.4-GCCcore-14.2.0.lua
numactl/2.0.19-GCCcore-14.2.0.lua
OpenSSL/3.lua
Perl/5.38.0.lua
Perl/5.40.0-GCCcore-14.2.0.lua
pkgconf/1.8.0.lua
pkgconf/2.3.0-GCCcore-14.2.0.lua
SQLite/3.47.2-GCCcore-14.2.0.lua
Tcl/8.6.16-GCCcore-14.2.0.lua
UCX/1.18.0-GCCcore-14.2.0.lua
UnZip/6.0-GCCcore-14.2.0.lua
xorg-macros/1.20.2-GCCcore-14.2.0.lua
software under 2025.06/software/linux/x86_64/intel/sapphirerapids/software
FFTW/3.3.10-GCC-14.2.0
GCC/14.2.0
GCCcore/14.2.0
libevent/2.1.12-GCCcore-14.2.0
libfabric/2.0.0-GCCcore-14.2.0
libffi/3.4.5-GCCcore-14.2.0
libidn2/2.3.7-GCCcore-14.2.0
libxml2/2.13.4-GCCcore-14.2.0
numactl/2.0.19-GCCcore-14.2.0
OpenSSL/3
Perl/5.38.0
Perl/5.40.0-GCCcore-14.2.0
pkgconf/1.8.0
pkgconf/2.3.0-GCCcore-14.2.0
SQLite/3.47.2-GCCcore-14.2.0
Tcl/8.6.16-GCCcore-14.2.0
UCX/1.18.0-GCCcore-14.2.0
UnZip/6.0-GCCcore-14.2.0
xorg-macros/1.20.2-GCCcore-14.2.0
reprod directories under 2025.06/software/linux/x86_64/intel/sapphirerapids/reprod
FFTW/3.3.10-GCC-14.2.0/20250821_232525UTC
GCC/14.2.0/
GCC/14.2.0/20250821_091946UTC
GCC/14.2.0/20250821_225358UTC
GCCcore/14.2.0/
GCCcore/14.2.0/20250821_091943UTC
GCCcore/14.2.0/20250821_225355UTC
libevent/2.1.12-GCCcore-14.2.0/
libevent/2.1.12-GCCcore-14.2.0/20250822_000242UTC
libfabric/2.0.0-GCCcore-14.2.0/
libfabric/2.0.0-GCCcore-14.2.0/20250821_233434UTC
libffi/3.4.5-GCCcore-14.2.0/
libffi/3.4.5-GCCcore-14.2.0/20250821_234505UTC
libidn2/2.3.7-GCCcore-14.2.0/
libidn2/2.3.7-GCCcore-14.2.0/20250821_234436UTC
libxml2/2.13.4-GCCcore-14.2.0/
libxml2/2.13.4-GCCcore-14.2.0/20250821_233126UTC
numactl/2.0.19-GCCcore-14.2.0/
numactl/2.0.19-GCCcore-14.2.0/20250821_232558UTC
OpenSSL/3/
OpenSSL/3/20250821_235951UTC
Perl/5.38.0/
Perl/5.38.0/20250821_235138UTC
Perl/5.40.0-GCCcore-14.2.0/
Perl/5.40.0-GCCcore-14.2.0/20250821_234247UTC
pkgconf/1.8.0/
pkgconf/1.8.0/20250821_234307UTC
pkgconf/2.3.0-GCCcore-14.2.0/
pkgconf/2.3.0-GCCcore-14.2.0/20250821_225432UTC
SQLite/3.47.2-GCCcore-14.2.0/
SQLite/3.47.2-GCCcore-14.2.0/20250822_000114UTC
Tcl/8.6.16-GCCcore-14.2.0/
Tcl/8.6.16-GCCcore-14.2.0/20250821_235947UTC
UCX/1.18.0-GCCcore-14.2.0/
UCX/1.18.0-GCCcore-14.2.0/20250821_232950UTC
UnZip/6.0-GCCcore-14.2.0/
UnZip/6.0-GCCcore-14.2.0/20250821_233132UTC
xorg-macros/1.20.2-GCCcore-14.2.0/
xorg-macros/1.20.2-GCCcore-14.2.0/20250821_235147UTC
other under 2025.06/software/linux/x86_64/intel/sapphirerapids
no other files in tarball
Aug 22 00:17:13 UTC 2025 | test result |
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-84506.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
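
For readers unfamiliar with the bot's report: the checks listed under Details amount to searching the Slurm output file for a handful of fixed strings. The sketch below only illustrates that idea and is not the bot's actual code. The names `check_job_output` and `summarize` are hypothetical, and which strings must be absent or present is inferred from the check marks in the comment above.

```python
# Strings copied from the bot comment above. Whether each one should be
# absent or present for a successful build is an assumption inferred from
# the check marks, not taken from the bot's source code.
MUST_BE_ABSENT = ["FATAL:", "ERROR:", "FAILED:", "required modules missing:"]
MUST_BE_PRESENT = ["No missing installations", ".tar.gz created!"]


def check_job_output(text):
    """Return a dict mapping each string to True if its check passed."""
    results = {}
    for pattern in MUST_BE_ABSENT:
        results[pattern] = pattern not in text
    for pattern in MUST_BE_PRESENT:
        results[pattern] = pattern in text
    return results


def summarize(results):
    """Report SUCCESS only if every individual check passed."""
    return "SUCCESS" if all(results.values()) else "FAILURE"
```

Calling `summarize(check_job_output(open("slurm-84506.out").read()))` on the job output named above would be the typical use. A single failing check is enough to mark the whole job as a failure, which is consistent with the report showing FAILURE even though a tarball was created.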

@eessi-bot-jsc

eessi-bot-jsc bot commented Aug 21, 2025

New job on instance eessi-bot-jsc for CPU micro-architecture aarch64-nvidia-grace for repository eessi.io-2025.06-software in job dir /p/project1/ceasybuilders/eessibot/jobs/2025.08/pr_1159/14011774

date | job status | comment
Aug 21 21:27:09 UTC 2025 | submitted | job id 14011774 awaits release by job manager
Aug 21 21:27:21 UTC 2025 | released | job awaits launch by Slurm scheduler
Aug 21 21:28:31 UTC 2025 | running | job 14011774 is running
Aug 21 22:17:19 UTC 2025 | finished |
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-14011774.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2025.06-software-linux-aarch64-nvidia-grace-17558143740.tar.gz
size: 84 MiB (88917501 bytes)
entries: 9015
modules under 2025.06/software/linux/aarch64/nvidia/grace/modules/all
FFTW/3.3.10-GCC-14.2.0.lua
libevent/2.1.12-GCCcore-14.2.0.lua
libfabric/2.0.0-GCCcore-14.2.0.lua
libffi/3.4.5-GCCcore-14.2.0.lua
libidn2/2.3.7-GCCcore-14.2.0.lua
libxml2/2.13.4-GCCcore-14.2.0.lua
make/4.4.1-GCCcore-14.2.0.lua
numactl/2.0.19-GCCcore-14.2.0.lua
OpenSSL/3.lua
Perl/5.38.0.lua
Perl/5.40.0-GCCcore-14.2.0.lua
pkgconf/1.8.0.lua
pkgconf/2.3.0-GCCcore-14.2.0.lua
SQLite/3.47.2-GCCcore-14.2.0.lua
Tcl/8.6.16-GCCcore-14.2.0.lua
UCX/1.18.0-GCCcore-14.2.0.lua
UnZip/6.0-GCCcore-14.2.0.lua
xorg-macros/1.20.2-GCCcore-14.2.0.lua
software under 2025.06/software/linux/aarch64/nvidia/grace/software
FFTW/3.3.10-GCC-14.2.0
libevent/2.1.12-GCCcore-14.2.0
libfabric/2.0.0-GCCcore-14.2.0
libffi/3.4.5-GCCcore-14.2.0
libidn2/2.3.7-GCCcore-14.2.0
libxml2/2.13.4-GCCcore-14.2.0
make/4.4.1-GCCcore-14.2.0
numactl/2.0.19-GCCcore-14.2.0
OpenSSL/3
Perl/5.38.0
Perl/5.40.0-GCCcore-14.2.0
pkgconf/1.8.0
pkgconf/2.3.0-GCCcore-14.2.0
SQLite/3.47.2-GCCcore-14.2.0
Tcl/8.6.16-GCCcore-14.2.0
UCX/1.18.0-GCCcore-14.2.0
UnZip/6.0-GCCcore-14.2.0
xorg-macros/1.20.2-GCCcore-14.2.0
reprod directories under 2025.06/software/linux/aarch64/nvidia/grace/reprod
FFTW/3.3.10-GCC-14.2.0/20250821_213958UTC
libevent/2.1.12-GCCcore-14.2.0/
libevent/2.1.12-GCCcore-14.2.0/20250821_220313UTC
libfabric/2.0.0-GCCcore-14.2.0/
libfabric/2.0.0-GCCcore-14.2.0/20250821_214552UTC
libffi/3.4.5-GCCcore-14.2.0/
libffi/3.4.5-GCCcore-14.2.0/20250821_215104UTC
libidn2/2.3.7-GCCcore-14.2.0/
libidn2/2.3.7-GCCcore-14.2.0/20250821_215051UTC
libxml2/2.13.4-GCCcore-14.2.0/
libxml2/2.13.4-GCCcore-14.2.0/20250821_214452UTC
make/4.4.1-GCCcore-14.2.0/
make/4.4.1-GCCcore-14.2.0/20250821_215454UTC
numactl/2.0.19-GCCcore-14.2.0/
numactl/2.0.19-GCCcore-14.2.0/20250821_214039UTC
OpenSSL/3/
OpenSSL/3/20250821_220124UTC
Perl/5.38.0/
Perl/5.38.0/20250821_215432UTC
Perl/5.40.0-GCCcore-14.2.0/
Perl/5.40.0-GCCcore-14.2.0/20250821_214955UTC
pkgconf/1.8.0/
pkgconf/1.8.0/20250821_215011UTC
pkgconf/2.3.0-GCCcore-14.2.0/
pkgconf/2.3.0-GCCcore-14.2.0/20250821_212952UTC
SQLite/3.47.2-GCCcore-14.2.0/
SQLite/3.47.2-GCCcore-14.2.0/20250821_220234UTC
Tcl/8.6.16-GCCcore-14.2.0/
Tcl/8.6.16-GCCcore-14.2.0/20250821_220122UTC
UCX/1.18.0-GCCcore-14.2.0/
UCX/1.18.0-GCCcore-14.2.0/20250821_214358UTC
UnZip/6.0-GCCcore-14.2.0/
UnZip/6.0-GCCcore-14.2.0/20250821_214456UTC
xorg-macros/1.20.2-GCCcore-14.2.0/
xorg-macros/1.20.2-GCCcore-14.2.0/20250821_215459UTC
other under 2025.06/software/linux/aarch64/nvidia/grace
no other files in tarball
Aug 21 22:17:19 UTC 2025 | test result |
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-14011774.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@boegel
Contributor Author

boegel commented Aug 22, 2025

For both sapphirerapids and grace, build of Python/3.13.1-GCCcore-14.2.0 failed, will figure out what's wrong...

@boegel
Contributor Author

boegel commented Aug 22, 2025

Same problem in both builds: some tests that are being run (during the build step) are failing:

LD_LIBRARY_PATH=/tmp/eessibot/easybuild/build/Python/3.13.1/GCCcore-14.2.0/Python-3.13.1 ./python -m test --pgo --timeout=
...
0:00:13 load avg: 1.33 [16/44] test_embed
test test_embed failed
0:00:14 load avg: 1.33 [17/44] test_float -- test_embed failed (71 failures)
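
When the build log is long, the failing tests can be pulled out of the `python -m test` output mechanically. This is a minimal sketch, assuming the `test <name> failed` line format shown in the excerpt above. `failed_tests` is a hypothetical helper, not part of Python's test runner.

```python
import re

# Matches lines such as "test test_embed failed" at the start of a line.
FAILED_RE = re.compile(r"^test (\S+) failed", re.MULTILINE)


def failed_tests(output):
    """Return the sorted, de-duplicated names of tests reported as failed."""
    return sorted(set(FAILED_RE.findall(output)))
```

Applied to the three log lines quoted above, this returns only `test_embed`. The progress line for `test_float` merely repeats the earlier failure and is not counted.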

@boegel
Contributor Author

boegel commented Aug 26, 2025

If I bypass the problem with Python (which is a build dependency for BLIS), I also see trouble when building BLIS on A64FX:

Generated include/a64fx/blis.h
Compiling obj/a64fx/config/a64fx/bli_cntx_init_a64fx.o ('a64fx' CFLAGS for config code)
Compiling obj/a64fx/kernels/armsve/1m/bli_dpackm_armsve256_int_8xk.o ('a64fx' CFLAGS for kernels)
Compiling obj/a64fx/kernels/armsve/1m/bli_dpackm_armsve512_asm_10xk.o ('a64fx' CFLAGS for kernels)
Compiling obj/a64fx/kernels/armsve/1m/bli_dpackm_armsve512_asm_16xk.o ('a64fx' CFLAGS for kernels)
Compiling obj/a64fx/kernels/armsve/3/bli_armsve_utils.o ('a64fx' CFLAGS for kernels)
kernels/armsve/1m/bli_dpackm_armsve512_asm_16xk.c: In function ‘bli_dpackm_armsve512_asm_16xk’:
kernels/armsve/1m/bli_dpackm_armsve512_asm_16xk.c:70:15: error: assignment to ‘void *’ from ‘long unsigned int’ makes pointer from integer without a cast [-Wint-conversion]
   70 |             p = ( (uint64_t)0x1 << 56 ) | (uint64_t)p;
      |               ^
compilation terminated due to -Wfatal-errors.

@ocaisa
Member

ocaisa commented Aug 27, 2025

We need to add a hook to pass --enable-ipv6 when building OpenMPI (or make a PR upstream and use that)

@boegel
Contributor Author

boegel commented Aug 29, 2025

More details on the test_embed failure that occurs during the Python build, which I obtained by running the following command in the interactive debug shell that EasyBuild provides:

./python -m test --timeout= --match test_embed --verbose3

It's basically the same issue over and over again:

stderr:
/tmp/vsc40023/easybuild_build/Python/3.13.1/GCCcore-14.2.0/Python-3.13.1/Programs/_testembed:
symbol lookup error: /tmp/vsc40023/easybuild_build/Python/3.13.1/GCCcore-14.2.0/Python-3.13.1/Programs/_testembed: 
undefined symbol: __gcov_indirect_call

I found an ancient bug that mentions that combining --enable-optimizations and --enable-shared may be the culprit, but given its age, I'm not sure that's still relevant.

https://python-forum.io/thread-21472.html mentions that this could happen when a non-clean build directory is used, which shouldn't be the case here (unless there's a bug in the way ./python -m test --pgo is being launched).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97722 comes to a similar conclusion, and suggests it may be due to mixing of compiler versions being used...

After a bit of digging, it seems like the problem occurs because -lgcov isn't being linked in when the _testembed binary gets built. That seems to somehow be connected to $LIBS being set to -lm -lpthread in the build environment.

When I run unset LIBS (and remove the existing Programs/_testembed) before re-running make -j 16 in the interactive debug shell, the build proceeds, since the test_embed run triggered by ./python -m test --pgo then passes...

I'm not sure why this problem only pops up when building Python on top of EESSI.
Maybe it has something to do with only having the libgcov.a static library in the 2025.06 compat layer:

$ find /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64 -name 'libgcov.*' 2> /dev/null
/cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/usr/lib/gcc/x86_64-pc-linux-gnu/13/libgcov.a

That's also the case for EESSI 2023.06 though:

$ find /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64 -name 'libgcov.*' 2> /dev/null
/cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/usr/lib/gcc/x86_64-pc-linux-gnu/10/libgcov.a
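
One way to check whether a freshly built Programs/_testembed is likely affected, without re-running the test suite, is to look for undefined `__gcov_*` entries in its dynamic symbol table. The following is a rough sketch, assuming binutils `nm` is available in the build environment. `unresolved_gcov_symbols` and `check_binary` are hypothetical helper names. An undefined entry only means the symbol has to be provided by some shared library at run time, so a hit is a hint rather than proof.

```python
import subprocess


def unresolved_gcov_symbols(nm_output):
    """Return undefined __gcov_* symbols found in `nm` output text."""
    symbols = []
    for line in nm_output.splitlines():
        fields = line.split()
        # Undefined symbols are listed as "U <name>", without an address.
        if len(fields) == 2 and fields[0] == "U":
            name = fields[1].split("@")[0]  # drop any symbol version suffix
            if name.startswith("__gcov_"):
                symbols.append(name)
    return symbols


def check_binary(path):
    """Run nm on the dynamic symbol table of `path` and filter the result."""
    result = subprocess.run(
        ["nm", "-D", "--undefined-only", path],
        capture_output=True, text=True, check=True,
    )
    return unresolved_gcov_symbols(result.stdout)
```

Running `check_binary("Programs/_testembed")` before and after removing and rebuilding the binary would show whether the relink changes how the gcov symbols are resolved.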

@boegel
Contributor Author

boegel commented Aug 29, 2025

However, using this in the Python easyconfig does not actually fix the problem, so maybe the problem isn't that $LIBS is set after all...

prebuildopts = "unset LIBS && "

Indeed, also when trying via the interactive debug shell, just the act of removing Programs/_testembed and then re-running make fixes the problem (even without unsetting $LIBS)...

@boegel
Contributor Author

boegel commented Aug 30, 2025

One more data point: the problem doesn't occur when I'm using eb --disable-rpath on top of EESSI-extend (without using the build container), but does happen when RPATH linking is left enabled.

That suggests it may be a bug of some kind in the EasyBuild RPATH wrappers?

@ocaisa
Member

ocaisa commented Sep 24, 2025

@boegel I wonder if there is a chance this might be fixed with rpath wrapper updates that will come with EB 5.1.2?

@boegel
Contributor Author

boegel commented Sep 24, 2025

> @boegel I wonder if there is a chance this might be fixed with rpath wrapper updates that will come with EB 5.1.2?

I did a quick test using the current develop branch of the EasyBuild framework, and that didn't help.

I'll try to dig a bit more in the coming days. I'll also open a dedicated issue on this, since there's probably some kind of lesson to learn from this one...
