Test script using mpi #49

Closed
suvenduat opened this issue Dec 10, 2014 · 19 comments

@suvenduat

Hi Johannes,

I am having a problem running pymultinest with MPI. Without MPI, everything is perfect.
I get the same error as in issue #6, using the same demo script.
I think MPI is correctly installed on the server.
I cannot work out what modification I should make to run it properly.
Could you please help me solve this problem? Thank you.

Regards,
Suvendu Rakshit

@JohannesBuchner
Owner

Can you repeat what your exact error is, and which MultiNest version you use?

@suvenduat
Author

I have MultiNest v3.6

I got many memory management errors:

*** glibc detected *** /usr/bin/python2.7: free(): invalid next size (normal): 0x00000000030fd660 ***
*** glibc detected *** /usr/bin/python2.7: free(): invalid next size (normal): 0x0000000003baf660 ***
======= Backtrace: =========
======= Backtrace: =========
/lib64/libc.so.6[0x3c6a676166]
/lib64/libc.so.6[0x3c6a676166]
/lib64/libc.so.6[0x3c6a678ca3]
/softs/intel/composer_xe_2015.0.090/compiler/lib/intel64/libifcore.so.5(for_deallocate+0xc7)[0x2ac1e3dff967]
/softs/multinest-v3.6/lib/libmultinest.so(posterior_mp_pos_samp_+0x7ea7)[0x2ac1e2e929c7]
/softs/multinest-v3.6/lib/libmultinest.so(nested_mp_clusterednest_+0x1f3f3)[0x2ac1e2e47153]
/softs/multinest-v3.6/lib/libmultinest.so(nested_mp_nestsample_+0x8a1)[0x2ac1e2e27721]
/softs/multinest-v3.6/lib/libmultinest.so(nested_mp_nestrun_+0xd72)[0x2ac1e2e26ae2]
/softs/multinest-v3.6/lib/libmultinest.so(run+0x2ce)[0x2ac1e2e253ce]
/softs/python2.7/lib/python2.7/lib-dynload/_ctypes.so(ffi_call_unix64+0x4c)[0x2ac1e13eaf0c]
/softs/python2.7/lib/python2.7/lib-dynload/_ctypes.so(ffi_call+0x678)[0x2ac1e13e85f8]
/softs/python2.7/lib/python2.7/lib-dynload/ctypes.so(ctypes_callproc+0x2aa)[0x2ac1e13dceba]
/softs/python2.7/lib/python2.7/lib-dynload/ctypes.so(+0xe49e)[0x2ac1e13d749e]
/usr/bin/python2.7(PyObject_Call+0x5d)[0x42451d]
/usr/bin/python2.7[0x515fae]
/usr/bin/python2.7(PyEval_EvalFrameEx+0xab3)[0x50f393]
/lib64/libc.so.6[0x3c6a678ca3]
/softs/intel/composer_xe_2015.0.090/compiler/lib/intel64/libifcore.so.5(for_deallocate+0xc7)[0x2ae7b50b3967]
/usr/bin/python2.7(PyEval_EvalCodeEx+0x765)[0x5171f5]
/softs/multinest-v3.6/lib/libmultinest.so(posterior_mp_pos_samp_+0x7ea7)[0x2ae7b41469c7]
/usr/bin/python2.7[0x5161e5]
/softs/multinest-v3.6/lib/libmultinest.so(nested_mp_clusterednest_+0x1f3f3)[0x2ae7b40fb153]
/usr/bin/python2.7(PyEval_EvalFrameEx+0xab3)[0x50f393]
/softs/multinest-v3.6/lib/libmultinest.so(nested_mp_nestsample_+0x8a1)[0x2ae7b40db721]
/usr/bin/python2.7(PyEval_EvalCode+0x38d)[0x50e5cd]
/usr/bin/python2.7(PyRun_SimpleFileExFlags+0x2b0)[0x55a750]
/usr/bin/python2.7(PyRun_AnyFileExFlags+0xc3)[0x559e03]
/usr/bin/python2.7(Py_Main+0xa56)[0x415606]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3c6a61ed1d]
/usr/bin/python2.7[0x414949]
======= Memory map: ========
00400000-006ae000 r-xp 00000000 00:14 6291525 /softs/python2.7/bin/python2.7
008ad000-008f3000 rw-p 002ad000 00:14 6291525 /softs/python2.7/bin/python2.7
008f3000-00902000 rw-p 00000000 00:00 0
01807000-03167000 rw-p 00000000 00:00 0 [heap]
344ba00000-344bbb9000 r-xp 00000000 08:02 658004 /usr/lib64/libcrypto.so.1.0.1e
344bbb9000-344bdb8000 ---p 001b9000 08:02 658004 /usr/lib64/libcrypto.so.1.0.1e
344bdb8000-344bdd3000 r--p 001b8000 08:02 658004 /usr/lib64/libcrypto.so.1.0.1e
344bdd3000-344bddf000 rw-p 001d3000 08:02 658004 /usr/lib64/libcrypto.so.1.0.1e
344bddf000-344bde3000 rw-p 00000000 00:00 0
344be00000-344be62000 r-xp 00000000 08:02 692413 /usr/lib64/libssl.so.1.0.1e
344be62000-344c061000 ---p 00062000 08:02 692413 /usr/lib64/libssl.so.1.0.1e
344c061000-344c065000 r--p 00061000 08:02 692413 /usr/lib64/libssl.so.1.0.1e
344c065000-344c06c000 rw-p 00065000 08:02 692413 /usr/lib64/libssl.so.1.0.1e
.............
.............
2ae7c0e73000-2ae7c1073000 ---p 000dd000 00:14 6298264 /softs/python2.7/lib/python2.7/site-packages/matplotlib/backends/_backend_agg.so
2ae7c1073000-2ae7c1077000 rw-p 000dd000 00:14 6298264 /softs/python2.7/lib/python2.7/site-packages/matplotlib/backends/_backend_agg.so
2ae7c1077000-2ae7c1079000 rw-p 00000000 00:00 0
7fffb2e30000-7fffb2e50000 rwxp 00000000 00:00 0 [stack]
7fffb2e50000-7fffb2e54000 rw-p 00000000 00:00 0
7fffb2f3c000-7fffb2f3d000 r-xp 00000000 00:00 0 [vdso]

ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]

@suvenduat
Author

What do you think about this problem? I also got this message: APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)

@JohannesBuchner
Owner

Please make sure you have installed mpi4py.
How do you run python? With mpiexec/mpirun?
You can also try setting init_MPI to False in pymultinest.run.

@suvenduat
Author

I am using mpiexec. With 1 core the script works fine, but with more than 1 core it terminates. Using mpiexec I run other programs that are unrelated to MultiNest, and they work fine, so I don't think it is a problem with the MPI installation. Just for clarification, here is my ".oar" file for submitting the job:


#!/bin/bash
#OAR -l/core=2,walltime=24
source /softs/env_default.sh
mpiexec.hydra -machinefile $OAR_FILE_NODES -bootstrap ssh -bootstrap-exec /usr/bin/oarsh -envall /usr/bin/python2.7 /home/rakshit/run.py


@JohannesBuchner
Owner

It is still not clear to me whether the correct multinest library is loaded (mpi version vs non-mpi version).

see first few lines of https://github.com/JohannesBuchner/PyMultiNest/blob/master/pymultinest/run.py .

The meaning of the error depends on

  • which libmultinest.so is really loaded -- cmake creates two library flavours.
  • whether python or fortran, or both initialise MPI (init_MPI parameter)

You can find out which library is loaded with lsof -p . I suspect you are not using the MPI version of the multinest library.
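
If the process dies too quickly to inspect with lsof, one rough, Linux-only alternative (an illustrative sketch, not part of PyMultiNest) is to print the multinest shared objects mapped into the Python process itself:

    # Linux-only sketch: show which libmultinest the interpreter actually mapped
    import pymultinest  # importing pymultinest loads libmultinest via ctypes

    with open('/proc/self/maps') as maps:
        loaded = {line.split()[-1] for line in maps if 'multinest' in line}
    for path in sorted(loaded):
        print(path)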

@suvenduat
Author

The problem is that the program terminates very quickly, so I can't get any information from lsof -p.
This software version is installed at the central computing centre, so we will try to reinstall it and I will let you know soon. I will also try to install it on my Mac.


@suvenduat
Author

I reinstalled MultiNest and PyMultiNest on my 16 GB MacBook Pro. Without MPI it was perfect, as before, but with the command "mpirun -np 2 python demo.py" I got the following errors, which are exactly the same as in #45, where no clear solution is given:
Python(88238,0x7fff72fa2310) malloc: *** error for object 0x10b83f008: incorrect checksum for freed object - object was probably modified after being freed.

*** set a breakpoint in malloc_error_break to debug


mpirun noticed that process rank 1 with PID 88238 on node dhcp2-139 exited on signal 6 (Abort trap: 6).

@JohannesBuchner
Owner

It is still not clear to me whether the correct multinest library is loaded (mpi version vs non-mpi version).

see first few lines of https://github.com/JohannesBuchner/PyMultiNest/blob/master/pymultinest/run.py .

The meaning of the error depends on

  • which libmultinest.so is really loaded -- cmake creates two library flavours.
  • whether python or fortran, or both initialise MPI (init_MPI parameter)

You can find out which library is loaded with lsof -p . I suspect you are not using the MPI version of the multinest library.


You can also use ldd or strace.

@jtlz2

jtlz2 commented Dec 18, 2014

@suvenduat Are you using mpi4py? You probably need to. My skeleton pymultinest runs go something like this:

    from mpi4py import MPI # mpi4py does the init
    import pymultinest

Then supply init_MPI=False to pymultinest.run()

Invoke as:

mpiexec -np NUMPROCS python code.py

At one point I had different behaviour with the supposedly synonymous mpiexec and mpirun.

As @JohannesBuchner says, make sure you're using the MPI version of libmultinest.so - can you confirm?

Let me know if you need anything more. It was painful to figure out, but I now routinely have pymultinest running across two or more 48-core nodes.
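
For anyone following along, a self-contained skeleton along those lines might look roughly like the following; the file name, parameter count, and toy prior/likelihood are illustrative placeholders, not anyone's actual script:

    # skeleton_mpi.py -- hypothetical example; invoke with: mpiexec -np NUMPROCS python skeleton_mpi.py
    from mpi4py import MPI  # mpi4py performs the MPI initialisation
    import pymultinest

    def prior(cube, ndim, nparams):
        # map the unit cube onto the illustrative parameter range [-1, 1)
        for i in range(ndim):
            cube[i] = cube[i] * 2.0 - 1.0

    def loglike(cube, ndim, nparams):
        # toy Gaussian log-likelihood centred at the origin
        return -0.5 * sum(cube[i] ** 2 for i in range(ndim))

    pymultinest.run(loglike, prior, 3,
                    outputfiles_basename='skeleton_',
                    init_MPI=False,  # MPI has already been initialised by mpi4py
                    resume=False, verbose=True)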

@suvenduat
Author

@jtlz2 Great: pymultinest is working perfectly with mpiexec on my Mac after following your suggestion (it's clearly explained) and @JohannesBuchner's suggestion.
Thanks a lot. I had been struggling to get it working.
However, I still have a problem running it on our cluster. Using the same script and command as on my Mac, mpiexec -np 2 demo.py, I am getting these errors on the cluster:
mpiexec_gurney: cannot connect to local mpd (/tmp/mpd2.console_rakshit); possible causes:

  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)

I should check which mpiexec this is and why it happens. Any ideas would be helpful.
Thanks again.

@JohannesBuchner
Owner

The line

 from mpi4py import MPI

should not be needed, because it is already inside pymultinest.
I guess I should set init_MPI to False automatically whenever mpi4py has been loaded (or always?).
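
One way that could be done (a rough sketch of the idea only, not the actual run.py code) is to key the default off whether mpi4py has already been imported:

    # Sketch of the idea -- not the actual pymultinest implementation.
    import sys

    def default_init_MPI():
        # if mpi4py is already imported, it initialises MPI itself,
        # so the Fortran side should not call MPI_Init again
        return 'mpi4py' not in sys.modules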

@suvenduat
Author

I think you should always set init_MPI=False automatically.

@suvenduat
Author

It is running successfully on the cluster now. Thank you.

@JohannesBuchner
Owner

If you can leave any advice for the next person trying to get pymultinest running with MPI, it would be highly appreciated.

@suvenduat
Author

You must install mpi4py to run with MPI. Supply init_MPI=False to pymultinest.run(), then run it using the following command:
mpiexec -np NPROCS python code.py
On the cluster, I used mpiexec.hydra and submitted my job using up to 20 nodes.

@JohannesBuchner
Owner

Great, thank you.

In the next version I will add an MPI section to the manual, and change the code to set init_MPI to False by default.

@JohannesBuchner
Owner

Dear Suvendu,
I just released an update to PyMultiNest, which includes a fix to this problem. Could you upgrade and test it?
Thank you, Johannes

@suvenduat
Author

Dear Johannes,
I just installed and tested. It's working perfectly.

Cheers,
Suv

