Client mount triggers Argobots assert with latest develop branch and default Argobots #657

Closed
CamStan opened this issue Aug 2, 2021 · 3 comments

CamStan commented Aug 2, 2021

System information

| Type | Version/Name |
| --- | --- |
| Operating System | RHEL |
| OS Version | 7 |
| Architecture | power9 & x86_64 |
| UnifyFS Version | develop |

Describe the problem you're observing

On LC's power9 and x86_64 systems, starting at commit 60f17a7, we either introduced a new issue or uncovered an existing one in how we're using Argobots.

The UnifyFS server starts fine, but upon running a client application, it consistently hangs while trying to mount, somewhere inside of invoke_client_attach_rpc().

The following message shows up in the unifyfsd.err log:

unifyfsd: ../src/include/abti_mem_pool.h:123: ABTI_mem_pool_alloc: Assertion `num_headers_in_cur_bucket >= 1' failed.

Potentially Related Issue

This is potentially related to pmodels/argobots#333 (we do initialize Argobots outside of Mochi-Margo), but initial attempts of recommended solutions therein still result in the same assert.


Describe how to reproduce the problem

  1. Build our mochi-margo dependency, ensuring it depends on [email protected] or argobots@main. E.g.,
     spack install mochi-margo ^libfabric fabrics=rxm,sockets,tcp ^argobots@main
  2. On a single-node allocation, start the UnifyFS server.
  3. Run the write or writeread example program with a single process.

Include any warnings, errors, or relevant debugging data

unifyfsd.err log contains:

unifyfsd: ../src/include/abti_mem_pool.h:123: ABTI_mem_pool_alloc: Assertion `num_headers_in_cur_bucket >= 1' failed.

The server log, unifyfsd.log appears normal up until it hangs:

2021-07-21T13:49:38 tid=102370 @ main() [unifyfs_server.c:411] server[0] - finished initialization
2021-07-21T13:50:02 tid=102372 @ unifyfs_mount_rpc() [unifyfs_client_rpc.c:142] creating new application for app_id=1511587981
2021-07-21T13:50:02 tid=102372 @ unifyfs_mount_rpc() [unifyfs_client_rpc.c:152] creating new app client for na+sm://102643/0
2021-07-21T13:50:02 tid=102372 @ unifyfs_mount_rpc() [unifyfs_client_rpc.c:163] created new application client 1511587981:1
2021-07-21T13:50:02 tid=102648 @ request_manager_thread() [unifyfs_request_manager.c:1290] I am request manager [app=1511587981:client=1] thread!
2021-07-21T13:50:02 tid=102372 @ create_mountpoint_dir() [unifyfs_client_rpc.c:75] creating global file metadata for mountpoint:
2021-07-21T13:50:02 tid=102372 @ debug_print_file_attr() [unifyfs_meta.h:116] fileattr(0x200002f4e298) - gfid=1511587981 filename=/unifyfs
2021-07-21T13:50:02 tid=102372 @ debug_print_file_attr() [unifyfs_meta.h:118]              - sz=0 mode=40755 uid=26372 gid=26372
2021-07-21T13:50:02 tid=102372 @ debug_print_file_attr() [unifyfs_meta.h:120]              - shared=1 laminated=0
2021-07-21T13:50:02 tid=102372 @ debug_print_file_attr() [unifyfs_meta.h:124]              - atime=1626900602.051364799 ctime=1626900602.051364799 mtime=1626900602.051364799
2021-07-21T13:50:02 tid=102372 @ signal_new_requests() [unifyfs_request_manager.c:251] signaling new requests
2021-07-21T13:50:02 tid=102648 @ request_manager_thread() [unifyfs_request_manager.c:1335] RM[1511587981:1] got work
2021-07-21T13:50:02 tid=102648 @ rm_process_client_requests() [unifyfs_request_manager.c:1210] processing 1 client requests
2021-07-21T13:50:02 tid=102648 @ process_metaset_rpc() [unifyfs_request_manager.c:1048] setting metadata for gfid=1511587981

The client log (writeread-gotcha in this case) appears normal until it hangs as well:

2021-07-21T13:50:02 tid=102643 @ unifyfs_mount() [unifyfs.c:2047] calling attach rpc
2021-07-21T13:50:02 tid=102643 @ invoke_client_attach_rpc() [margo_client.c:217] invoking the attach rpc function in client

Current Workaround

Building Argobots with stackguard=mprotect or stackguard=mprotect-strict (both require argobots@main) has so far prevented the assert from happening.

spack install mochi-margo ^libfabric fabrics=rxm,sockets,tcp ^argobots@main stackguard=mprotect

CamStan changed the title from "Client mount triggers Argobots assert with latest develop branch default Argobots" to "Client mount triggers Argobots assert with latest develop branch and default Argobots" on Aug 2, 2021

CamStan commented Aug 3, 2021

Not sure what I did when first trying the fixes in pmodels/argobots#333, but it appears setting ABT_THREAD_STACKSIZE does fix this.

#659 will likely resolve this. Will test more once merged.
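
A minimal sketch of the environment-variable approach mentioned above, assuming the variable is exported in the shell that launches the server and client. Argobots reads ABT_THREAD_STACKSIZE (in bytes) when it initializes. The 2 MiB value below is an illustrative assumption, not a number taken from this issue or from #659.

```shell
# Keep any value the caller already set; otherwise fall back to 2 MiB.
# 2097152 is an illustrative assumption, not a value confirmed in this thread.
export ABT_THREAD_STACKSIZE="${ABT_THREAD_STACKSIZE:-2097152}"

# Show what the server and client processes will inherit.
echo "ABT_THREAD_STACKSIZE=${ABT_THREAD_STACKSIZE}"

# Then start the UnifyFS server and the example program from this same shell
# (launch commands omitted; they depend on the site and job launcher).
```

The variable has to be in place before Argobots initializes, so exporting it after the server is already running will not change the stack size of that server.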

@MichaelBrim

I finally hit this during some testing on Summit (not every time, but once in a while). The fix from #659 resolved the issue for me.


CamStan commented Aug 9, 2021

Resolved by #659

CamStan closed this as completed on Aug 9, 2021