A user has reported that saslauthd is crashing fairly regularly, and has provided a core file.
The stack from that core shows that we're dying with a SEGV at offset 0x50 in mdb_env_share_locks:
> ::status
debugging core file of saslauthd (64-bit) from mail
file: /opt/ooce/sbin/saslauthd
initial argv: /opt/ooce/sbin/saslauthd -a sasldb -c -m /var/run/saslauthd
threading model: native threads
status: process terminated by SIGSEGV (Segmentation Fault), addr=20
> $C
fffffc7feef30990 liblmdb.so`mdb_env_share_locks+0x50()
fffffc7feef30a00 liblmdb.so`mdb_env_open+0x306()
fffffc7feef30a90 do_open+0x23f()
fffffc7feef30b40 _sasldb_getdata+0x10d()
fffffc7feef31230 auth_sasldb+0xd1()
fffffc7feef314f0 do_auth+0x7d()
fffffc7feef31d80 do_request+0x2d2()
0000000000000000 libc.so.1`__door_return+0x50()
The SEGV is at address 0x20, and a low address like that usually indicates a NULL pointer dereference, where we're attempting to look at a member of a struct at that offset.
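To make the arithmetic concrete, here is a minimal sketch of why a NULL struct pointer faults at a small non-zero address. The `example_info` struct and the `member_address` helper are hypothetical and exist only for illustration; the layout is not taken from LMDB's headers, and the field sizes were chosen so that one member happens to land at offset 0x20.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only; NOT LMDB's real structure. */
struct example_info {
	uint32_t magic;     /* offset 0x00 */
	uint32_t format;    /* offset 0x04 */
	uint8_t  lock[24];  /* offset 0x08, ends at 0x1f */
	uint64_t txnid;     /* offset 0x20 */
};

/*
 * Address the CPU would touch when accessing base->txnid:
 * the pointer value plus the member's offset within the struct.
 */
uintptr_t
member_address(uintptr_t base)
{
	return base + offsetof(struct example_info, txnid);
}
```

With a base of 0, `member_address` returns 0x20, which is the kind of fault address reported in the `::status` output above.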
Unfortunately this binary wasn't compiled with all debugging features. It was built with clang, which is a bit less useful from a debugging perspective than the illumos-patched gcc, but let's see what we can get.
If we look at the disassembly of mdb_env_share_locks up to the address where we crashed:
We're looking 0x20 bytes into whatever is in %rcx. Is that NULL?
> <rcx=J
0
Yep. Let's try to work out which bit of source corresponds to this. It's early in the function, which is nice:
/** Downgrade the exclusive lock on the region back to shared */
static int ESECT
mdb_env_share_locks(MDB_env *env, int *excl)
{
	int rc = 0;
	MDB_meta *meta = mdb_env_pick_meta(env);

	env->me_txns->mti_txnid = meta->mm_txnid;
That dereference of me_txns looks likely. Is that NULL? Even without more debugging information available, we have died early enough in the function that the first argument is still in %rdi.
And yes, me_txns is indeed NULL. The rest of the data, particularly me_path, looks OK, so we're likely looking at the right address.
What I don't know is why, or how this could be an intermittent problem. Several other places in the code do check that me_txns is not NULL before dereferencing it, but not here. That could mean that it should never be NULL at this point and there is a bug elsewhere, or it could be a missing check. Without a more complete understanding of the library it's hard to determine.
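For reference, this is roughly what the "missing check" possibility would look like. It is a sketch on simplified stand-in types, not a proposed LMDB patch: `struct env`, `struct txninfo`, `struct meta` and `share_locks_checked` are all hypothetical names, and returning EINVAL is an assumption about how such a case might be reported. If me_txns is never supposed to be NULL here, a guard like this would only hide the real bug.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins for illustration; NOT the real LMDB types. */
struct txninfo { unsigned long txnid; };
struct meta    { unsigned long txnid; };
struct env     { struct txninfo *txns; };

/*
 * Copy the transaction ID from the meta page into the shared lock
 * region, refusing to proceed if the lock region pointer is NULL
 * instead of dereferencing it.
 */
int
share_locks_checked(struct env *env, const struct meta *meta)
{
	if (env == NULL || meta == NULL || env->txns == NULL)
		return EINVAL;	/* assumed error code, for illustration */

	env->txns->txnid = meta->txnid;
	return 0;
}
```

Called with an environment whose `txns` pointer is NULL, this returns EINVAL where the unguarded version would take the SEGV.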
As for next steps here, I would suggest looking to see if there have been changes/fixes/bugs reported against lmdb in this area, if there is a new version that might have a fix, and otherwise reporting a bug against the lmdb project. It is possible to do more investigation here with dtrace and other tools to help build up a picture of how things work around this, but it's time-consuming and can be a bit of a learning curve when you jump into unfamiliar code like this.
We'll get there if necessary, but let's see if there's an available fix first.