saslauthd crashing in lmdb #1401

citrus-it · 2024-02-11T11:00:48Z

A user has reported that saslauthd is crashing fairly regularly, and has provided a core file.

The stack from that core shows that we're dying with a SEGV at off set 0x50 in mdb_env_share_locks:

> ::status
debugging core file of saslauthd (64-bit) from mail
file: /opt/ooce/sbin/saslauthd
initial argv: /opt/ooce/sbin/saslauthd -a sasldb -c -m /var/run/saslauthd
threading model: native threads
status: process terminated by SIGSEGV (Segmentation Fault), addr=20

> $C
fffffc7feef30990 liblmdb.so`mdb_env_share_locks+0x50()
fffffc7feef30a00 liblmdb.so`mdb_env_open+0x306()
fffffc7feef30a90 do_open+0x23f()
fffffc7feef30b40 _sasldb_getdata+0x10d()
fffffc7feef31230 auth_sasldb+0xd1()
fffffc7feef314f0 do_auth+0x7d()
fffffc7feef31d80 do_request+0x2d2()
0000000000000000 libc.so.1`__door_return+0x50()

The SEGV is at address 0x20, and a low address like that usually indicates a NULL pointer dereference, where we're attempting to look at a member of a struct at that offset.

Unfortunately this binary wasn't compiled with all debugging features - it was actually built with clang which is a bit less useful from a debugging perspective than if it was built with the illumos-patched gcc, but let's see what we can get.

If we look at the disassembly of mdb_env_share_locks up to the address where we crashed:

> liblmdb.so`mdb_env_share_locks::dis
liblmdb.so`mdb_env_share_locks: pushq  %rbp
liblmdb.so`mdb_env_share_locks+1:       movq   %rsp,%rbp
liblmdb.so`mdb_env_share_locks+4:       pushq  %r15
liblmdb.so`mdb_env_share_locks+6:       pushq  %r14
liblmdb.so`mdb_env_share_locks+8:       pushq  %r13
liblmdb.so`mdb_env_share_locks+0xa:     pushq  %r12
liblmdb.so`mdb_env_share_locks+0xc:     pushq  %rbx
liblmdb.so`mdb_env_share_locks+0xd:     subq   $0x48,%rsp
liblmdb.so`mdb_env_share_locks+0x11:    movq   %rdi,%r14
liblmdb.so`mdb_env_share_locks+0x14:    movq   +0x12e63(%rip),%rax      <0xfffffc7fee0168b0>
liblmdb.so`mdb_env_share_locks+0x1b:    movq   (%rax),%rax
liblmdb.so`mdb_env_share_locks+0x1e:    movq   %rax,-0x30(%rbp)
liblmdb.so`mdb_env_share_locks+0x22:    movq   0x48(%rdi),%rax
liblmdb.so`mdb_env_share_locks+0x26:    movq   0x80(%rax),%rax
liblmdb.so`mdb_env_share_locks+0x2d:    movq   0x50(%rdi),%rcx
liblmdb.so`mdb_env_share_locks+0x31:    xorl   %edx,%edx
liblmdb.so`mdb_env_share_locks+0x33:    cmpq   0x80(%rcx),%rax
liblmdb.so`mdb_env_share_locks+0x3a:    movq   %rsi,%rbx
liblmdb.so`mdb_env_share_locks+0x3d:    setb   %dl
liblmdb.so`mdb_env_share_locks+0x40:    movq   0x48(%rdi,%rdx,8),%rax
liblmdb.so`mdb_env_share_locks+0x45:    movq   0x80(%rax),%rax
liblmdb.so`mdb_env_share_locks+0x4c:    movq   0x40(%rdi),%rcx
liblmdb.so`mdb_env_share_locks+0x50:    movq   %rax,0x20(%rcx)

We're looking 20 (hex) bytes into whatever is in %rcx, is that NULL?

> <rcx=J
                0

Yep, let's try and work out which bit of source corresponds to this. It's early in the function, which is nice:

/** Downgrade the exclusive lock on the region back to shared */
static int ESECT
mdb_env_share_locks(MDB_env *env, int *excl)                                    {                                                                                       int rc = 0;
        MDB_meta *meta = mdb_env_pick_meta(env);

        env->me_txns->mti_txnid = meta->mm_txnid;

That dereference of me_txns looks likely. Is that NULL? Even without more debugging information available, we have died early enough in the function that the first argument is still in %rdi.

> <rdi::print MDB_env
{
    me_fd = 0x6
    me_lfd = 0xffffffff
    me_mfd = 0xffffffff
    me_flags = 0x30024000
    me_psize = 0x1000
    me_os_psize = 0x1000
    me_maxreaders = 0x7e
    me_close_readers = 0
    me_numdbs = 0x2
    me_maxdbs = 0x2
    me_pid = 0x876
    me_path = 0x11d40c0 "/var/opt/ooce/sasl2/sasldb2"
    me_map = 0xfffffc7fee159000
    me_txns = 0
    me_metas = [ 0xfffffc7fee159010, 0xfffffc7fee15a010 ]
    me_pbuf = 0
    me_txn = 0
    me_txn0 = 0
    me_mapsize = 0x100000
    me_size = 0
    me_maxpg = 0x100
    me_dbxs = 0x11f6510
    me_dbflags = 0x11d3040
    me_dbiseqs = 0x11d3060
    me_txkey = 0x3
    me_pgoldest = 0
    me_pgstate = {
        mf_pghead = 0
        mf_pglast = 0
    }
    me_dpages = 0
    me_free_pgs = 0
    me_dirty_list = 0
    me_maxfree_1pg = 0x1fd
    me_nodemax = 0x7f6
    me_live_reader = 0
    me_userctx = 0
    me_assert_func = 0
}

and yes, me_txns is indeed NULL (and the rest of the data, particularly me_path, looks ok, so we're likely looking at the right address)

What I don't know is why, and how this could be an intermittent problem. Several other places in the code do check that me_txns is not NULL before de-referencing it, but not here. That could mean that it should never be NULL at this point and there is a bug elsewhere, or it could be a missing check - without a more complete understanding of the library it's hard to determine.

As for next steps here, I would suggest looking to see if there have been changes/fixes/bugs reported against lmdb in this area, if there is a new version that might have a fix, and otherwise reporting a bug against the lmdb project. It is possible to do more investigation here with dtrace and other tools to help build up a picture of how things work around this, but it's time consuming and can be a bit of a learning curve when you jump into unfamiliar code like this.

We'll get there if necessary, but let's see if there's an available fix first.

The text was updated successfully, but these errors were encountered:

citrus-it · 2024-02-11T11:07:44Z

This looks like it could well be https://bugs.openldap.org/show_bug.cgi?id=10024

citrus-it added the bug label Feb 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

saslauthd crashing in lmdb #1401

saslauthd crashing in lmdb #1401

citrus-it commented Feb 11, 2024

citrus-it commented Feb 11, 2024

saslauthd crashing in lmdb #1401

saslauthd crashing in lmdb #1401

Comments

citrus-it commented Feb 11, 2024

citrus-it commented Feb 11, 2024