crash on exit in debug mode #17876

zisoft · 2024-11-22T12:24:18Z

Describe the bug

When built with Debug mode, dt crashes reproducible on exit in line 195 of src/common/dtpthread.h:

Lines 193 to 198 in e079290

    
             // need to unlock last, to shield our internal data. 
        
             const int ret = pthread_mutex_unlock(&(mutex->mutex)); 
        
             assert(!ret); 
        
             return ret; 
        
           }

Steps to reproduce

build darktable with Debug mode
clean config
start dt, do nothing, quit
crash

Expected behavior

No response

Logfile | Screenshot | Screencast

Assertion failed: (!ret), functiAssertion failed: (!ret), functiAssertion failed: (!ret), function dt_pthread_mutex_unlock_with_Assertion failed: (!ret), function dt_pthread_mutex_unlock_with_caller, file dtpthread.h, line 1caller, file dtpthread.h, line 1Assertion failed: (!ret), functi95.
on dt_pthread_mutex_unlock_with_Assertion failed: (!ret), functi95.
on dt_pthread_mutex_unlock_with_caller, file dtpthread.h, line 1Assertion failed: (!ret), function dt_pthread_mutex_unlock_with_95.
caller, file dtpthread.h, line 1on dt_pthread_mutex_unlock_with_Assertion failed: (!ret), functicaller, file dtpthread.h, line 195.
95.
on dt_pthread_mutex_unlock_with_caller, file dtpthread.h, line 1caller, file dtpthread.h, line 195.
on dt_pthread_mutex_unlock_with_caller, file dtpthread.h, line 195.

Commit

No response

Where did you obtain darktable from?

self compiled

darktable version

current master

What OS are you using?

Mac

What is the version of your OS?

macOS 15.1.1

Describe your system?

No response

Are you using OpenCL GPU in darktable?

Yes

If yes, what is the GPU card and driver?

No response

Please provide additional context if applicable. You can attach files too, but might need to rename to .txt or .zip

No response

The text was updated successfully, but these errors were encountered:

zisoft · 2024-11-22T12:25:02Z

@ralfbrown: First bad commit is

commit 898757fd0cfd71810740cb69e1fb28cf42454bf6
Author: Ralf Brown <[email protected]>
Date:   Sat Nov 16 07:58:53 2024 -0500

    fix early-exit crash in dt_control_quit and dt_control_shutdown
    
    Add synchronization variables to avoid an early user-requested exit
    causing a crash from trying to cleanup threads etc. which haven't
    actually been initialized yet.  Also avoid calling gtk_main() if we've
    already started shutdown by the time that point is reached, as that
    would cause a hang with no window displayed.

zisoft · 2024-11-23T10:08:30Z

I tracked down and found the reason for the crash in src/control/control.c:

darktable/src/control/control.c

Lines 303 to 332 in 2702ef6

    
           void dt_control_shutdown(dt_control_t *s) 
        
           { 
        
             if(!s) 
        
               return; 
        
             dt_pthread_mutex_lock(&s->cond_mutex); 
        
             dt_pthread_mutex_lock(&s->run_mutex); 
        
             const gboolean was_running = s->running; 
        
             s->running = FALSE; 
        
             dt_pthread_mutex_unlock(&s->run_mutex); 
        
             dt_pthread_mutex_unlock(&s->cond_mutex); 
        
             pthread_cond_broadcast(&s->cond); 
        
             /* first wait for gphoto device updater */ 
        
           #ifdef HAVE_GPHOTO2 
        
             pthread_join(s->update_gphoto_thread, NULL); 
        
           #endif 
        
             if(!was_running) 
        
               return;		// if not running, there are no threads to join 
        
             /* then wait for kick_on_workers_thread */ 
        
             pthread_join(s->kick_on_workers_thread, NULL); 
        
             for(int k = 0; k < s->num_threads; k++) 
        
               // pthread_kill(s->thread[k], 9); 
        
               pthread_join(s->thread[k], NULL); 
        
             for(int k = 0; k < DT_CTL_WORKER_RESERVED; k++) 
        
               // pthread_kill(s->thread_res[k], 9); 
        
               pthread_join(s->thread_res[k], NULL); 
        
           }

If I comment out 319+320 the crash is gone.

@ralfbrown: Not sure what was your intention here. It shouldn't hurt to wait for the threads to finish?

ralfbrown · 2024-11-23T16:53:18Z

But we were getting a hang because the threads had never been created if the user managed to request a program exit fast enough.....

Which of the join calls is actually causing the assertion failure? Do you still get the issue if you move line 323 above 319/320?

zisoft · 2024-11-23T17:18:07Z

Which of the join calls is actually causing the assertion failure?

The assertion failure is triggered when none of the join calls is made due to the early return above.

Do you still get the issue if you move line 323 above 319/320?

That does not help, but when I move the first loop above, the crash is gone.
So the code now looks as follows:

void dt_control_shutdown(dt_control_t *s)
{
  if(!s)
    return;
  dt_pthread_mutex_lock(&s->cond_mutex);
  dt_pthread_mutex_lock(&s->run_mutex);
  const gboolean was_running = s->running;
  s->running = FALSE;
  dt_pthread_mutex_unlock(&s->run_mutex);
  dt_pthread_mutex_unlock(&s->cond_mutex);
  pthread_cond_broadcast(&s->cond);

  /* first wait for gphoto device updater */
#ifdef HAVE_GPHOTO2
  pthread_join(s->update_gphoto_thread, NULL);
#endif

  /* then wait for kick_on_workers_thread */
  pthread_join(s->kick_on_workers_thread, NULL);

  for(int k = 0; k < s->num_threads; k++)
    // pthread_kill(s->thread[k], 9);
    pthread_join(s->thread[k], NULL);

  if(!was_running)
    return;		// if not running, there are no threads to join

  for(int k = 0; k < DT_CTL_WORKER_RESERVED; k++)
    // pthread_kill(s->thread_res[k], 9);
    pthread_join(s->thread_res[k], NULL);
}

ralfbrown · 2024-11-23T21:36:13Z

So now the question is whether it is possible with this change to trigger a crash or hang by typing Ctrl-Q or clicking on the main window's close button in the interval between the window appearing and it fully drawing. You can extend the interval during which you could click by disabling the splash screen (Ctrl-Q isn't affected, but the splash screen hides the initial appearance of the main window).

zisoft · 2024-11-24T08:29:48Z

I am not able to trigger this hang at all. Even not if the early return is commented out.

  // if(!was_running)
  //   return;		// if not running, there are no threads to join

So this change works ok on my Mac.

Just another hint which I found during my analysis: In src/common/darktable.c in function dt_cleanup() we have this code:

darktable/src/common/darktable.c

Lines 2039 to 2059 in 584d614

    
           dt_image_cache_cleanup(darktable.image_cache); 
        
           free(darktable.image_cache); 
        
           darktable.image_cache = NULL; 
        
           dt_mipmap_cache_cleanup(darktable.mipmap_cache); 
        
           free(darktable.mipmap_cache); 
        
           darktable.mipmap_cache = NULL; 
        
           if(init_gui) 
        
           { 
        
             dt_imageio_cleanup(darktable.imageio); 
        
             free(darktable.imageio); 
        
             darktable.imageio = NULL; 
        
             dt_control_shutdown(darktable.control); 
        
             dt_control_cleanup(darktable.control); 
        
             free(darktable.control); 
        
             darktable.control = NULL; 
        
             dt_undo_cleanup(darktable.undo); 
        
             darktable.undo = NULL; 
        
             free(darktable.gui); 
        
             darktable.gui = NULL; 
        
           }

If I put a breakpoint at line 2047 and quit darktable, the breakpoint gets hit. Clicking continue then quits darktable without the crash.
Setting the breakpoint at line 2050 makes the crash back again.
So we seem to have a race condition here. Delaying the execution of dt_imageio_cleanup(darktable.imageio); avoids the crash.

And indeed, putting a g_usleep(1000000); in front of it avoids the crash as well:

  if(init_gui)
  {
    g_usleep(1000000);
    dt_imageio_cleanup(darktable.imageio);
    free(darktable.imageio);
    darktable.imageio = NULL;
    dt_control_shutdown(darktable.control);
    dt_control_cleanup(darktable.control);
    free(darktable.control);
    darktable.control = NULL;
    dt_undo_cleanup(darktable.undo);
    darktable.undo = NULL;
    free(darktable.gui);
    darktable.gui = NULL;
  }

This is of course not a solution but maybe it helps to understand whats going on.

zisoft · 2024-11-25T07:46:44Z

The assertion failure is also reproducible on Windows and the change described above (move the early return between the two for-loops) fixes it as well.

jenshannoschwalm · 2024-11-27T06:20:21Z

So now the question is whether it is possible with this change to trigger a crash or hang by typing Ctrl-Q or clicking on the main window's close button in the interval between the window appearing and it fully drawing.

YES. If i remove the

if(!was_running)
    return;		// if not running, there are no threads to join

part to later i can easily trigger the same error - by pressing CTRL_Q - we had before @ralfbrown check work leading to a crash in

  /* then wait for kick_on_workers_thread */
  pthread_join(s->kick_on_workers_thread, NULL);

jenshannoschwalm · 2024-11-30T17:34:27Z

Just mentioning, on release build today I observed one crash in OpenCL and one with double free. No easy to trigger and can't provide a log unfortunately. But, more hints to a race condition while closing ...

zisoft · 2024-12-04T17:52:28Z

Did some further analysis here, unfortunately still without a real success.
Some issues I found:

several accesses to darktable.control->running throughout the code without a surrounding mutex lock
darktable.control->global_mutex is not destroyed in dt_control_cleanup()
setting a breakpoint in dt_control_cleanup() and single-stepping through the mutex_destroy calls gives assertion failures in dt_pthread_mutex_lock_with_caller in random order, not reproducible.

ralfbrown · 2024-12-04T18:53:29Z

A quick check leads me to believe that the biggest problem with control->running is that it gets set to TRUE before initialization has been completed. The mutex protecting that particular update is used only in that location, so using atomic access functions for reading/writing the variable (and not setting it to TRUE until the init is actually complete) is likely to fix any race conditions due to unprotected accesses.

jenshannoschwalm · 2024-12-05T21:03:56Z

@zisoft i think you can relax, i think i found the problem. PR following tomorrow.

jenshannoschwalm added this to the 5.0 milestone Nov 29, 2024

jenshannoschwalm added priority: high core features are broken and not usable at all, software crashes reproduce: confirmed a way to make the bug re-appear 99% of times has been found labels Nov 29, 2024

jenshannoschwalm mentioned this issue Dec 6, 2024

Fixing three regressions / bugs #17961

Merged

jenshannoschwalm closed this as completed Dec 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

crash on exit in debug mode #17876

crash on exit in debug mode #17876

zisoft commented Nov 22, 2024 •

edited

Loading

zisoft commented Nov 22, 2024

zisoft commented Nov 23, 2024

ralfbrown commented Nov 23, 2024

zisoft commented Nov 23, 2024 •

edited

Loading

ralfbrown commented Nov 23, 2024

zisoft commented Nov 24, 2024

zisoft commented Nov 25, 2024

jenshannoschwalm commented Nov 27, 2024

jenshannoschwalm commented Nov 30, 2024

zisoft commented Dec 4, 2024

ralfbrown commented Dec 4, 2024

jenshannoschwalm commented Dec 5, 2024

crash on exit in debug mode #17876

crash on exit in debug mode #17876

Comments

zisoft commented Nov 22, 2024 • edited Loading

Describe the bug

Steps to reproduce

Expected behavior

Logfile | Screenshot | Screencast

Commit

Where did you obtain darktable from?

darktable version

What OS are you using?

What is the version of your OS?

Describe your system?

Are you using OpenCL GPU in darktable?

If yes, what is the GPU card and driver?

Please provide additional context if applicable. You can attach files too, but might need to rename to .txt or .zip

zisoft commented Nov 22, 2024

zisoft commented Nov 23, 2024

ralfbrown commented Nov 23, 2024

zisoft commented Nov 23, 2024 • edited Loading

ralfbrown commented Nov 23, 2024

zisoft commented Nov 24, 2024

zisoft commented Nov 25, 2024

jenshannoschwalm commented Nov 27, 2024

jenshannoschwalm commented Nov 30, 2024

zisoft commented Dec 4, 2024

ralfbrown commented Dec 4, 2024

jenshannoschwalm commented Dec 5, 2024

zisoft commented Nov 22, 2024 •

edited

Loading

zisoft commented Nov 23, 2024 •

edited

Loading