Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crash on exit in debug mode #17876

Closed
zisoft opened this issue Nov 22, 2024 · 12 comments
Closed

crash on exit in debug mode #17876

zisoft opened this issue Nov 22, 2024 · 12 comments
Labels
priority: high core features are broken and not usable at all, software crashes reproduce: confirmed a way to make the bug re-appear 99% of times has been found
Milestone

Comments

@zisoft
Copy link
Collaborator

zisoft commented Nov 22, 2024

Describe the bug

When built with Debug mode, dt crashes reproducible on exit in line 195 of src/common/dtpthread.h:

// need to unlock last, to shield our internal data.
const int ret = pthread_mutex_unlock(&(mutex->mutex));
assert(!ret);
return ret;
}

Steps to reproduce

  • build darktable with Debug mode
  • clean config
  • start dt, do nothing, quit
  • crash

Expected behavior

No response

Logfile | Screenshot | Screencast

Assertion failed: (!ret), functiAssertion failed: (!ret), functiAssertion failed: (!ret), function dt_pthread_mutex_unlock_with_Assertion failed: (!ret), function dt_pthread_mutex_unlock_with_caller, file dtpthread.h, line 1caller, file dtpthread.h, line 1Assertion failed: (!ret), functi95.
on dt_pthread_mutex_unlock_with_Assertion failed: (!ret), functi95.
on dt_pthread_mutex_unlock_with_caller, file dtpthread.h, line 1Assertion failed: (!ret), function dt_pthread_mutex_unlock_with_95.
caller, file dtpthread.h, line 1on dt_pthread_mutex_unlock_with_Assertion failed: (!ret), functicaller, file dtpthread.h, line 195.
95.
on dt_pthread_mutex_unlock_with_caller, file dtpthread.h, line 1caller, file dtpthread.h, line 195.
on dt_pthread_mutex_unlock_with_caller, file dtpthread.h, line 195.

Commit

No response

Where did you obtain darktable from?

self compiled

darktable version

current master

What OS are you using?

Mac

What is the version of your OS?

macOS 15.1.1

Describe your system?

No response

Are you using OpenCL GPU in darktable?

Yes

If yes, what is the GPU card and driver?

No response

Please provide additional context if applicable. You can attach files too, but might need to rename to .txt or .zip

No response

@zisoft
Copy link
Collaborator Author

zisoft commented Nov 22, 2024

@ralfbrown: First bad commit is

commit 898757fd0cfd71810740cb69e1fb28cf42454bf6
Author: Ralf Brown <[email protected]>
Date:   Sat Nov 16 07:58:53 2024 -0500

    fix early-exit crash in dt_control_quit and dt_control_shutdown
    
    Add synchronization variables to avoid an early user-requested exit
    causing a crash from trying to cleanup threads etc. which haven't
    actually been initialized yet.  Also avoid calling gtk_main() if we've
    already started shutdown by the time that point is reached, as that
    would cause a hang with no window displayed.

@zisoft
Copy link
Collaborator Author

zisoft commented Nov 23, 2024

I tracked down and found the reason for the crash in src/control/control.c:

void dt_control_shutdown(dt_control_t *s)
{
if(!s)
return;
dt_pthread_mutex_lock(&s->cond_mutex);
dt_pthread_mutex_lock(&s->run_mutex);
const gboolean was_running = s->running;
s->running = FALSE;
dt_pthread_mutex_unlock(&s->run_mutex);
dt_pthread_mutex_unlock(&s->cond_mutex);
pthread_cond_broadcast(&s->cond);
/* first wait for gphoto device updater */
#ifdef HAVE_GPHOTO2
pthread_join(s->update_gphoto_thread, NULL);
#endif
if(!was_running)
return; // if not running, there are no threads to join
/* then wait for kick_on_workers_thread */
pthread_join(s->kick_on_workers_thread, NULL);
for(int k = 0; k < s->num_threads; k++)
// pthread_kill(s->thread[k], 9);
pthread_join(s->thread[k], NULL);
for(int k = 0; k < DT_CTL_WORKER_RESERVED; k++)
// pthread_kill(s->thread_res[k], 9);
pthread_join(s->thread_res[k], NULL);
}

If I comment out 319+320 the crash is gone.

@ralfbrown: Not sure what was your intention here. It shouldn't hurt to wait for the threads to finish?

@ralfbrown
Copy link
Collaborator

But we were getting a hang because the threads had never been created if the user managed to request a program exit fast enough.....

Which of the join calls is actually causing the assertion failure? Do you still get the issue if you move line 323 above 319/320?

@zisoft
Copy link
Collaborator Author

zisoft commented Nov 23, 2024

Which of the join calls is actually causing the assertion failure?

The assertion failure is triggered when none of the join calls is made due to the early return above.

Do you still get the issue if you move line 323 above 319/320?

That does not help, but when I move the first loop above, the crash is gone.
So the code now looks as follows:

void dt_control_shutdown(dt_control_t *s)
{
  if(!s)
    return;
  dt_pthread_mutex_lock(&s->cond_mutex);
  dt_pthread_mutex_lock(&s->run_mutex);
  const gboolean was_running = s->running;
  s->running = FALSE;
  dt_pthread_mutex_unlock(&s->run_mutex);
  dt_pthread_mutex_unlock(&s->cond_mutex);
  pthread_cond_broadcast(&s->cond);

  /* first wait for gphoto device updater */
#ifdef HAVE_GPHOTO2
  pthread_join(s->update_gphoto_thread, NULL);
#endif

  /* then wait for kick_on_workers_thread */
  pthread_join(s->kick_on_workers_thread, NULL);

  for(int k = 0; k < s->num_threads; k++)
    // pthread_kill(s->thread[k], 9);
    pthread_join(s->thread[k], NULL);

  if(!was_running)
    return;		// if not running, there are no threads to join

  for(int k = 0; k < DT_CTL_WORKER_RESERVED; k++)
    // pthread_kill(s->thread_res[k], 9);
    pthread_join(s->thread_res[k], NULL);
}

@ralfbrown
Copy link
Collaborator

So now the question is whether it is possible with this change to trigger a crash or hang by typing Ctrl-Q or clicking on the main window's close button in the interval between the window appearing and it fully drawing. You can extend the interval during which you could click by disabling the splash screen (Ctrl-Q isn't affected, but the splash screen hides the initial appearance of the main window).

@zisoft
Copy link
Collaborator Author

zisoft commented Nov 24, 2024

I am not able to trigger this hang at all. Even not if the early return is commented out.

  // if(!was_running)
  //   return;		// if not running, there are no threads to join

So this change works ok on my Mac.

Just another hint which I found during my analysis: In src/common/darktable.c in function dt_cleanup() we have this code:

dt_image_cache_cleanup(darktable.image_cache);
free(darktable.image_cache);
darktable.image_cache = NULL;
dt_mipmap_cache_cleanup(darktable.mipmap_cache);
free(darktable.mipmap_cache);
darktable.mipmap_cache = NULL;
if(init_gui)
{
dt_imageio_cleanup(darktable.imageio);
free(darktable.imageio);
darktable.imageio = NULL;
dt_control_shutdown(darktable.control);
dt_control_cleanup(darktable.control);
free(darktable.control);
darktable.control = NULL;
dt_undo_cleanup(darktable.undo);
darktable.undo = NULL;
free(darktable.gui);
darktable.gui = NULL;
}

If I put a breakpoint at line 2047 and quit darktable, the breakpoint gets hit. Clicking continue then quits darktable without the crash.
Setting the breakpoint at line 2050 makes the crash back again.
So we seem to have a race condition here. Delaying the execution of dt_imageio_cleanup(darktable.imageio); avoids the crash.

And indeed, putting a g_usleep(1000000); in front of it avoids the crash as well:

  if(init_gui)
  {
    g_usleep(1000000);
    dt_imageio_cleanup(darktable.imageio);
    free(darktable.imageio);
    darktable.imageio = NULL;
    dt_control_shutdown(darktable.control);
    dt_control_cleanup(darktable.control);
    free(darktable.control);
    darktable.control = NULL;
    dt_undo_cleanup(darktable.undo);
    darktable.undo = NULL;
    free(darktable.gui);
    darktable.gui = NULL;
  }

This is of course not a solution but maybe it helps to understand whats going on.

@zisoft
Copy link
Collaborator Author

zisoft commented Nov 25, 2024

The assertion failure is also reproducible on Windows and the change described above (move the early return between the two for-loops) fixes it as well.

@jenshannoschwalm
Copy link
Collaborator

So now the question is whether it is possible with this change to trigger a crash or hang by typing Ctrl-Q or clicking on the main window's close button in the interval between the window appearing and it fully drawing.

YES. If i remove the

if(!was_running)
    return;		// if not running, there are no threads to join

part to later i can easily trigger the same error - by pressing CTRL_Q - we had before @ralfbrown check work leading to a crash in

  /* then wait for kick_on_workers_thread */
  pthread_join(s->kick_on_workers_thread, NULL);

@jenshannoschwalm jenshannoschwalm added this to the 5.0 milestone Nov 29, 2024
@jenshannoschwalm jenshannoschwalm added priority: high core features are broken and not usable at all, software crashes reproduce: confirmed a way to make the bug re-appear 99% of times has been found labels Nov 29, 2024
@jenshannoschwalm
Copy link
Collaborator

Just mentioning, on release build today I observed one crash in OpenCL and one with double free. No easy to trigger and can't provide a log unfortunately. But, more hints to a race condition while closing ...

@zisoft
Copy link
Collaborator Author

zisoft commented Dec 4, 2024

Did some further analysis here, unfortunately still without a real success.
Some issues I found:

  • several accesses to darktable.control->running throughout the code without a surrounding mutex lock
  • darktable.control->global_mutex is not destroyed in dt_control_cleanup()
  • setting a breakpoint in dt_control_cleanup() and single-stepping through the mutex_destroy calls gives assertion failures in dt_pthread_mutex_lock_with_caller in random order, not reproducible.

@ralfbrown
Copy link
Collaborator

A quick check leads me to believe that the biggest problem with control->running is that it gets set to TRUE before initialization has been completed. The mutex protecting that particular update is used only in that location, so using atomic access functions for reading/writing the variable (and not setting it to TRUE until the init is actually complete) is likely to fix any race conditions due to unprotected accesses.

@jenshannoschwalm
Copy link
Collaborator

@zisoft i think you can relax, i think i found the problem. PR following tomorrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: high core features are broken and not usable at all, software crashes reproduce: confirmed a way to make the bug re-appear 99% of times has been found
Projects
None yet
Development

No branches or pull requests

3 participants