Prominently document that `Mutex`, `Condition`, ... might not behave as expected with Domainslib #127

michael-schwarz · 2024-05-02T12:17:10Z

Thank you for this nice project, we found it quite helpful in our ongoing efforts to parallelize a fixpoint algorithm in OCaml.

A quick suggestion: it might be a good idea to prominently document that `Mutex` and `Condition` do not work out of the box as one might expect when combined with Domainslib. This would help people new to Multicore OCaml avoid a potentially time-consuming rabbit hole. (Apologies if such a remark already exists somewhere; I re-checked and still could not find one.)

Details (can be skipped by people familiar with the difference in behavior)

It took us quite a while to understand why our algorithm was not terminating and sometimes throwing exceptions, and we managed to extract this example:

open Domainslib

let main () =
  let mutex = Mutex.create () in
  let pool = Task.setup_pool ~num_domains:2 () in
  let task () =
    for _i = 0 to 1000 do
      Mutex.lock mutex;
      (* The mutex is held across an await: the task may be suspended here
         and later resumed on a different domain of the pool. *)
      let work = Task.async pool (fun () -> ()) in
      Task.await pool work;
      Mutex.unlock mutex
    done
  in
  Task.run pool (fun () ->
    let p = Task.async pool (fun () -> task ()) in
    let p1 = Task.async pool (fun () -> task ()) in
    let p2 = Task.async pool (fun () -> task ()) in
    let p3 = Task.async pool (fun () -> task ()) in
    Task.await pool p;
    Task.await pool p1;
    Task.await pool p2;
    Task.await pool p3);
  Task.teardown_pool pool

let _ = main ()

which will either crash with

michael@michael-XPS-13-9360:~/Documents/td-parallel$ _build/default/mutexproblem.exe 
Locking thread different from unlocking thread
Fatal error: exception Sys_error("Mutex.unlock: Operation not permitted")

or deadlock.
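The crash can be reproduced in isolation without Domainslib. The following is a minimal sketch, assuming OCaml 5 on a POSIX platform where the Stdlib mutexes are error-checking; the function name `unlock_from_other_domain` is made up for illustration. It locks a mutex on one domain and tries to unlock it on another, which is presumably what happens in the example above when a task is suspended in `Task.await` and resumed by a different worker domain.

```ocaml
(* Stdlib only (OCaml 5). Lock a mutex on the current domain, then try to
   unlock it from a freshly spawned domain. Returns [true] if that unlock
   attempt raised [Sys_error]. *)
let unlock_from_other_domain () =
  let m = Mutex.create () in
  Mutex.lock m;
  let d =
    Domain.spawn (fun () ->
        match Mutex.unlock m with
        | () -> false
        | exception Sys_error _ -> true)
  in
  let raised = Domain.join d in
  (* If the foreign unlock was rejected, this domain still owns the mutex
     and must release it itself. *)
  if raised then Mutex.unlock m;
  raised

let () =
  if unlock_from_other_domain () then
    print_endline "unlocking from another domain raised Sys_error"
  else
    print_endline "unlocking from another domain was accepted"
```

The deadlock variant plausibly has the same root cause: while the task holding the mutex is suspended, a worker domain can pick up another task that blocks in `Mutex.lock`, and a blocked domain cannot run the continuation that would eventually unlock.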

We hit a similar problem when we tried to use a condition variable to wait until a certain number of tasks had reached a certain point: with n domains, it deadlocked as soon as n tasks had reached that point.

After looking into how Domainslib works, it of course becomes clear that one would have to use something akin to, e.g., https://github.com/ocaml-multicore/domain-local-await.
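Where the synchronization point can be expressed as a dependency between tasks, one alternative that needs no blocking primitive at all is to split the work into phases and await the promises of the first phase before starting the second. This is only a sketch of that idea, not what our actual algorithm does; `two_phase` and the arithmetic inside it are placeholders.

```ocaml
open Domainslib

(* Phase 1 computes one value per task. Awaiting all phase-1 promises acts
   as the barrier, so no Mutex or Condition is needed. Phase 2 then uses
   the combined phase-1 result. *)
let two_phase pool n =
  Task.run pool (fun () ->
      let phase1 =
        List.init n (fun i -> Task.async pool (fun () -> i * i))
      in
      (* Barrier: every phase-1 task has finished after this line. *)
      let partial = List.map (Task.await pool) phase1 in
      let total = List.fold_left ( + ) 0 partial in
      let phase2 =
        List.map (fun x -> Task.async pool (fun () -> total - x)) partial
      in
      List.map (Task.await pool) phase2)

let () =
  let pool = Task.setup_pool ~num_domains:2 () in
  let result = two_phase pool 4 in
  Task.teardown_pool pool;
  List.iter (Printf.printf "%d ") result;
  print_newline ()
```

`Task.await` suspends only the calling task rather than blocking the whole domain, so the workers stay free to run the remaining tasks while the barrier is pending.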

polytypic · 2024-05-03T10:18:36Z

Yes, this has been a known issue for a long time. See issue #126 here and remark here, for example.

You mentioned domain-local-await. Yes, that currently works with Domainslib and Eio. I'm working on Picos, which aims to provide a more comprehensive and more widely accepted solution to interoperability and to replace domain-local-await and domain-local-timeout. Picos already provides replacements for the Stdlib `Mutex` and `Condition`. Unfortunately, no existing scheduler (aside from the sample schedulers in the Picos package) currently provides full compatibility with Picos. Hopefully we'll get a chance at some point to rewrite the internals of Domainslib to use Picos.
