
Multiple clients with Margo 0.2 #33

Closed
carns opened this issue Mar 1, 2021 · 17 comments

Comments

carns commented Mar 1, 2021

In GitLab by @marcvef on Nov 9, 2017, 10:55

Hi,

I am using Margo as part of a preloading library, where I require two separate Margo clients that run with different NA plugins. One runs with the NA SM (shared memory) plugin, and the other with the NA BMI plugin over TCP (for now). Both are started within the caller's thread context. This worked with Margo 0.1 because users were responsible for initializing Argobots themselves.

Margo 0.2 starts Argobots transparently to the user. With two clients, this results in multiple Argobots initializations, and each one sets the caller's ES to idle with ABT-snoozer (via ABT_snoozer_xstream_self_set()). The latter appears to be the culprit: the second call to ABT_snoozer_xstream_self_set() (triggered by the second Margo client initialization) returns error code 29 ([../../src/stream.c:811] ABT_xstream_set_main_sched: 29). This aborts the whole client initialization process, and margo_init() returns MARGO_INSTANCE_NULL.

I've attached a small patch (margo_0.2_multiple_clients.patch) that highlights the issue and works around it by simply checking whether Argobots is already initialized and, if so, skipping both of the actions mentioned above. However, I question whether this is the correct way to use multiple clients in one application.

Thanks!

carns commented Mar 1, 2021

In GitLab by @carns on Nov 9, 2017, 11:09

Hi Marc,

First of all, thanks for the patch! We'll have a look at this.

One quick comment on how the margo initialization has changed in recent versions: you might want to have a look at the margo_init_pools() function as an alternative to margo_init(). It's a little more like the older-style margo init function, in that it assumes the caller has already configured the Argobots and Mercury resources ahead of time rather than doing it for them. This might be better for your use case?

The new margo_init() has become more of a default/convenience mode for users who aren't interested in advanced features and would rather hide the complexity of setting up those subsystems.

carns commented Mar 1, 2021

In GitLab by @carns on Nov 9, 2017, 11:11

You might also want to track this issue on the Mercury github:

mercury-hpc/mercury#75

There is a remaining todo item for the sm plugin in Mercury, which is to make its use transparent. Once that is implemented, the idea is that you would not need to explicitly manage the sm plugin; it would be used automatically when you issued an RPC to a process (via any plugin) that happened to be located on the same node.

carns commented Mar 1, 2021

In GitLab by @marcvef on Nov 9, 2017, 11:36

Hi Philip,

Happy to help and thanks a ton for your very quick reply and suggestions!

I agree that margo_init_pool() looks promising, as I already had the Mercury and Argobots instances set up for the previous Margo version. I will check it out further!

The Mercury issue looks interesting. Using shared memory automatically if the RPC destination is located on the same node would be very convenient for me and fits my use case perfectly.

carns commented Mar 1, 2021

In GitLab by @shanedsnyder on Nov 9, 2017, 16:53

mentioned in commit 4537b96

carns commented Mar 1, 2021

In GitLab by @shanedsnyder on Nov 9, 2017, 17:05

Hi Marc,

I just pushed a fix similar to your patch to the master branch of margo, in case you want to pull it down and try it out.

The only difference is that we set the ABT snoozer scheduler regardless of whether Argobots was initialized in `margo_init()` or by the caller. I think we want to ensure this scheduler is enabled when calling this code path -- as Phil points out, you can use `margo_init_pools()` if you want more control over the Argobots behavior.

Thanks again for the bug report!

carns commented Mar 1, 2021

In GitLab by @marcvef on Nov 10, 2017, 09:00

Hi Shane,

Thanks! Yes, I will migrate to margo_init_pool() to get more control. Also, I think the changes are definitely useful, and it makes sense to ensure that the ABT-snoozer scheduler is used. Unfortunately, this does not solve the initial issue, and other users might encounter the same problem if they use code similar to this snippet (the NA plugins are arbitrary):

```c
margo_instance_id mid_bmi;
margo_instance_id mid_sm;
mid_bmi = margo_init("bmi+tcp", MARGO_CLIENT_MODE, 0, 0);
/* [...] */
mid_sm = margo_init("na+sm", MARGO_CLIENT_MODE, 0, 0);
```

The second margo_init() will always fail with the error shown above when executing ABT_snoozer_xstream_self_set(). I dug a bit deeper and, as far as I understand, Argobots is currently unable to set the main scheduler while the current xstream has work units in its associated pools (this is also a TODO in ABT_xstream_set_main_sched() in Argobots' src/stream.c). The resulting error makes the second margo_init() fail. The running work unit is the Mercury progress ABT thread, which is triggered by the first margo_init() and created in margo_init_pool().

Although this behavior affects Margo, perhaps ABT-snoozer or Argobots should be responsible for dealing with this issue? In the meantime, Margo or ABT-snoozer could watch for this case and present the user with a descriptive error, perhaps along these lines (margo_sched_handling.patch).

Hope this helps.

carns commented Mar 1, 2021

In GitLab by @shanedsnyder on Nov 20, 2017, 14:52

Hi Marc,

First, apologies for taking a while to get back to you, but I was mostly distracted at SC last week.

I see what you mean re: the error from calling ABT_xstream_set_main_sched() twice on the same xstream. I had looked through the code and thought that would be an idempotent call, but missed the TODO note implying that you can't call that function on an xstream that already has work units. I should have tested those changes rather than trying to guess from reading the code...

Ultimately, I think the semantic we want is that, when calling margo_init(), the calling xstream is set to use the ABT-snoozer scheduler. As you point out, we can't just blindly set the snoozer scheduler, because a previous call to margo_init() may have already set it. It looks like your patch works around this by checking that the xstream calling margo_init() does not have any work units before setting the scheduler. That looks like a reasonable solution, but I wonder whether this fix (or something similar) should be incorporated directly into ABT-snoozer itself. ABT_snoozer_xstream_self_set() looks to be a convenience call (i.e., it does not mimic any class of calls in the ABT API), so the semantic is really up to us. Should it propagate back the error when attempting to set the scheduler on a running xstream, as it does now (leaving it to the user to work around the problem, a la Marc's patch)? Or should ABT-snoozer explicitly avoid setting the scheduler on a running xstream and just return success in that case (even though we can't guarantee the running scheduler is the ABT-snoozer one)?

@carns, do you have an opinion?

At any rate, thanks again for pointing out the issue and taking time to debug it. Very helpful.

carns commented Mar 1, 2021

In GitLab by @carns on Nov 20, 2017, 15:00

My preference would be a fix in snoozer, as @shanedsnyder points out. It would be great if we could have this semantic somehow:

  • call self_set() when snoozer is already activated: return success
  • call self_set() when snoozer was not previously activated, but we can activate it now: return success
  • call self_set() when snoozer is not activated and work units prevent us from changing the scheduler: return error

The first two cases make the call idempotent at the snoozer level. The third prevents someone from accidentally thinking they've activated snoozer when they haven't.

Margo can handle that third case by printing a warning and continuing on; it doesn't have to be treated as a critical error at that point in the stack.

carns commented Mar 1, 2021

In GitLab by @shanedsnyder on Nov 20, 2017, 16:54

I agree with you in principle, @carns, but I don't think we can distinguish between case 1 and case 3 above within abt-snoozer. All we can do is check whether work units prevent us from setting a new scheduler; we can't determine whether the current scheduler is an abt-snoozer scheduler or something else. Unless you know of a way to accomplish this? Or are you advocating storing some sort of additional metadata within abt-snoozer to determine this at runtime (a list of xstreams for which abt-snoozer has been enabled, for instance)?

If not, then I have to change my mind and advocate that we just leave abt-snoozer alone and integrate Marc's patch. That means ABT_snoozer_xstream_self_set() would handle the error the same way Argobots does (which I think is desirable), in that setting a scheduler on an xstream with scheduled work units results in an error. A future ABT implementation may no longer treat this as an error, but we won't have to update ABT-snoozer if that change is made; the call just becomes idempotent at that point. Then we temporarily (or permanently) work around this issue in Margo using Marc's patch, with one additional modification: if the calling xstream already has work units scheduled, a warning message is printed instead of returning an error.

Does that sound right?

carns commented Mar 1, 2021

In GitLab by @marcvef on Nov 29, 2017, 08:14

Hi Shane, Philip,

My apologies that it took me a few days to get back to you.

From my perspective, a warning should be fine if work units are already scheduled on the xstream. But, at the same time, you would no longer ensure that ABT-snoozer is enabled for this code path (which is the reason my patch errors out). Perhaps this could be stated in the warning, so that the responsibility is handed to the user? The problem is that this might raise more questions for users who don't know what the message means, given that initializing Argobots with ABT-snoozer is abstracted away in margo_init() as of Margo 0.2.

Also, while looking into the code, I couldn't find a way to check which scheduler is currently in use; with such a check, ABT-snoozer wouldn't need to set the scheduler again when ABT_snoozer_xstream_self_set() is called. Considering that, I agree (with my limited knowledge of Argobots' code) that we cannot distinguish between cases 1 and 3.

carns commented Mar 1, 2021

In GitLab by @carns on Jan 2, 2018, 17:19

I agree, I don't see any way to distinguish cases 1 and 3 either (it seems logical that you could check the scheduler type somehow, but I don't see how to do it in the ABT API). We could open an Argobots issue on GitHub to request that functionality, but in the meantime maybe all we can do is treat the xstream_self_set() call as "best effort" and not fail the whole Margo initialization if that subroutine fails?

That's not very satisfying because it leaves the possibility that in some case you may end up without setting the snoozer scheduler at all, but for the use cases we care about right now it would be fine.

carns commented Mar 1, 2021

In GitLab by @shanedsnyder on Jan 4, 2018, 19:15

OK, I submitted an issue on Argobots' GitHub to ask about this problem: pmodels/argobots#31

Until we hear back on that, my plan is to integrate Marc's patch with one change: we do not fail the entire margo initialization, but instead print a warning that is as helpful as possible. Not the greatest long-term solution, but I think it's probably better to at least allow margo to try to continue gracefully. My understanding is that the abt-snoozer scheduler is more of an optimization over the default ABT schedulers that better accounts for high-latency operations (like I/O), so not using it should mainly affect performance, which doesn't seem to warrant bombing out.

Sound good?

carns commented Mar 1, 2021

In GitLab by @carns on Jan 5, 2018, 10:39

Sounds good. If the snoozer scheduler is not activated then CPU usage will be higher but it won't affect correctness at least.

carns commented Mar 1, 2021

In GitLab by @carns on Apr 5, 2018, 11:46

We will need to re-test this behavior if pmodels/argobots#49 is accepted in Argobots. It will change the strategy Margo uses to install schedulers.

carns commented Mar 1, 2021

In GitLab by @mdorier on Jun 24, 2019, 09:08

Is this issue fixed, now that we don't have ABT-snoozer anymore?

carns commented Mar 1, 2021

In GitLab by @carns on Jun 24, 2019, 10:24

I think we can deprecate this now and open a new ticket if we encounter a problem (if nothing else, the technical details / reproducer would be different now because of all of the changes since then).

carns commented Mar 1, 2021

In GitLab by @carns on Jun 24, 2019, 10:24

closed

@carns carns closed this as completed Mar 1, 2021