[EPIC] Realm profiling #1777
Comments
IMHO it is definitely desirable to have Realm dump pre-digested profiling information (replacing / in addition to profiling callbacks), for visualization in legion-prof, to support the profiling of pure-Realm applications. AFAIU most of the concepts in the profiler should apply to Realm (with some exceptions, e.g. critical path information). |
A couple thoughts:
|
The answer to this is definitely a 'no'.
I think the main information that we are missing is:
I think at a minimum, this information should be somehow available through Realm's existing
I agree that Legion Prof can be used as a visualizer for Realm programs. I think most of the relationships that Legion describes today could even be inferred from general Realm programs (see below). If you ignore the Legion-specific logging statements, Legion Prof should be usable as a profiler for general Realm applications. If Legion Prof doesn't work when rendering Realm-only programs, then we should do what we need to do in order to fix that.
Actually the critical path stuff that is there now should work on generic Realm applications too. You'll need to do some extra logging like for every |
The network profiling stuff would probably be super helpful for the Slingshot-11 issues we're seeing. I would personally still like some sort of optional "live mode" for profiling Realm as mentioned in #1607, as we very frequently have no information at all when we are trying to debug hangs and freezes at the scale of 2048, 4096, and 8192 nodes. It's very frustrating. |
I don't think this is the right place to have that discussion. (Probably deserves its own issue because of the complexity involved; I don't even know how you would make that work since a lot of what we need to do involves needing the whole profile logs.) It does motivate the need for Realm to continue to make all of its profiling results available dynamically through the profiling response interface though. |
Using Legion Prof to profile Realm does sound reasonable given that the two have a lot in common and significant effort has been put into developing Legion Prof. However, I am not convinced about the effort it may take. @elliottslaughter Is there a good pointer you can provide about the Legion Prof logging formats and some details on how it's all being stored right now? |
That means we need to extend the Realm profiling API to cover this internal profiling data, such as active messages, bgwork, etc. There could be two issues:
I think for realm applications, we need to directly dump the profiling data into files (in legion prof, nvtx, or whatever formats), |
This is exactly what I was talking about by asking about the format and existing infrastructure that is used to manage the underlying storage. |
The questions about generating the profiling format are for @lightsighter, as this is all code that resides in Legion at the moment. I suppose it's possible that code could be factored out to make it generally available to other users. (I don't personally think it's worth duplicating since the format is fairly specific.) |
@elliottslaughter I do not think the format @apryakhin mentioned is the Legion format. If I understand it correctly, we will bypass Legion and let Realm dump its profiling data into files that can be read by the Legion Prof viewer directly. |
That's the format that If you want to know how to write to that format, there is exactly one place we currently do that: in the Legion code. If you want to see the parser side code, you can look at Rust, but I think it's more informative to read the serializer code that actually writes the data (which also happens to be in C++). If you're talking about the archive format produced by |
I feel like we might need to deal with the archive format if we pick the Legion Prof viewer, because the data generated by the current Realm profiling API does not 100% align with Legion Prof, e.g. we do not have the notion of meta tasks or mapper calls in Realm, so we cannot directly use the Rust serializer to read Realm data. |
@lightsighter addressed this here: #1777 (comment)
Mike and I are in agreement on this. And like I said above, a lot of the value-add for Legion Prof is in the processing stage, so you're really missing out (and duplicating work) if you skip that. The UI is nice, and has a lot of relevant usability features (besides being very scalable), but it's fundamentally a dumb viewer. It just shows what you tell it to show. All the fancy business logic is in the core |
The Legion Prof logging format is dirt simple. There's just a bunch of structures for various kinds of logging statements, those get dumped into a zlib file, and then Legion Prof parses each of them on the other side. For a stand-alone Realm application you should be able to completely ignore all the Legion-specific logging statements (which are a minority of the statements) and just use the general Realm logging statements and Legion Prof should still be able to render things. If there are exceptions to that @elliottslaughter or I will fix them. I'm even willing to build the library on top of Realm's existing interface to do this for generic Realm programs myself if it's not obvious how to do it. I still think the most important problem we should be discussing here is how to get the data for the four questions I asked in my previous comment exposed through the Realm profiling response interface. That is going to be the hard thing to figure out. Figuring out how to render the data will be easy in comparison. |
I think 1 and 2 are not difficult to get. I do not have an answer for 3 and 4, because I am not very familiar with them. However, I am not sure if we want to expose them via the Realm profiling API. For the bgwork, if we create a profiling response every time we pick up a bgwork item, there might be too many of them. Actually, I do not think Realm users want to use the profiling API to get such internal profiling data, e.g. the bgwork, because there is not even a public API for bgwork. Network might be a different story, because we may attach that network data to Realm copies. As I said in my previous comment #1777 (comment), I do not think the Realm profiling API is the right way for Realm applications. I think the current Realm profiling API is more like CUPTI, where applications can use it to get online profiling data, but if CUDA users do not use CUPTI, they can still use Nsight to profile CUDA applications.
I am not sure if I understand it. Realm applications do not have to use the Profiling API, but just run with your library? |
I'm suggesting that we make such a public API for profiling bgwork. I agree that it can't give a response for every single bgwork item as that will be overwhelming, hence the reason that designing this well will be challenging.
I think all Realm profiling data should be available dynamically. @syamajala made a prescient comment about potentially wanting to do online rendering of the profile. This isn't something that Legion Prof supports today, but is something that would be good to support in the near future. Legion mappers might also want dynamic information about what's going on with the background worker threads as well. I think we should always support dynamic online profiling solutions and that will also enable offline profiling as well.
Yes, the library will provide drop-in replacements for many Realm API calls and the user will use those instead of canonical Realm API calls. The library will then capture all the needed data and add necessary profiling requests to get all the information that it needs and log it out to zlib files. It would be really nice if Realm had hooks for API calls (like MPI has with PMPI) so we could easily capture this information, but I think we can work around not having them for now. |
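To make the drop-in replacement idea above a bit more concrete, here is a minimal sketch of a wrapper that forwards a call and records when it started and finished. It is only an illustration under stated assumptions: `CallRecord`, `ProfileLog`, `ScopedRecord`, and `profiled_call` are hypothetical names, no real Realm calls are used, and an in-memory vector stands in for the zlib log file.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record type; the real Legion Prof record layout is not reproduced here.
struct CallRecord {
  std::string name;
  std::int64_t start_ns;
  std::int64_t stop_ns;
};

// Hypothetical in-memory sink standing in for the compressed log file.
class ProfileLog {
public:
  void add(CallRecord r) { records_.push_back(std::move(r)); }
  const std::vector<CallRecord> &records() const { return records_; }

private:
  std::vector<CallRecord> records_;
};

inline std::int64_t now_ns() {
  using namespace std::chrono;
  return duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count();
}

// Appends one record when the wrapped call finishes, whatever its return type.
struct ScopedRecord {
  ProfileLog &log;
  std::string name;
  std::int64_t start_ns;
  ~ScopedRecord() {
    const std::int64_t stop_ns = now_ns();
    assert(stop_ns >= start_ns); // steady_clock is monotonic
    log.add({name, start_ns, stop_ns});
  }
};

// Wrapper with the same call shape as the function it replaces (hypothetical).
template <typename Fn, typename... Args>
decltype(auto) profiled_call(ProfileLog &log, std::string name, Fn &&fn, Args &&...args) {
  ScopedRecord guard{log, std::move(name), now_ns()};
  return std::forward<Fn>(fn)(std::forward<Args>(args)...);
}
```

A caller would write something like `profiled_call(log, "spawn", underlying_fn, args...)` in place of calling `underlying_fn(args...)` directly. A PMPI-style hook would remove the need for the caller to change even that much.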
I am OK with online profiling. I am just worried that if we want to expose so much internal data via the online profiling API, it will be a challenge to keep the overhead as low as possible. We need to cache data in memory, and then send it to the node where the response task is executed. Regarding the PMPI-style hooks, I guess it won't work until we have an ABI-stable API. |
Maybe I'm missing something, but I believe the profiling API only responds to what you ask it for. So this would then be on Legion (or the standalone Realm profiling mode) to determine how much to ask for. We already have other tradeoffs in Legion where you can turn on modes that enable more profiling data to be collected. Fundamentally, if you're chasing down an issue in Realm bgwork tasks, you need to see those in the profile. Right now we can't see them at all. I'm not sure whether we should make them visible by default, but having it as an option seems important in the long run. |
Both of these are touching on why doing this kind of profiling is challenging. I think we'll need to define different "resolutions" of profiling for the client to ask for, since some people will want all the data regardless of the cost and some other people will want to see a summary of the data in a "compressed" form that throws away some information but ensures that we log less data and minimize overheads to the actual application execution. We're going to need some knobs to turn to adjust the granularity of the profiling data we want to record, but it should be up to the user to declare what kind of resolution they want.
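One possible shape for such a "resolution" knob, sketched under assumptions: the recorder below either keeps every bgwork sample or keeps only a per-kind count and total time. `Resolution`, `ItemSample`, `KindSummary`, and `BgworkRecorder` are hypothetical names for illustration and are not part of Realm.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical knob: how much detail the client asked for.
enum class Resolution { Full, Summary };

struct ItemSample {
  std::string kind;
  std::int64_t duration_ns;
};

struct KindSummary {
  std::uint64_t count = 0;
  std::int64_t total_ns = 0;
};

class BgworkRecorder {
public:
  explicit BgworkRecorder(Resolution r) : resolution_(r) {}

  void record(const std::string &kind, std::int64_t duration_ns) {
    assert(duration_ns >= 0);
    // Full resolution keeps every sample; this is the expensive path.
    if (resolution_ == Resolution::Full) samples_.push_back({kind, duration_ns});
    // The compressed summary is always cheap to maintain.
    KindSummary &s = summary_[kind];
    s.count += 1;
    s.total_ns += duration_ns;
  }

  const std::vector<ItemSample> &samples() const { return samples_; }
  const std::map<std::string, KindSummary> &summary() const { return summary_; }

private:
  Resolution resolution_;
  std::vector<ItemSample> samples_;
  std::map<std::string, KindSummary> summary_;
};
```

With `Resolution::Summary` the memory and logging cost is bounded by the number of distinct work-item kinds rather than the number of work items, at the price of losing per-item timing.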
Understood. We don't need it right away. Would just be a nice feature to have eventually as it would make it possible to have this library be a completely drop-in replacement without any code changes. |
That has already been done by Sean in the "custom bgwork profiler" branch
Fundamentally that's close to the item 4 (see below) in terms of how to handle this. We have access to all bgwork info and can collect/organize the data.
I am pretty sure this is something GASNet/UCX know how to determine now. It should be as simple as the number of outstanding requests submitted to the backend vs. the internal limit on the number of those outstanding requests? I don't see a problem getting this information unless I am not thinking broadly enough about the problem statement.
This one is harder but not infinitely hard. To do this type of breakdown we just need to find a way to order and store all |
Okay, a couple of key points discussed so far, just to summarize:
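The "outstanding requests vs. internal limit" comparison could be tracked with a counter along these lines. This is only a sketch under assumptions: `InflightTracker` is a hypothetical name, and nothing here reflects how GASNet or UCX actually expose this information.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tracker for requests handed to a network backend.
class InflightTracker {
public:
  explicit InflightTracker(std::size_t limit) : limit_(limit) { assert(limit_ > 0); }

  // Returns false (and counts a stall) when the backend is already at its limit.
  bool try_submit() {
    if (in_flight_ == limit_) {
      ++stalls_;
      return false;
    }
    ++in_flight_;
    return true;
  }

  // Called when the backend reports that a request finished.
  void complete() {
    assert(in_flight_ > 0);
    --in_flight_;
  }

  std::size_t in_flight() const { return in_flight_; }
  std::size_t stalls() const { return stalls_; }

  // Fraction of the limit currently in use, from 0.0 to 1.0.
  double utilization() const {
    return static_cast<double>(in_flight_) / static_cast<double>(limit_);
  }

private:
  std::size_t limit_;
  std::size_t in_flight_ = 0;
  std::size_t stalls_ = 0;
};
```

Sampling `utilization()` and `stalls()` over time would give a rough picture of whether the network backend is the bottleneck.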
|
Yes, that's correct. Whatever profiling data Realm collects, there should be a way to get at it dynamically through the profiling request interface.
I've developed a prototype version of this library here: |
We currently do not have a dedicated way to profile Realm. At this stage, the primary "client" of Realm is Legion, which uses its own custom profiling solution. For analyzing Realm, we rely on the existing logging infrastructure, which may not always be the best approach, particularly when dealing with complex applications.
This is an umbrella issue where I propose addressing the following questions: What specific information are we missing today? And, in conjunction with this, what data is the current Legion profiler lacking? Does it provide all the insights needed to fully understand what's happening within Realm?
If we reach a consensus on the need for a custom Realm profiler, we can document the requirements and assess existing solutions to determine if any are suitable, for instance Tracy. Additionally, we've made an initial attempt to integrate Realm with Nsight Profiler by adding a library to inject NVTX tags. We need to decide whether we want to continue developing this integration. If so, will it be sufficient for all users? If not, should we maintain it alongside another profiling solution? There are a number of older attempts scattered across different branches, for example the bgwork profiler or unmaintained profiling infrastructure, that also need to be looked into.
On a related note, would it be worth considering the development of a generic profiling layer within Realm that could provide "adapters" for whichever solution we choose to support (if we choose to support any)?
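A rough sketch of what such a layer with "adapters" might look like, purely to anchor the discussion: one abstract backend interface and a layer that fans events out to whichever adapters are attached. `ProfilerBackend`, `TextBackend`, and `ProfilingLayer` are hypothetical names, and the text adapter is a stand-in for real ones (NVTX, Legion Prof, Tracy) whose APIs are deliberately not shown.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical adapter interface; one implementation per supported tool.
class ProfilerBackend {
public:
  virtual ~ProfilerBackend() = default;
  virtual void range_begin(const std::string &name) = 0;
  virtual void range_end(const std::string &name) = 0;
};

// Stand-in adapter that just records text lines.
class TextBackend : public ProfilerBackend {
public:
  void range_begin(const std::string &name) override { lines_.push_back("begin " + name); }
  void range_end(const std::string &name) override { lines_.push_back("end " + name); }
  const std::vector<std::string> &lines() const { return lines_; }

private:
  std::vector<std::string> lines_;
};

// Generic layer: instrumentation calls this, never a specific tool.
class ProfilingLayer {
public:
  void attach(std::shared_ptr<ProfilerBackend> backend) {
    assert(backend != nullptr);
    backends_.push_back(std::move(backend));
  }
  void range_begin(const std::string &name) {
    for (auto &b : backends_) b->range_begin(name);
  }
  void range_end(const std::string &name) {
    for (auto &b : backends_) b->range_end(name);
  }

private:
  std::vector<std::shared_ptr<ProfilerBackend>> backends_;
};
```

The trade-off is the usual one for adapter layers: the interface can only expose what every backend can represent, so tool-specific features would need extension points or would be lost.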
cc @muraj @elliottslaughter @manopapad @lightsighter @eddy16112 @magnatelee