-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds a Data Transport Layer to DYAD to support different ways of transferring data #24
Conversation
@JaeseungYeom I've removed the last bit of excess logging and retested. So, this PR is now ready-for-review. |
Thanks to some discussion with @grondo, we identified that a lot of symbols are being exported from DYAD's libraries that shouldn't be exported. I am in the process of correcting this. One question related to symbol exporting @JaeseungYeom: what do we want to do about |
I think you can treat dyad_sync_directory similarly to other DAYD internal functions. |
Notes from review:
|
If 3 is available and 2 is working, do we still need 1? |
I believe they give different errors:
|
I agree with that trying to see if option 3 is available first and then do option 2. For 1, it sounds to me it does not add any other functionality if either 2 or 3 works. If endpoint setup fails, would the entire DYAD shutdown or do we expect to retry? If the latter, I see it could be useful. |
The only benefit is detection. Hanging errors are hard to detect. And this would lead to one such case. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should add ```#if defined(__cplusplus)
extern "C" {
#endif // defined(__cplusplus)
#if defined(__cplusplus)
};
#endif // defined(__cplusplus)``` to headers. I am still going through this PR, but you can start addressing.
src/core/dyad_core.c
Outdated
const dyad_ctx_t* ctx, | ||
const dyad_kvs_response_t* restrict kvs_data, | ||
const char** file_data, | ||
int* file_len, | ||
flux_future_t** f) | ||
size_t* file_len) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add restrict. Not just here but any non const pointer in any other function prototypes.
Also, there is restrict used somewhere else. I remember you did some checking in configure where restrict is available or restrict is available. You need to unify these to the one detected as available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be a separate PR if that works better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added restrict to the latest commits. However, I should mention that, due to stuff with the Flux handle, I can't add it in many places without telling the compiler incorrect things (which will break the build). I think there were only 3 places where I could definitely add it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can show me the problem later.
src/core/dyad_core.c
Outdated
fprintf (stderr, "Invalid DTL mode provided through %s. \ | ||
Defaulting to UCX\n", DYAD_DTL_MODE_ENV); | ||
} | ||
dtl_mode = DYAD_DTL_UCX; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember we had this discussion, and somehow agreed to use UCX as the default. However, now that I think about it, it might be better to use flux rpc because it will be available as long as dyad depends on flux. I know this will change in the future but I am not sure if UCX is universally available. If you think that is the case, this is fine. Otherwise, it should be flux rpc. Or we somehow detect the UCX dependency has been picked up during configuration, a MACRO variable can be set which helps to determine the appropriate default. If needed, this could be another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think having UCX as default makes sense for the time being. Most systems nowadays have UCX, mainly because its used by MPI. Also, right now, the UCX dependency is required, so DYAD won't build without UCX.
We can think more about this later, but I think that should be a separate PR if we change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
libfabric is another to consider. AWS EFA is based on libfabric for example. So, we should definitely detect what is available and use it as the default instead of assuming UCX availability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is can be a followup PR. Even if libfabric is not supported, UCX availability should be checked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requiring UCX as the dependency is also problematic in the short term. Vanessa would not be able to try this if that is the case.
src/core/dyad_core.c
Outdated
flux_future_t *f; | ||
json_t* rpc_payload; | ||
DYAD_LOG_INFO (ctx, "Packing payload for RPC to DYAD module"); | ||
rc = ctx->dtl_handle->rpc_pack ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pattern is somewhat akward.
I will have to look into rpc_pack() to see why it has to work this way.
However, if ctx->dtl_handle->rpc_pack() always expects ctx->dtl_handle to operate correctly, there can be a better interface.
Edit: I see why you have to do that. We can merge it as is and try find a better interface later. Least we can do is using some macro or a global variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree its awkward. I spent quite a bit of time looking for an alternative, but I couldn't find one.
A macro could work for simplifying this, but I'd highly recommend against a global variable. We don't necessarily want the context to always be shared across functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not like global variable approach as well. However, context does not need to be shared for the sake of that. Only dtl handle need to be global, which will be read-only once initialized.
src/dtl/flux_dtl.c
Outdated
"upath", | ||
upath | ||
); | ||
if (errcode < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: Check this with a macro (DYAD_IS_ERROR) as it is done some other places. Also, rc is better than errorcode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are similar lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check should not use DYAD_IS_ERROR
because it's not checking a DYAD return code. It's checking a Flux return code. The choice of variable name was also intentional to emphasize this difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not need to be DYAD_IS_ERROR. It can be FLUX_IS_ERROR for example. error code is misleading because it is not necessarily a code for error like errno.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just pushed a new commit that deals with this.
|
||
if (flux_msg_get_userid (msg, &userid) < 0) | ||
goto error; | ||
if (!flux_msg_is_streaming (msg)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have a discussion on what the best terms is for "streaming" later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. FWIW, in this context, the term "streaming" refers to Flux's streaming RPCs (i.e., an RPC which can send multiple return messages and is ended with an error return message with errno set to ENODATA
).
@@ -17,11 +21,14 @@ | |||
do { \ | |||
} while (0) | |||
#else | |||
#define DYAD_LOG_INFO(dyad_ctx, ...) \ | |||
flux_log (dyad_ctx->h, LOG_INFO, __VA_ARGS__) | |||
#define DYAD_LOG_INFO(dyad_ctx, ...) flux_log (dyad_ctx->h, LOG_INFO, __VA_ARGS__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
space
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there are not spaces after function-like macros in flux-core
, I'd argue that we shouldn't add spaces in DYAD either.
|
||
typedef enum dyad_core_return_codes dyad_rc_t; | ||
|
||
#define DYAD_IS_ERROR(code) ((code) < 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
space
|
||
#define DYAD_IS_ERROR(code) ((code) < 0) | ||
|
||
#define FLUX_IS_ERROR(code) ((code) < 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
space
These sublibraries provide a common interface to sending/receiving data using different tools. The tools currently supported in the DTL sublibraries are: * Flux RPC (i.e., how DYAD has previously moved data) * UCX To control which tool is used, users can set the DYAD_DTL_MODE environment variable for the APIs. For the module/service, users specify which tool to use by passing a second argument on the command line. For both APIs and the module/service, the default DTL mode is Flux RPC.
…codes for DYAD based on DTL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are some of my feedback in the code.
src/core/dyad_core.c
Outdated
goto get_done; | ||
} | ||
DYAD_LOG_INFO (ctx, "Receive RPC response from DYAD module"); | ||
rc = dyad_dtl_recv_rpc_response(ctx->dtl_handle, f); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should do a ftell on the file and get the size and then we can do the response back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. However, we already discussed doing this as part of a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should do this alongside #30
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with both of you.
.clang-format
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me this feels like just formatting change which switches the order. I would recommend to revert this as its unrelated to the actual change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There actually are a handful of actual changes to .clang-format
. Those changes resolve several complications/annoyances, particularly regarding the PerfFlow Aspect annotations messing up return type formatting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Ian's approach is reasonable to avoid manual formatting and productivity decrease although I do not like the solution for the long term. We want the annotation line separate from the function signature, and we want to avoid manual formatting. So far, there is no solution known to us. So, this is an alternative.
$(FLUX_CORE_LIBS) | ||
libdyad_core_la_CFLAGS = \ | ||
$(AM_CFLAGS) \ | ||
-I$(top_srcdir)/src/utils \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding -I for finding header files within such cases would make it harder to install the correct files later. The convension is to use till src only in -I so that other libraries can correctly link to DYAD at runtime.
Another general project level comment is that it is good to make distinction between public header files and private header files within a project.
This will help the installer only install the header files expected to be used by applications and rest would not be installed in include but would be compiled together in the so itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I absolutely agree with you @hariharan-devarajan, but I did this because we wanted to stay (relatively) consistent with Flux. This flux-core
repo does something similar to this in its Makefile.am
s.
However, regardless of what Flux does and how much we want to follow that, I also believe that any changes we might want to make regarding organization of files should belong in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we should put more effort to clean up installation and make distinction between public headers and private headers. This can be worked on in a followup PR.
#include <libgen.h> | ||
#include <unistd.h> | ||
|
||
#include "dyad_core.h" | ||
#include "dyad_flux_log.h" | ||
#include "dyad_dtl_impl.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The our internal header would change to refer from src
#include dtl/dyad_dtl_impl.h
#include utils/murmur3.h
and so on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I better understand what you're saying about the headers now that I read this comment, but, again, I believe such a change should be its own PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with both of you. It needs to be changed and should be done in a separate PR.
@@ -199,8 +142,7 @@ int open (const char *path, int oflag, ...) | |||
} | |||
|
|||
if (!(ctx && ctx->h) || (ctx && !ctx->reenter)) { | |||
IPRINTF (ctx, "DYAD_SYNC: open sync not applicable for \"%s\".\n", | |||
path); | |||
IPRINTF (ctx, "DYAD_SYNC: open sync not applicable for \"%s\".\n", path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use the same Logging mechanism DYAD_LOG_INFO
@@ -238,7 +180,8 @@ FILE *fopen (const char *path, const char *mode) | |||
} | |||
|
|||
if (!(ctx && ctx->h) || (ctx && !ctx->reenter) || !path) { | |||
IPRINTF (ctx, "DYAD_SYNC: fopen sync not applicable for \"%s\".\n", | |||
IPRINTF (ctx, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DYAD_LOG_INFO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should explore unifying the logging mechanisms in DYAD. I'll create an issue for that so we can track it as future work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created #44 to track this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current logging is carry-over from my old code, and it needs to be refreshed and cleaned up. Some of the things got partially removed over refactoring processes. We can discuss a better structure and approaches for future-proofness, succinctness, and clarity.
} | ||
// if (ctx == NULL) { | ||
// dyad_wrapper_init (); | ||
// } | ||
|
||
func_ptr = (fopen_ptr_t)dlsym (RTLD_NEXT, "fopen"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All this logic of moving to next symbol can be abstracted using macro for each case.
#define MAP_OR_FAIL(func_) \
if (!(real_##func_##_)) { \
real_##func_##_ = (real_t_##func_##_)dlsym(RTLD_NEXT, #func_); \
if (!(real_##func_##_)) { \
fprintf(stderr, "failed to map symbol\n"); \
} \
}
#else
#define MAP_OR_FAIL(func)
#endif
I recommend using GOTCHA tool which handles this more dynamically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that switching the GOTCHA in the future is the right move. That definitely doesn't belong in this PR though. If we don't have one yet, I'll create an issue for switching from "raw" LD_PRELOAD
to GOTCHA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue #43 has been created to track adding GOTCHA support
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It make sense to address it in a separate PR. @hariharan-devarajan please find time with us to explain the benefits from the users perspective as well as developer's.
* Abstracts DYAD module callback name to a const global variable in dyad_dtl_impl.h * Removes an unneeded comment from the Makefile.am in wrapper * Removes commented-out code that invokes dyad_wrapper_init from wrapper.c * Updates header guards to be consistant
This PR adds a new Data Transport Layer (DTL) to the DYAD module and DYAD Core library to allow us to support different ways to transfer data (i.e., transfer backends) between producer and consumer. Through this DTL, this PR also adds a new UCX-based transfer backend.
To set the backends, users will set the new
DYAD_DTL_MODE
environment variable. This variable must be set to the same value for both the clients (i.e., user applications that use the C or C++ APIs) and the DYAD module. Valid values are:FLUX_RPC
(default): uses Flux's RPC framework to transfer data. This is the same as beforeUCX
(default): uses UCX to transfer dataThe code for the DTL can be found in
src/core/dtl
(for DYAD Core's side) andsrc/modules/dtl
(for the DYAD module).This PR also adds a couple of QOL improvements. Most notably, it adds a new
dyad_init_env
function to the Core library. This function will initialize the DYAD context (dyad_ctx_t
) using the environment variables defined insrc/core/dyad_env.h
. By providing this function in Core, we can define a baseline environment variable-based initialization for all APIs.