Adds a dedicated Python API #27
Conversation
Is this PR based off the unmerged DTL PR? I'm asking because it is difficult to see which changes are relevant to the purpose of this PR.
I'm really excited to try this out! I have mostly understanding questions at this point, and also want to ask about next steps / the stage of the PR.
pydyad/pydyad/bindings.py (Outdated)
dyad_core_libpath = None
self.cons_path = None
self.prod_path = None
if DYAD_LIB_DIR is None:
Apologies in advance if I'm just asking bad questions - this is the first time I'm looking here (and I'm excited to try this out). Is there any reason the user is required to export LD_LIBRARY_PATH? Can we not do the same as the linker and look in the ld.so config or default library paths? I read up on this sometime last year and was using this class to derive paths: https://github.com/vsoch/elfcall/blob/dc7383ecd6386cff9927bbf4b3b65335a45f97f4/elfcall/main/ld.py#L37
TBH, this was just an arbitrary decision so that I'd have an initial way to locate libdyad_core.so. I can definitely replace this with a linker-style search.
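For what it's worth, Python's standard library can already approximate a linker-style lookup. Here is a minimal sketch, assuming only ctypes.util.find_library plus an LD_LIBRARY_PATH fallback (the helper name is illustrative, not part of pydyad):

```python
import os
from ctypes.util import find_library
from pathlib import Path

def find_dyad_core():
    # find_library consults roughly the same locations as the dynamic
    # linker (ld.so cache and default library paths) on most platforms.
    name = find_library("dyad_core")
    if name is not None:
        return name  # a name suitable for ctypes.CDLL
    # Fall back to scanning LD_LIBRARY_PATH entries directly.
    for d in os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep):
        candidate = Path(d) / "libdyad_core.so"
        if d and candidate.is_file():
            return str(candidate)
    raise FileNotFoundError("could not locate libdyad_core.so")
```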
Awesome! Feel free to use the class in elfcall - it's published on PyPI.
pydyad/pydyad/context.py (Outdated)
if not isinstance(args[0], Path):
    fname = Path(args[0])
fname = fname.expanduser().resolve()
if kwargs["mode"] in ("r", "rb", "rt"):
I'm new to Dyad, could you walk me through what is happening here? I'm trying to follow the test case where you submit a job with flux, and you create a consumer and a producer (different jobs, but pointed at the same path?). So are they just writing / reading data from the same paths (assuming a shared filesystem), or is there a way to use some kind of different protocol? And is flux always required? And finally, how does the io.open object relate to local_dyad_io?
Thanks for answering my questions! I hope I can help when I better understand the high level stuff.
Thanks for asking them and thanks for the comments!
So, in general, the control- and data-flow works like this:

Producer Side:
1. Producer opens a file in write-only mode (we currently don't support append or R/W, although we do have plans for the latter in Support ownership transfer for RW modes #29)
2. Producer closes that file
3. DYAD checks if the file is within its "producer-managed path" (set by the user via the DYAD_PRODUCER_PATH environment variable)
   a. If the file is not in the producer-managed path, nothing extra happens (besides the file close)
4. DYAD registers information about the file in the Flux KVS (currently, it's just the file path relative to the producer-managed path as key and the Flux rank as value)

Consumer Side:
1. Consumer opens a file in read-only mode
2. DYAD checks if the file is within its "consumer-managed path" (set by the user via the DYAD_CONSUMER_PATH environment variable)
   a. If the file is not in the consumer-managed path, nothing extra happens (besides the file open)
3. DYAD looks for an entry in the Flux KVS corresponding to the file. If there is no entry available, DYAD will block until an entry is added. This ensures the consumer is synchronized on data production (can't run a piece of code if its input isn't available, after all)
   a. If the KVS entry says that the consumer and producer are on the same node (as indicated by the Flux rank), the rest of the DYAD-specific steps are skipped
4. DYAD creates an RPC to invoke dyad.fetch on the Flux broker running on the producer's node
5. The dyad.fetch callback reads the file (currently all at once, but we intend to change that in the future) and transmits it back to the consumer using DYAD's Data Transport Layer, or DTL (added as part of the soon-to-be-merged PR Adds a Data Transport Layer to DYAD to support different ways of transferring data #24). By default, the DTL will use UCX for the actual communication
6. DYAD writes the file into the consumer's storage
7. The actual "open" is called
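To make that flow concrete, here is a minimal usage sketch, assuming the pydyad import path from this PR and managed paths of /tmp/prod and /tmp/cons (all values illustrative):

```python
from pydyad import dyad_open  # import path assumed from this PR's layout

# Producer job: assumes DYAD_PRODUCER_PATH=/tmp/prod.
payload = bytes(range(256))
with dyad_open("/tmp/prod/step0.dat", mode="wb") as f:
    f.write(payload)
# On close, DYAD registers the file (relative path -> Flux rank) in the KVS.

# Consumer job: assumes DYAD_CONSUMER_PATH=/tmp/cons.
with dyad_open("/tmp/cons/step0.dat", mode="rb") as f:
    data = f.read()
# The open blocks until the producer's KVS entry appears; DYAD then fetches
# the file via the dyad.fetch RPC and the DTL before the real open runs.
```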
Because of this, DYAD can work with either local or shared storage. The behavior above is what happens when using local storage. When using shared storage, we run all the steps on the producer side, but only steps 1-3 and 7 on the consumer side.
Currently, the user has to tell us when they are using shared storage via the DYAD_SHARED_STORAGE environment variable. However, my project this summer has been to automate that decision making using an abstraction of the storage hierarchy that I'm calling the "storage graph". The initial prototype implementation of this is almost done. I'm just running into some symbol export issues that refuse to be fixed by libtool's -export-symbols flag.
Oh this is so cool! So the actual file open and path are mostly just for checking permissions or ownership of the path, and the real content is in flux kvs.
I know you have a little more work to do, but I’ll start a testing setup of this branch in the flux operator soon to at least try to reproduce the test. Also, your description above was hugely helpful - if you don’t have it in a README or docs somewhere I highly recommend adding it.
And it probably follows in the future that something else other than flux kvs could be used for that layer, so we could potentially run this without flux as a dependency? I’m hugely interested in this for Kubernetes!
Most of this functionality is implemented in DYAD's core library (i.e., src/core in the source code and libdyad_core.so in an install). However, anything tied to the APIs is implemented as part of those APIs.
This is why, in dyad_open, we have the two conditionals wrapping the consume and produce calls. The conditional around consume (starting on line 29) is essentially performing steps 1 and 2 for the consumer side. Similarly, the conditional around produce (starting on line 38) is performing step 3 and the mode check of step 1 for the producer side.
Overall, what dyad_open is doing is this:
1. Either create or use an instance of our Python FFI class (i.e., the Dyad class from bindings.py). The decision is based on the values of dyad_ctx and register_dyad_ctx
2. Convert the filename (provided as the first positional argument to the function) to a pathlib.Path for convenience
3. Check if we should run DYAD's consumer-side code, and, if we should, invoke the corresponding call from libdyad_core.so through the FFI
4. Perform the actual file I/O. We use io.open to call the actual Python open function, and then we use the try-finally block to allow dyad_open to be used as a context manager (i.e., to allow it to be used in with statements)
5. Check if we should run DYAD's producer-side code, and, if we should, invoke the corresponding call from libdyad_core.so through the FFI
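Put together, a rough structural sketch of dyad_open (not the actual implementation; the signature, defaults, and mode handling are assumptions based on the description above):

```python
import io
from contextlib import contextmanager
from pathlib import Path

from pydyad.bindings import Dyad  # per step 1; exact import path assumed

@contextmanager
def dyad_open(*args, dyad_ctx=None, register_dyad_ctx=False, **kwargs):
    # Step 1: create or reuse the FFI object (a module-global in reality).
    ctx = dyad_ctx if dyad_ctx is not None else Dyad()
    # Step 2: normalize the filename to a pathlib.Path.
    fname = Path(args[0]).expanduser().resolve()
    mode = kwargs.get("mode", "r")
    # Step 3: consumer-side DYAD call (may block and fetch the file).
    if mode in ("r", "rb", "rt"):
        ctx.consume(str(fname))
    # Step 4: the real I/O; try-finally makes this usable in `with` blocks.
    f = io.open(fname, mode=mode)
    try:
        yield f
    finally:
        f.close()
        # Step 5: producer-side DYAD call after the file is closed.
        if mode in ("w", "wb", "wt"):
            ctx.produce(str(fname))
```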
Now that I'm out of info dump mode, let me actually answer the specific questions you asked 😅
Q: "So are they just writing / reading data from the same paths (assuming a shared filesystem) or is there a way to use some kind of different protocol?"
A: That depends on the values provided to the environment variables DYAD_PRODUCER_PATH, DYAD_CONSUMER_PATH, and DYAD_SHARED_STORAGE. You can think of the producer/consumer paths as the directories DYAD tracks to detect production/consumption events. What actual storage resources they point to is completely up to the user. So, they can point to shared storage (e.g., Lustre) or local storage (e.g., tmpfs, node-local SSD, or even El Cap Rabbit XFS). Additionally, we currently require the user to tell us if they are pointing to local or shared storage by setting DYAD_SHARED_STORAGE (if set, storage is shared; otherwise, storage is local). This will be automated in the near future, though.
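For illustration, a job might be configured like this (path values are made up; set these before DYAD initializes):

```python
import os

# The managed paths can point at shared storage (e.g., Lustre) or
# node-local storage (e.g., tmpfs or an SSD mount); values illustrative.
os.environ["DYAD_PRODUCER_PATH"] = "/p/lustre/mydir/dyad/prod"
os.environ["DYAD_CONSUMER_PATH"] = "/p/lustre/mydir/dyad/cons"
# DYAD_SHARED_STORAGE is presence-checked: set it (to anything) for
# shared storage; leave it unset for node-local storage.
os.environ["DYAD_SHARED_STORAGE"] = "1"
```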
Q: "And is flux always required?"
A: Yes. Right now, DYAD uses Flux's KVS and RPC services to perform (1) information sharing and (2) control messaging for data transfer respectively. There has been some talk about looking for other options, particularly for the KVS service. But, we have no concrete plans to replace either the KVS or RPC at this time, and there are no plans to move away from Flux entirely.
Q: "And finally, how does the io.open object to relate to local_dyad_io?"
A: They aren't really related in any way. Essentially, local_dyad_io
is used to perform the DYAD operations, and io.open
is used to perform the "real" file I/O operations. The key detail is that dyad_open
as a whole is essentially the glue that ensures that DYAD operations are correctly called (using local_dyad_io
) at the correct points in the file I/O process.
Also, thanks to writing all this out, I realized that the path checks on lines 30-31 and 39-40 are redundant because they are already performed in libdyad_core.so. I'll change that.
Good catch! I’m always happy to serve as the rubber duck! 🦆 😆
Regarding the dependency on Flux for KVS and RPC, options being considered include FoundationDB and gRPC.
(Force-pushed from ef60cd8 to acbd0f4)
Is this still a work-in-progress PR?
All that's left is testing to make sure it works correctly. Haven't gotten to that yet.
(Force-pushed from 983303e to 3fd1e60)
@JaeseungYeom @hariharan-devarajan @vsoch now that the DTL PR is merged, I've fixed up this PR. It's now fully ready for review.
tests/pydyad_spsc/run.sh (Outdated)
# FLUX: --output=pydyad_spsc_test.out
# FLUX: --error=pydyad_spsc_test.err

DYAD_INSTALL_LIBDIR="/g/g90/lumsden1/ws/insitu_benchmark/dyad/_test_install/lib"
This should be an environment variable. Then, check if the directory and a representative file exist under it.
I've fixed this.
tests/pydyad_spsc/run.sh (Outdated)
DYAD_INSTALL_LIBDIR="/g/g90/lumsden1/ws/insitu_benchmark/dyad/_test_install/lib"
KVS_NAMESPACE="pydyad_test"
CONS_MANAGED_PATH="/l/ssd/lumsden1/pydyad_test"
Same with these variables - no hardcoded paths specific to your account.
At the least, you can rely on "${USER}" instead of your id. However, that still does not work if there is no /l/ssd available.
These paths were hardcoded because this was mainly my script for testing on Corona. I've changed it to be environment variable-based.
EOM

flux kvs namespace create $KVS_NAMESPACE
flux exec -r all flux module load $DYAD_INSTALL_LIBDIR/dyad.so \
This is misleading, although I know it works and why you do it this way.
There are multiple options we can look into, and I mentioned this to Hari as well.
Option 1: Allocate two nodes and explicitly launch modules for the producer and consumer on exclusive ranks with corresponding managed paths. You may run commands via flux exec -r. However, the tricky part is that you need to run one of the producer and consumer in the background if you were to run them concurrently.
Option 2: Create a script for the producer and another one for the consumer. Inside each script, check whether the dyad module exists and launch one if not, with the corresponding managed path.
In general, we may assume the same path for consumer and producer. This is the typical use case we envision. However, it does not work when testing on a single node. That is why you are using the producer's managed path.
Does this PR require rebasing?
Edit: NVM. I see it has been rebased.
Given the urgency, I will go over the PR quickly and merge unless there is anything critical.
However, just at a glance, I see a couple of things to improve on in the future.
- The check on the data file should be stronger. Currently, the only check is whether the size of each file is the same. The check should look into the content of each file, which should be based on some level of randomization. Consider creating the integer sequence from a random initial value drawn from a set of seeds agreed upon between the producer and the consumer. I am not suggesting generating every number randomly, but only the first value in each file. Each data file should be generated using a different seed for its first value. Then, have the consumer check whether the content matches the expectation, or at least whether the hash is the same (see the sketch after this list).
- The CI tests are borrowed from the tutorial, and this interface has a separate directory under test. This needs to be better organized. Perhaps create a directory for C and C++ under test as well.
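A sketch of the seeded check suggested in the first point (helper names are hypothetical; only each file's first value is random, the rest of the sequence is deterministic):

```python
import hashlib
import random

def make_payload(seed: int, n_ints: int = 1024) -> bytes:
    # Only the first value is random (drawn from the agreed seed); the
    # remaining integers follow deterministically from it.
    first = random.Random(seed).randint(0, 2**31 - 1)
    seq = ((first + i) % 2**31 for i in range(n_ints))
    return b"".join(i.to_bytes(4, "little") for i in seq)

def producer_write(path: str, seed: int) -> None:
    with open(path, "wb") as f:
        f.write(make_payload(seed))

def consumer_check(path: str, seed: int) -> bool:
    # Comparing hashes is enough; the consumer regenerates the expected
    # bytes from the same seed rather than shipping a reference copy.
    expected = hashlib.sha256(make_payload(seed)).digest()
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).digest() == expected
```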
Given the urgency, I will just merge this as is. However, I left some comments for further improvements.
This PR adds a dedicated Python API to DYAD. This API will not be built/installed by Autotools. Instead, it will use setuptools/pip for its build/install process. This will allow us to host the Python API on PyPI if desired.
In terms of the interface, there are only 3 things a user may have to interact with:
- the dyad_open function
- the dyad_open_local function
- the Dyad class

Most users will only use dyad_open. This function is a drop-in replacement for the built-in open function. Internally, this function is a "context manager" (see this page for more info). It creates a global Dyad object (if not created previously) and uses the Dyad object and the built-in open function to carry out data production/consumption. When using dyad_open, DYAD will always be configured with environment variables.
The dyad_open_local function is provided to allow users to programmatically configure DYAD. When using dyad_open_local, users will first manually create a Dyad object and initialize it with the init or init_env methods. Then, users will invoke dyad_open_local. This function takes the same arguments as the built-in open plus an additional positional argument called dyad_ctx. This argument is the last required positional (i.e., it comes after all positionals from the built-in open function), and it accepts a pre-initialized Dyad object.
This PR is currently work-in-progress. The API is mostly complete, but packaging (i.e., setuptools stuff) and basic testing are still required.