S3Client instantiation extremely slow #2880
cc @fjetter
OK, so it seems that most of the time shown in the JSON parsing step of the profile graphs is spent parsing this JSON string hardcoded (!!) in the SDK's C++ source code:
It decodes into this heavily-nested object (here in Python representation):
Thanks for pointing this out. Currently it's behaving within expected boundaries, but I'm interested to hear more of your thoughts on this. How fast are you wanting/expecting the S3Client to instantiate? You shouldn't need to instantiate it that often, as it can be reused. I noticed in the issue linked above that you improved the performance of your tests to less than 53 µs with a "caching PoC". Were there any changes you were wanting on the SDK side?
I'll let @fjetter elaborate on their situation, but when distributing individual tasks over a cluster of workers there's a need to deserialize everything that's needed to run such tasks. If a task entails loading data over S3 (with potentially different configurations, since tasks from multiple users or workloads might be in flight), it implies recreating an S3Client each time. Depending on task granularity, I suspect 1 ms to instantiate an S3Client might appear as a significant contributor in performance profiles.
For the record, here's the current prototype that seems to work on our CI. I ended up caching endpoint providers based on the S3 client configuration's relevant options (the ones that influence provider initialization). There's an additional complication (…). It would probably have been simpler if I could have explicitly created a shared …
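A minimal sketch of that caching idea, not the actual prototype: clients (or endpoint providers) are reused per distinct set of endpoint-relevant configuration options, since those options are what drive the expensive rule-engine initialization. The helper names and the particular option keys below are illustrative assumptions.

```python
# Cache expensive-to-build clients, keyed only by the configuration
# options that actually influence endpoint provider initialization.
# ENDPOINT_RELEVANT_KEYS is a hypothetical subset, not the SDK's real list.
ENDPOINT_RELEVANT_KEYS = ("region", "endpoint_override", "use_dual_stack")

_client_cache = {}

def get_client(make_client, **options):
    """Return a cached client for the endpoint-relevant subset of options."""
    key = tuple((k, options.get(k)) for k in ENDPOINT_RELEVANT_KEYS)
    if key not in _client_cache:
        # Only pay the ~1 ms construction cost once per distinct key.
        _client_cache[key] = make_client(**options)
    return _client_cache[key]
```

Two calls with the same endpoint-relevant options then share one client, while a different region (for example) still gets its own instance.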
That sums it up nicely. Due to how Arrow and Dask are built, we end up instantiating possibly thousands of clients, adding up to a couple of seconds of latency whenever we try to read a dataset. We essentially end up creating one S3Client per file; reusing it is a little difficult at this point.
This is ultimately a feature request, so I will be relabeling this issue as a feature request. It's something we would like to speed up, but I don't have a timeline for when that might happen. In the short term I would recommend using a single endpoint resolver for all of your S3Clients. You can do this by overriding the resolver when you initialize each client.
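The shape of that recommendation can be sketched language-neutrally; `ExpensiveResolver` and `Client` below are stand-ins, not real SDK types (in the C++ SDK this would correspond to constructing one endpoint provider and passing it to each S3Client constructor):

```python
class ExpensiveResolver:
    """Stands in for the endpoint rule engine that is costly to build."""
    def __init__(self):
        self.built = True  # imagine parsing the large hardcoded ruleset here

class Client:
    """Stands in for an S3 client that accepts an injected resolver."""
    def __init__(self, resolver=None):
        # Reuse the caller-supplied resolver instead of building a new one.
        self.resolver = resolver if resolver is not None else ExpensiveResolver()

shared = ExpensiveResolver()  # built once, up front
clients = [Client(resolver=shared) for _ in range(1000)]  # cheap per client
```

The design point is simply that the costly initialization is hoisted out of the per-client constructor and amortized across all clients that share it.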
We have a proposed workaround now, but it's slightly more complicated than that:
Describe the bug
One of our users has reported that creating an S3Client instance can take up to 1 millisecond. According to profiler statistics, most of the time is spent in Aws::Crt::Endpoints::RuleEngine (see linked issue).

Expected Behavior
I would expect an S3Client to be reasonably fast to instantiate (less than a microsecond).

Current Behavior
See description and profile graph excerpt in linked issue.
Reproduction Steps
Using PyArrow:
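The issue's original reproduction snippet is not preserved here; a minimal timing sketch that exercises the same path might look like the following (the `region` value and the fallback branch are assumptions, and the fallback only exists so the sketch runs without pyarrow installed):

```python
import time

try:
    from pyarrow.fs import S3FileSystem

    def make_fs():
        # Each construction re-runs the SDK's endpoint resolution setup.
        return S3FileSystem(region="us-east-1")
except ImportError:
    def make_fs():
        return object()  # placeholder so the sketch runs without pyarrow

def time_constructions(factory, n=100):
    """Return average seconds per call to factory()."""
    start = time.perf_counter()
    for _ in range(n):
        factory()
    return (time.perf_counter() - start) / n

print(f"avg: {time_constructions(make_fs, n=10) * 1e6:.1f} µs per construction")
```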
Possible Solution
No response
Additional Information/Context
No response
AWS CPP SDK version used
1.11.267
Compiler and Version used
gcc 12.3.0
Operating System and version
Ubuntu 22.04