Enable the Spark Operator to launch applications using user-defined mechanisms beyond the default spark-submit #2337
Comments
Thanks for the thoughtful proposal. If the Go-native way of creating the Spark driver is inherently faster than spark-submit, I would argue that it makes sense to use that as the default. Do you see any limitations compared to spark-submit?
@yuchaoran2011 - Thanks for your response. So there are a couple of reasons:
Please let me know if you have any more questions. Thank you.
All valid points. Maintaining feature parity with spark-submit will be a long-term effort that the community is willing to shoulder. Before contributing the code changes, I think the community can decide on a direction more easily if you can provide a document detailing the overall design (i.e. how the pluggable driver creation mechanism works) and the benchmarking you did comparing your implementation with the current code base. Tagging the rest of the maintainers for their thoughts as well: @ChenYi015 @vara-bonthu @jacobsalway
I think this is a great feature, but I still think it should be opt-in. It does work, but there were some minor differences which I don't recall right now (I believe it was something related to volume mounts on pod templates) that we had to implement ourselves. The point is that spark-submit behavior can differ a lot per configuration, so building a replacement that is equivalent for all usage types would require an ongoing effort. Also, for now the Spark Operator can support multiple Spark versions without any issues. With this approach, if spark-submit changed in the future, the Spark Operator would either need to know which Spark version it is running and change its behavior accordingly, or be coupled to a specific Spark version.
@yuchaoran2011 - We'll work on a design and share it with the community. Thanks.
Hey guys - we have the design doc ready. Please feel free to add any comments/feedback. Also, I added a link in the doc to a PR that demonstrates what the changes would eventually look like. Thanks. fyi @yuchaoran2011
@c-h-afzal Requested access. Could you open up read-only access for everyone so that the community can review and comment?
@yuchaoran2011 Ah, I'd love to, but given the doc is on a Salesforce account, company policy restricts public access. Let me check internally if there's a way for me to make this doc public. In the meanwhile, I have given you access.
Happy New Year guys! :) - Wondering if you got a chance to go through the doc and have any feedback to share? I have requested that the doc be made accessible without requesting access, but I have to jump through some red tape/approvals before it is free of the request-access wall. Thanks.
Hi @c-h-afzal - I read the design doc and it looks good.
Hi @c-h-afzal, could you please open comment access to your design doc for the Kubeflow Discuss Google group?
@andreyvelich - Done. Kindly see if you can access it now. My apologies, but I was unable to make it universally accessible given Salesforce internal policies. I can't change the access levels myself, and from talking to the support staff it doesn't seem like the company allows for easier accessibility than what the document already has.
@c-h-afzal Love the proposal! I've left a comment in the document. Would love to get more feedback from other contributors as well. On a similar topic, we're planning to conduct load tests on Spark Operator to test various scenarios and measure launch times under high concurrency of jobs. We've secured some capacity using AWS EKS for this purpose. We'll share the load test documentation with the broader community soon. Would you be able to provide examples where latency has been observed (e.g., number of concurrent jobs, total number of pods, etc.)? We plan to test these with the new v2.1.0 tag and repeat the same tests after future performance improvements like your proposal. @bnetzi Please mention your load scenarios as well. Thanks
Hi @vara-bonthu, so we've tested the following workload; the results were impressive enough for us to deploy the new operator on our production clusters, and we have been using it for two weeks now. We haven't saved the exact results, as it was obvious they were good enough for us, but I do have the summary. The workload was chosen to be similar to our peak production workload in terms of app creation rate per cluster:

Our demo spark app - Submit Rate:
Our specs: We are using an EKS cluster, so we can't know the actual k8s control plane size, but we were told by support that at the time of the test it was already scaled up to its max size. We don't know exactly what that means, but we haven't had k8s control plane issues (which might be the case for others, as we have a large QPS config as stated below).

Spark operator controller pod:
Worker queue config
Controller config

Results - in general everything ran smoothly without delays and all apps (just some demo apps) finished successfully, but there were some interesting outcomes. CPU usage hit 100% at peak (when we submitted a hundred apps in the span of a minute). For each app we had a service that measured the time between the Spark app record creation and the actual pod creation. We also checked the status of each app every 5 minutes and recorded its final status. For the most part the time for app creation was near zero; under the heaviest workload it went up to 15 seconds. Also, some apps finished successfully but their state was stuck on Running. We have a 'zombie killer' service in our prod env so it wasn't a concern for us, but it is probably something that should be investigated.
@vara-bonthu Apologies for the delayed response. For us the hardware setup was 130 nodes in the cluster. A single instance of Spark Operator was run on a single node with resource limits of 4 CPUs and 14 Gi memory. Next we ran different scenarios with the community version of the Spark Operator (SO) and our modified version. For each scenario we submitted the same Spark job, which computes the value of Pi and runs for a little over ~100 seconds. The rate of submission was 1 job per second via a script. We measured the median time it takes for each job to complete, from the time the SparkApplication CRD is created to the time it is marked as completed. Obviously, each job takes the same constant time to compute the value of Pi once scheduled, but the time it sits in the SO queue can change with the load. The scenarios we ran were:

50 jobs at the rate of 1 job per second
900 jobs at the rate of 1 job per second

We did run some other permutations of tests too, tinkering with command-line params etc., but each time we found that our modified version of the SO, which natively launches spark applications, outperformed the community version with spark-submit. We have arrived at the following conclusion from our tests:
The previous work by bnetzi tried to address the single-queue issue, and our modification alleviates the problem by helping the SO increase its rate of consumption. Sorry if this is unneeded/too much information, but this was really the gist/crux of our experimentation with the Spark Operator. Please let me know if you need any more information or clarification.
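In case it helps others reproduce this kind of measurement, below is a rough sketch of how the gap between SparkApplication creation and submission can be computed from the objects themselves. The status field names and client setup are assumptions based on the v1beta2 API, not the exact tooling used in these tests:

```go
// Rough sketch: list SparkApplications and print how long each one waited
// between CRD creation and the operator's recorded submission attempt.
// Field names (e.g. status.lastSubmissionAttemptTime) are assumptions.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)
	gvr := schema.GroupVersionResource{
		Group:    "sparkoperator.k8s.io",
		Version:  "v1beta2",
		Resource: "sparkapplications",
	}

	apps, err := client.Resource(gvr).Namespace("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, app := range apps.Items {
		created := app.GetCreationTimestamp().Time
		// Timestamp the operator recorded for the (last) submission attempt.
		if ts, found, _ := unstructured.NestedString(app.Object, "status", "lastSubmissionAttemptTime"); found && ts != "" {
			if submitted, err := time.Parse(time.RFC3339, ts); err == nil {
				fmt.Printf("%s: queued for %s before submission\n", app.GetName(), submitted.Sub(created))
			}
		}
	}
}
```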
@yuchaoran2011 @vara-bonthu - I was wondering if we have your blessing and buy-in for the proposed design? If yes, we can raise PRs for the feature. Also, on our end we are working to open-source our native submit plugin. Thanks.
@c-h-afzal - before addressing your important notes, I'd be happy to know whether you modified the Spark Operator configuration itself before running the tests. As for your remarks: first of all, I agree with the conclusion - the main issue with the Spark Operator is that it does not scale out, only up (vertically). I don't think my previous solution really solved it, as it was only a bypass to keep the single worker from getting stuck on a single mutex (which was the case for the previous Spark Operator version). The current operator handles it with a much better worker-threads mechanism, although it is CPU-intensive, mostly because of the JVM overhead. A real solution would be to separate one component that 'observes' from worker pods that act - probably one controller node and a bunch of worker pods communicating via an external queue and scaling out with KEDA according to CPU and queue metrics. I believe your solution is important, but it only reduces CPU and memory usage, which is significant, yet at some point we hit the same bottleneck. Let me know what you think.
@bnetzi - so we changed the controller-threads config to a higher number. But given we had only 4 CPUs, increasing the number of threads beyond 4 wouldn't help (the CPUs we used weren't hyper-threaded), but we still set it to double digits. That's the primary config setting we changed. The config "maxTrackedExecutorPerApp" wouldn't be helpful, as the Pi job spawns a single executor. We didn't experiment with the bucket* configs. Regarding the versions, we had a mix going on here. We were using a very old beta version in prod that hadn't been updated to the latest changes, so we tested that beta version (I can find the exact one if you are interested). Then we tested our modifications on a fork of this beta version. Then we also tested the community master branch with changes up to early Sept 2024. Finally, we tested a version with the PR you authored previously. By the way, we also tested with some JDK optimization flags to see if they would have an effect on JVM spin-up times, but the changes were insignificant. Yes, I understand the mutex part of your work, but it also indirectly addresses the single-queue shortcoming, though I agree it wasn't the solution. Your idea for a solution definitely sounds promising, but we'll have to think through alternatives, if any. This would likely be an involved rewrite of the Spark Operator. Yes, you have hit the nail on the head! Our solution is just "alleviating" the problem, not solving it. And yes, eventually, for a high enough load, we'll end up with the same bottleneck. Our solution best fits in the category of optimization. Appreciate you reading through and commenting :)
What feature would you like to be added?
Ability for the Spark Operator to adopt a provider pattern / pluggable mechanism for launching spark applications. A user can specify an option other than spark-submit to launch spark applications.
Why is this needed?
In the current implementation, the Spark Operator invokes spark-submit to launch a spark application in the cluster. From our testing we have determined that the penalty of the JVM spin-up causes a significant increase in job latency when the cluster is under stress/heavy load, i.e. the rate at which spark applications are enqueued is higher than the rate at which they are dequeued, causing the spark operator's internal queue to swell up and affect job latencies. We want to be able to launch spark applications using native Go, without the JVM spin-up that comes with spark-submit.
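To make the difference concrete, here is a minimal sketch contrasting the current spark-submit path with a native launch path that creates the driver pod directly through the Kubernetes API. The package, function, and argument names are simplified assumptions, not the operator's actual code:

```go
// Illustrative sketch only: the two launch paths being compared. Names and
// pod-spec details are assumptions, not the operator's actual implementation.
package launcher

import (
	"context"
	"os/exec"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Current path: shell out to spark-submit, which starts a JVM for every
// submission just to render and create the driver pod.
func launchViaSparkSubmit(args []string) error {
	return exec.Command("spark-submit", args...).Run()
}

// Proposed native path: build the driver pod spec in Go and create it directly
// through the Kubernetes API, avoiding the per-submission JVM spin-up.
func launchNatively(ctx context.Context, client kubernetes.Interface, ns string, driverPod *corev1.Pod) error {
	_, err := client.CoreV1().Pods(ns).Create(ctx, driverPod, metav1.CreateOptions{})
	return err
}
```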
Describe the solution you would like
The solution we are proposing (and are willing to contribute, if consensus can be reached) is for the Spark Operator to allow replacing its only current mechanism for launching spark applications (spark-submit) with a user-specified one. The default mechanism remains spark-submit. Users can specify their own plugin to launch spark applications in a different way. Specifically, here at the Salesforce Big Data Org, in our fork we create driver pods using Go and skip the JVM penalty. The work-around was devised by @gangahiremath and is mentioned in issue#1574.
Our work-around ports the functionality of spark-submit to Go and significantly reduces the time it takes for a SparkApplication CRD object to go from being CREATED to transitioning to the SUBMITTED state. If there's enough interest in our approach, we plan to open-source Ganga's work-around too.
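For illustration, a minimal sketch of what such a pluggable launch mechanism could look like in Go; the interface, type, and registry names here are hypothetical and not the final API proposed in the design doc:

```go
// Hypothetical sketch of the provider pattern described above; names are
// illustrative assumptions, not the proposed API.
package launcher

import (
	"context"
	"fmt"
)

// SparkApp stands in for the operator's SparkApplication object; a real
// implementation would accept the operator's v1beta2.SparkApplication type.
type SparkApp struct {
	Name      string
	Namespace string
}

// Submitter is the pluggable launch mechanism: the controller only calls
// Launch and does not care whether the driver is started via spark-submit,
// a native Go client, or something else entirely.
type Submitter interface {
	Launch(ctx context.Context, app *SparkApp) error
}

// registry maps a mechanism name (selected at operator startup, e.g. via a
// flag or Helm value) to its implementation; spark-submit stays the default.
var registry = map[string]Submitter{}

func Register(name string, s Submitter) { registry[name] = s }

func Get(name string) (Submitter, error) {
	s, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("unknown submitter %q", name)
	}
	return s, nil
}
```

With something along these lines, the default spark-submit path and a native Go path would simply be two Submitter implementations registered under different names.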
Describe alternatives you have considered
For improving latencies we have considered the pending PR, which claims a performance boost by having a single queue per app. However, we have not observed the claimed performance enhancements in our testing. We still find JVM spin-up times to be the bottleneck, hence this proposal.
Additional context
No response