Spark 3.0 Support #15
On initial inspection, the issue with Spark 3.0 support seems to be a logging class in the connector. If replaced, the connector should function.
Hi Rahul, does the latest release support Spark 3.0.0?
@ravikd744 No, not yet. There is a PR in the works for Spark 3.0. Once it has been validated, we will update the repository, build, and README with the new support statement.
Thanks for working on this. We are eager to start using Spark 3 (in Databricks 7). There are lots of factors pushing us in that direction, and the lack of a SQL connector seems to be the only holdup at this time.
PR #30 is addressing this.
It would be really great if this connector supported 3.0. We are currently locked into using 3.0 but would like to use this connector.
Any update regarding this? This is a major blocker for us.
Would be really nice to have the upgrade! Blocker for us too. Thx guys
For those of you who are azure-databricks customers and are loading data into azure-sql, would you please contact tech support at Microsoft? There is no doubt that this is a breaking change for anyone who must upgrade to the azure-databricks runtime 7.x. At the very least they could provide a warning for us in the release notes. For some reason the azure-databricks team needs a bit of encouragement from us before they'll prioritize a fix in this connector. They don't seem to consider it a priority to support the fast, bulk-insert connector for SQL; currently they consider this a "third-party" interface. That same opinion seems to be expressed by both the "azure-databricks" team and the "databricks" team. It's odd that they don't really understand the requirement to be able to bulk insert from Spark dataframes; all you need is to google "spark sql bulk insert". Bulk insert technology in SQL Server has been around for decades, and Spark has a significant need of it. Otherwise we run into some silly and unnecessary bottlenecks on individual record insertions.
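For context, a bulk write through this connector looks roughly like the sketch below (format name and option keys per the repository README; the server, database, table, and credentials are placeholders, not real values):

```scala
// Minimal sketch of a bulk insert via the sql-spark-connector.
// The JDBC URL, table name, and credentials below are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bulk-insert-example").getOrCreate()
val df = spark.range(1000000).toDF("id")

df.write
  .format("com.microsoft.sqlserver.jdbc.spark") // the connector's data source name
  .mode("append")
  .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
  .option("dbtable", "dbo.example_table")
  .option("user", "<username>")
  .option("password", "<password>")
  .save()
```

This is exactly the kind of dataframe-level bulk load that plain JDBC, with its row-by-row inserts, bottlenecks on.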
Sorry to state the obvious, but my understanding is that this issue is being delayed. It won't get much attention until "SQL Server Big Data Clusters" (SSBDC) is ready to adopt Spark 3.0. I don't know much about it... can someone please point me to a roadmap for SQL Server Big Data Clusters? Am I right that it does not support Spark 3.0 yet? How long until its customers will be ready to use Spark 3.0? As far as azure-databricks goes, those guys don't seem to care much about this connector... or at least they are not in a position to ask for a connector which is compatible with Spark 3.0. So azure-databricks customers are forced to wait for SSBDC to catch up... hopefully that won't be very much longer!
Hi all, thanks for the comments; your feedback is received. Currently we do not have the necessary validation to confirm Spark 3.0 support. Before adding the functionality and creating a new version of the connector (a dedicated 3.0 version), we want to complete performance testing, runtime compatibility checks, etc. At this time we have no strict timeline for Spark 3.0 support. There is an open PR and fork that allow the connector to work with 3.0, as reported by a few customers, but we will refrain from officially merging it into the main branch until we have tested it thoroughly. We hear your feedback and hope to address it sooner rather than later.
What is the issue with Spark 3.0 support? I see comments complaining about Databricks, but is the issue with Databricks itself or Spark 3.0? This being a Microsoft connector, it seems that the onus lies with Microsoft to update the connector rather than with Databricks. Maybe someone can help me understand the technical issues with Spark 3.0 support. Now that the old "azure-sqldb-spark" connector is out of support, this "sql-spark-connector" is basically the only option going forward, but without Spark 3.0 support, it's basically dead in the water too. We really want to leverage the new performance features of Spark 3.0, like adaptive query execution (AQE), but are being held back by both of the SQL Server connector options Microsoft provides.
@traberc There is no real issue other than regression testing (aka "necessary validation"). The only substantial programming change is to target a newer version of Scala. To get this connector working, you need to download the code, open it in IntelliJ, remove the tests, edit the sbt build to target the correct version of Scala, and rebuild. Once this is done, you will have your own private copy of the module that should work fine, but you will have nobody else to support it. This is where I landed after many conversations with folks at databricks, azure-databricks, and here in the connector project. I think what Rahul is saying is that databricks is not in his wheelhouse. I think it is fair to say that this community will start to care more about Spark 3.0 support once SSBDC is ready to adopt Spark 3.0, and not before. You can read more at https://github.com/microsoft/sql-spark-connector

It is frustrating how hard it is for Microsoft to acknowledge that their "azure databricks" needs to properly interoperate with "azure SQL". IMHO this should not be a months-long debate. Another thing that Microsoft won't acknowledge is that this is a regression (as you pointed out). By definition, this is a regression in azure-databricks, since we had a bulk-load Spark connector in 2.4 and after upgrading to 3.0 we do not.

Things seem especially dysfunctional because there are three separate parties involved and everybody is dodging responsibility. The formal reasoning why databricks is dodging is that this is considered a "third-party" library. In addition to databricks itself, there is another large team at Microsoft called "azure-databricks", and they do a bit of the software development to ensure databricks can be called a "first-party" service in azure. They build the "glue" that holds databricks in place within the azure cloud, and they are also responsible for taking support calls. If these two teams ("databricks" and "azure-databricks") weren't enough, there is yet another team here in the community that is responsible for this connector, and this community project seems to be much more interested in SSBDC than in databricks. I've spent several months being bounced back and forth between these three different sets of folks. I strongly suggest you just be patient and wait for SSBDC to mature a bit more; otherwise you are likely to waste as much time on the topic as I have.

In addition to waiting for SSBDC to mature, I am eagerly looking forward to seeing how "Synapse Workspaces" will support the interaction between Spark and SQL. I can't imagine they won't have a bulk load connector, and they can't really avoid offering full support (like we are seeing with azure-databricks). Moreover, it is very possible that whatever connector they create will be compatible with Spark 3.0 (in databricks), so you will have an avenue to get support when you get in a pinch.
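As a rough illustration of the sbt edit described above, the change amounts to bumping the Scala and Spark targets in build.sbt (the version numbers here are illustrative assumptions that must match your cluster runtime, not the project's official values):

```scala
// build.sbt -- hypothetical edits for a private Spark 3.0 build.
// Versions are assumptions; pick the ones matching your runtime.
scalaVersion := "2.12.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.0.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "3.0.1" % "provided",
  "com.microsoft.sqlserver" % "mssql-jdbc" % "8.4.1.jre8"
)
```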
I'm not an expert, so hopefully you'll all forgive me for asking a basic question. What's unclear to me is what the "necessary validation" means. It sounded like a number of customers have been building the existing PR and using it successfully. Are there specific test cases that the PR doesn't pass? If so, what is causing the delay in resolving those failures and completing the testing work? As an Azure Databricks customer, it's been very frustrating that Microsoft has built a connector that is incompatible with the current major release of Spark. On one hand, they're offering two products, SQL Server and Databricks (with runtime 7.0+), both of which are allegedly "Azure" cloud services that Microsoft endorses, and one would think that endorsement would cover the current runtime releases of both products. On the other hand, they've failed to provide a connector that lets you use the two products together. The lack of movement here has prompted me to begin exploring alternative databases.
Synapse workspaces currently only use Scala to connect with Synapse SQL and only allow loading into a new table. It uses PolyBase under the hood, as opposed to bulk copy, so that will not help out here. The engineering team has been given feedback about this, and they hope to have both points fixed at some point.
@rajmera3 Azure Databricks 6.6 (the last one with Spark 2.x) is set for EOL on Nov 26. This is a very critical issue at this point.
So it's high prio now! Looking forward to running this on the latest DBR, as 7.4 has sooo many improvements over 6.6.
Spark 3 is critical, but it's worth noting that Databricks runtime 6.4, which uses Spark 2.4.5, goes EOL April 1st 2021 (poor choice of date).
Again, I can only recommend that you just compile it yourself from the PR and test it. It is not difficult using sbt. The CI build fails due to the broken pipeline, but the connector works just fine for me. I have a streaming application running in production for about a month on DBR 7.3 that continuously ingests data without issues. At least for the sink with default options, I am quite confident to say that if there were a major issue I would have hit it. But you have to test it in your dev/qa environment anyway.
I've made the move to build the (fat) JAR myself as well; it was actually easier than expected with the following command lines:
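Something along these lines, assuming the sbt-assembly plugin is configured in the project (the exact commands and output paths may differ):

```sh
# Hypothetical build steps; assumes sbt and the sbt-assembly plugin.
git clone https://github.com/microsoft/sql-spark-connector.git
cd sql-spark-connector
sbt assembly   # the fat JAR lands under target/scala-2.12/
```

The resulting JAR can then be uploaded as a cluster library on Databricks.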
This has been running smoothly on Databricks Runtime 7.4 | Spark 3.0.1 over the last few days. Since #30 was already opened in July and improvements have taken place in master since then (like the computed column fix we rely on), I had to create a new branch based on master and just pasted in the build.sbt file from #30. With that, I have the best of both. Thanks for the tip @MrWhiteABEX
Thanks @pmooij. My changes can be seen here: master...dovijoel:spark-3.0. Basically it's just changing SharedSQLContext to SharedSparkSession. I'm having success on Databricks Runtime 7.3 LTS | Spark 3.0.1 | Scala 2.12.
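For anyone making the same change: SharedSQLContext was replaced by SharedSparkSession in Spark 3.0's test utilities, so a test that previously mixed in the old trait now looks roughly like this (class and test names here are hypothetical):

```scala
// Hypothetical test showing the Spark 3.0 test-harness swap.
// Spark 2.4: class ConnectorSmokeTest extends QueryTest with SharedSQLContext
// Spark 3.0: mix in SharedSparkSession instead.
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSparkSession

class ConnectorSmokeTest extends QueryTest with SharedSparkSession {
  test("counts a small range") {
    val df = spark.range(10).toDF("value") // `spark` is provided by SharedSparkSession
    checkAnswer(df.selectExpr("count(*)"), Seq(Row(10L)))
  }
}
```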
Hi all, thanks for the patience as we worked on supporting Spark 3.0. If you notice any bugs or have any feedback, please file an issue!