This repository was archived by the owner on Feb 27, 2025. It is now read-only.

Spark 3.0 Support #15

Closed
rajmera3 opened this issue Jun 29, 2020 · 22 comments
Labels
enhancement (New feature or request) · high priority (High Priority Item)

Comments

@rajmera3
Contributor

No description provided.

@rajmera3
Contributor Author

rajmera3 commented Jul 8, 2020

On initial inspection, the issue with Spark 3.0 support appears to be a logging class in the connector. If that class is replaced, the connector should function.

@ravikd744

Hi Rahul, does the latest release support Spark 3.0.0?

@rajmera3
Contributor Author

rajmera3 commented Aug 5, 2020

@ravikd744 No, not yet. There is a PR in the works for Spark 3.0. Once it has been validated, we will update the repository, the build, and the README with the new support statement.

@dbeavon

dbeavon commented Aug 6, 2020

Thanks for working on this. We are eager to start using Spark 3 (in Databricks 7). There are lots of factors pushing us in that direction, and the lack of a SQL connector seems to be the only holdup at this time.

@shivsood
Collaborator

PR #30 is addressing this.

@briancuster

It would be really great if this connector supported 3.0. We are currently locked into using 3.0 but would like to use this connector.

@tkasu

tkasu commented Sep 14, 2020

Any update regarding this? This is a major blocker for us.

@sl2bigdata

Would be really nice to have the upgrade! It's a blocker for us too. Thanks, guys.

@dbeavon

dbeavon commented Sep 16, 2020

For those of you who are azure-databricks customers, and are loading data into azure-sql, would you please contact tech support at Microsoft?

There is no doubt that this is a breaking change for anyone who must upgrade to the azure-databricks runtime 7.x. At the very least they could provide a warning for us in the release notes.

For some reason the azure-databricks team needs a bit of encouragement from us before they'll prioritize a fix in this connector. They don't seem to consider it a priority to support the fast, bulk-insert connector for SQL. Currently they consider this a "third-party" interface. That same opinion seems to be expressed by both the "azure-databricks" team and the "databricks" team. It's odd that they don't really understand the requirement to be able to bulk insert from Spark dataframes. All you need to do is google "spark sql bulk insert".

Bulk insert technology in SQL Server has been around for decades, and Spark has a significant need for it. Otherwise we run into some silly and unnecessary bottlenecks on individual record insertions.

@dbeavon

dbeavon commented Oct 7, 2020

Sorry to state the obvious, but my understanding is that this issue is being delayed. It won't get much attention until "SQL Server Big Data Clusters" (SSBDC) is ready to adopt spark 3.0.

I don't know much about it... can someone please point me to a roadmap for SQL Server Big Data Clusters? Am I right that it does not support spark 3.0 yet? How long until its customers will be ready to use spark 3.0?

As far as azure-databricks goes, those guys don't seem to care much about this connector... or at least they are not in a position to ask for a connector that is compatible with Spark 3.0. So azure-databricks customers are forced to wait for SSBDC to catch up... hopefully that won't take much longer!

rajmera3 added and then removed the high priority label on Oct 7, 2020
@rajmera3
Contributor Author

rajmera3 commented Oct 7, 2020

Hi all,

Thanks for the comments; your feedback has been received.

Currently we do not have the necessary validation to confirm Spark 3.0 support. Before adding the functionality and creating a new version of the connector (a dedicated 3.0 version), we plan to do performance testing, runtime compatibility checks, etc.

At this time we have no strict timeline for Spark 3.0 support. There is an open PR and a fork that allow the connector to work with 3.0, as reported by a few customers, but we will refrain from officially merging it into the main branch until we have tested it thoroughly.

We hear your feedback and hope to address it sooner rather than later.

@traberc

traberc commented Oct 7, 2020

What is the issue with Spark 3.0 support? I see comments complaining about Databricks, but is the issue with Databricks itself or Spark 3.0? This being a Microsoft connector, it seems that the onus lies with Microsoft to update the connector rather than with Databricks. Maybe someone can help me understand the technical issues with Spark 3.0 support.

Now that the old "azure-sqldb-spark" connector is out of support, this "sql-spark-connector" is basically the only option going forward, but without Spark 3.0 support, it's basically dead in the water too.

We really want to leverage the new performance features of Spark 3.0, like adaptive query execution (AQE), but are being held back by both of the available SQL Server connector options provided by Microsoft.

@dbeavon

dbeavon commented Oct 8, 2020

@traberc
To see the necessary code changes you can go look at the PR (#30). There are a few lines of changes.

There is no real issue other than regression testing (aka "necessary validation").

The only substantial programming change is to target a newer version of scala.

To get this connector working, you need to download the code, open it in IntelliJ, remove the tests, edit the sbt build to target the correct version of Scala, and rebuild. Once this is done, you will have your own private copy of the module that should work fine, but you will have nobody else to support it. This is where I landed after many conversations with folks at databricks, azure-databricks, and here in the connector project.
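For reference, a minimal sketch of the kind of sbt change involved, assuming a Spark 3.0.x / Scala 2.12 target (the setting names, versions, and JDBC driver version below are illustrative, not taken from the repo's actual build.sbt):

    // build.sbt (sketch) - retarget the build at Scala 2.12 and Spark 3.0.x
    scalaVersion := "2.12.12"

    val sparkVersion = "3.0.1"

    libraryDependencies ++= Seq(
      // Spark itself is provided by the Databricks runtime / cluster
      "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
      "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided",
      // SQL Server JDBC driver used by the connector (version is an assumption)
      "com.microsoft.sqlserver" % "mssql-jdbc" % "8.4.1.jre8"
    )

With something like that in place, sbt package (or sbt assembly for a fat JAR, as described further down the thread) produces a 2.12 artifact that can be attached to a Spark 3 cluster.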

I think what Rahul is saying is that databricks is not in his wheelhouse. I think it is fair to say that this community will start to care more about the topic (spark 3.0 support) once SSBDC is ready to adopt spark 3.0, and not before. You can read more at https://github.com/microsoft/sql-spark-connector

It is frustrating how hard it is for Microsoft to acknowledge that their "azure databricks" needs to properly interoperate with "azure SQL". IMHO this should not be a months-long debate. Another thing that Microsoft won't acknowledge is that this is a regression (as you pointed out). By definition, this is a regression in azure-databricks since we had a bulk-load spark connector in 2.4 and after upgrading to 3.0 we do not.

Things seem especially dysfunctional because there are three separate parties involved and everybody is dodging responsibility. The formal reason databricks gives for dodging is that this is considered a "third-party" library.

In addition to databricks itself, there is another large team at Microsoft called "azure-databricks"; they do a bit of the software development to ensure databricks can be called a "first-party" service in azure. They build the "glue" that holds databricks in place within the azure cloud, and they are also responsible for taking support calls. If these two teams ("databricks" and "azure-databricks") weren't enough, there is yet another team here in the community that is responsible for this connector, and this community project seems to be much more interested in SSBDC than in databricks.

I've spent several months being bounced back and forth between these three different sets of folks. I strongly suggest you just be patient and wait for SSBDC to mature a bit more. Otherwise you are likely to waste as much time on the topic as I have.

In addition to waiting for SSBDC to mature, I am eagerly looking forward to seeing how "Synapse Workspaces" will support the interaction between Spark and SQL. I can't imagine they won't have a bulk load connector. And they can't really avoid offering full support (like we are seeing with azure-databricks). Moreover it is very possible that whatever connector they create will be compatible with Spark 3.0 (in databricks), so you will have an avenue to get support when you get in a pinch.

@gmdiana-hershey

I'm not an expert, so hopefully you'll all forgive me for asking a basic question. What's unclear to me is what the "necessary validation" means. It sounds like a number of customers have been building the existing PR and using it successfully. Are there specific test cases that the PR doesn't pass? If so, what is causing the delay in resolving those failures and completing the testing work?

As an Azure Databricks customer, it's been very frustrating that Microsoft has built a connector that is incompatible with the current major release of Spark. On one hand, they're offering two products - SQL Server and Databricks (with runtime 7.0+). Both of these are allegedly "Azure" cloud services that Microsoft endorses, and one would think that endorsement would include the current runtime releases of both products. On the other hand, they've failed to provide a connector that lets you use the two products together. The lack of movement here has prompted me to begin exploring alternative databases.

@B4PJS

B4PJS commented Oct 20, 2020

@dbeavon

In addition to waiting for SSBDC to mature, I am eagerly looking forward to seeing how "Synapse Workspaces" will support the interaction between Spark and SQL. I can't imagine they won't have a bulk load connector. And they can't really avoid offering full support (like we are seeing with azure-databricks). Moreover it is very possible that whatever connector they create will be compatible with Spark 3.0 (in databricks), so you will have an avenue to get support when you get in a pinch.

Synapse workspaces currently only support Scala for connecting to Synapse SQL and only allow loading into a new table. The connector uses PolyBase under the hood as opposed to bulk copy, so that will not help out here.

The engineering team has been given feedback about this, and they hope to have both points fixed at some point...

@wboleksii

@rajmera3 Azure Databricks 6.6 (the last one with Spark 2.x) is set for EOL on Nov 26. This is a very critical issue at this point.

@pmooij

pmooij commented Nov 21, 2020

@rajmera3 Azure Databricks 6.6 (the last one with Spark 2.x) is set for EOL on Nov 26. This is a very critical issue at this point.

So it's high prio now! Looking forward to running this on the latest DBR, as 7.4 has so many improvements over 6.6.

@dazfuller

dazfuller commented Nov 22, 2020

Spark 3 is critical, but it's worth noting that Databricks Runtime 6.4, which uses Spark 2.4.5, goes EOL on April 1st 2021 (a poor choice of date).

Azure Databricks Runtimes

@MrWhiteABEX

Again, I can only recommend just compiling it yourself from the PR and testing it. It is not difficult using sbt. The CI build fails due to the broken pipeline, but the connector works just fine for me. I have had a streaming application running in production for about a month on DBR 7.3 that continuously ingests data without issues. At least for the sink with default options, I am quite confident that if there were a major issue I would have hit it. But you have to test it in your dev/qa environment anyway.
The automated tests of this connector are lackluster. They would not detect any incompatibilities in Scala or Spark versions. They hardly scratch the surface.
Spark 3 is a major improvement. For my workload, moving to DBR 7.3 (coming from 6.5) allowed me to reduce my job cluster size and save about $1500 (about 30-40%) per month. At least for me it was worth the additional effort.

@pmooij

pmooij commented Dec 9, 2020

I've made the move to build the (fat) JAR myself as well; it was actually easier than expected with the following commands:

  1. choco install intellijidea-community
  2. choco install sbt
  3. sbt assembly

This has been running smoothly on Databricks Runtime 7.4 | Spark 3.1 over the last few days.

Since #30 was opened back in July and improvements have landed on master since then - like the computed column fix we rely on - I created a new branch based on master and just pasted in the build.sbt file from #30. With that, I have the best of both.

Thanks for the tip, @MrWhiteABEX.
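For context, sbt assembly comes from the sbt-assembly plugin; in a generic sbt project it is wired up roughly like this (illustrative boilerplate shown as an assumption, not taken from this repo's build):

    // project/plugins.sbt - add the sbt-assembly plugin (version is illustrative)
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.15.0")

    // build.sbt - resolve duplicate files when merging dependencies into the fat JAR
    assemblyMergeStrategy in assembly := {
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard
      case _                             => MergeStrategy.first
    }

After sbt assembly, the resulting JAR under target/scala-2.12/ can be uploaded to the cluster as a library.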

@ZMon3y

ZMon3y commented Jan 8, 2021

I've made the move to build the (fat) JAR myself as well; it was actually easier than expected with the following commands:

  1. choco install intellijidea-community
  2. choco install sbt
  3. sbt assembly

This has been running smoothly on Databricks Runtime 7.4 | Spark 3.1 over the last few days.

Since #30 was opened back in July and improvements have landed on master since then - like the computed column fix we rely on - I created a new branch based on master and just pasted in the build.sbt file from #30. With that, I have the best of both.

Thanks @pmooij
This worked for me as well with one minor change to src/test/scala/com/microsoft/sqlserver/jdbc/spark/bulkwrite/DataSourceTest.scala

which can be seen here: master...dovijoel:spark-3.0

Basically just changing SharedSQLContext to SharedSparkSession
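Roughly, the change amounts to swapping the Spark 2.x test helper trait for its Spark 3.x counterpart, something like the sketch below (class names other than SharedSQLContext/SharedSparkSession are assumptions; the real DataSourceTest.scala will differ in detail):

    // Before (Spark 2.x test utilities):
    //   import org.apache.spark.sql.test.SharedSQLContext
    //   class DataSourceTest extends QueryTest with SharedSQLContext { ... }

    // After (Spark 3.x test utilities):
    import org.apache.spark.sql.QueryTest
    import org.apache.spark.sql.test.SharedSparkSession

    class DataSourceTest extends QueryTest with SharedSparkSession {
      // existing test cases are unchanged; SharedSparkSession supplies the `spark` session
    }

Both helpers come from the spark-sql test-jar, so the test dependency itself stays the same.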

I'm having success in Databricks Runtime 7.3 LTS | Spark 3.0.1 | Scala 2.12

@rajmera3
Contributor Author

Hi all,

Thanks for your patience as we worked on supporting Spark 3.0.
We have released a preview version of an Apache Spark 3.0 compatible connector on Maven!
The README has more information, but the connector is available at the Maven coordinates com.microsoft.azure:spark-mssql-connector_2.12_3.0:1.0.0-alpha.

If you notice any bugs or have any feedback, please file an issue!
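For anyone trying the preview, the dependency and a basic write look roughly like this (the format name and options follow the repo README; the server, database, table, and credentials are placeholders):

    // build.sbt - preview Spark 3.0 connector
    libraryDependencies += "com.microsoft.azure" % "spark-mssql-connector_2.12_3.0" % "1.0.0-alpha"

    // Scala usage sketch (placeholder connection details)
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val df = spark.range(10).toDF("value")   // any DataFrame to be written

    df.write
      .format("com.microsoft.sqlserver.jdbc.spark")
      .mode("append")
      .option("url", "jdbc:sqlserver://<server>.database.windows.net;databaseName=<database>")
      .option("dbtable", "dbo.MyTable")
      .option("user", "<username>")
      .option("password", "<password>")
      .save()

On Databricks, the same Maven coordinate can also be attached to the cluster as a library instead of building anything locally.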
