Spark 3.5.3 support #1178
Conversation
@dotnet-policy-service agree
Can you share how many of the unit tests pass? The UDF unit tests have not been updated.
Hello @GeorgeS2019, they do. I saw your issue; my environment probably uses UTF-8 by default.
What is the status of this PR?
It works, the tests pass, and performance-wise it's the best solution I've found for integrating .NET with Spark. The next steps are on Microsoft's side. I'm also working on implementing CoGrouped UDFs, and I plan to push those updates here as well.
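For context, the grouped-map path that already exists in Microsoft.Spark hands each group to a .NET function as an in-memory Microsoft.Data.Analysis.DataFrame; CoGrouped UDFs extend the same idea to pairs of groups from two DataFrames. A minimal sketch of the existing grouped-map API (assuming the standard Microsoft.Spark and Microsoft.Data.Analysis packages; the identity transform is only for illustration):

```csharp
using System.Collections.Generic;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;
using FxDataFrame = Microsoft.Data.Analysis.DataFrame;

SparkSession spark = SparkSession.Builder().GetOrCreate();

DataFrame df = spark.CreateDataFrame(
    new List<GenericRow>
    {
        new GenericRow(new object[] { "a", 1 }),
        new GenericRow(new object[] { "a", 2 }),
        new GenericRow(new object[] { "b", 3 })
    },
    new StructType(new[]
    {
        new StructField("key", new StringType()),
        new StructField("value", new IntegerType())
    }));

// Grouped-map UDF: all rows for one key arrive together as an in-memory
// DataFrame. A CoGrouped UDF would instead receive one such batch from
// each of two cogrouped DataFrames.
DataFrame result = df.GroupBy("key").Apply(
    new StructType(new[]
    {
        new StructField("key", new StringType()),
        new StructField("value", new IntegerType())
    }),
    (FxDataFrame batch) => batch); // identity transform, for illustration
```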
Can you investigate whether your solution works in a polyglot .NET Interactive notebook? Previously we all had problems with UDFs after the migration to .NET 6.
Any idea who is "in charge" of this repo?
I can take a look, but only if a lonely evening with bad weather rolls around :) No promises, as this isn't my primary focus. There are two suggestions from developers that might help: the first is a separate code cell, and the second is a separate environment variable. Have you tried both approaches, and does the issue still persist?
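The environment-variable suggestion may refer to `DOTNET_SPARK_RUNNING_IN_NOTEBOOK` from the dotnet/spark UDF guidance; that this is the variable meant here is an assumption. The idea is to set it in an early cell, before any cell that defines a UDF:

```csharp
using System;

// Assumed variable name from the dotnet/spark notebook/UDF guidance; set it
// before any UDF is defined so the worker can resolve notebook-compiled assemblies.
Environment.SetEnvironmentVariable("DOTNET_SPARK_RUNNING_IN_NOTEBOOK", "true");
```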
Hi Ihor (@grazy27), thanks for the contribution! I recently got write permission for this repo and am happy to move this forward. Due to limited bandwidth on our team and other priorities, we don't have concrete work items for this project, but we can review your code and work together to move this forward!
Hello Dan (@wudanzy), that's fantastic news! I'd be happy to help with a few more issues to get this project back on track. In my opinion, the most important ones are:
Thanks for sharing that!
Can we split this PR a little bit? That would speed up the review.
}
catch (Exception)
{
    // It tries to delete a non-existent file, but other than that it's OK.
Are those exceptions expected? If we expect them, we could add some logic here; if not, we should fail the test in such cases.
My logic here is that since nothing related to this API changed inside Dotnet.Spark (it just calls AddArchive on the JVM SparkContext, and the archive is added successfully), it must be an internal bug in Spark itself.
I tested it with Scala directly, and it fails with the same exception.
I plan to test it more and report it to Spark later.
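A hedged repro sketch of the call path in question (assuming the AddArchive wrapper this PR exposes, which simply forwards to the JVM SparkContext.addArchive introduced in Spark 3.1; the archive path is hypothetical):

```csharp
using Microsoft.Spark.Sql;

SparkSession spark = SparkSession.Builder().GetOrCreate();

// Thin wrapper over the JVM call: the archive registers successfully, and the
// later deletion of a non-existent temp file happens inside Spark's own
// cleanup, which is why this looks like a Spark bug rather than a .NET one.
spark.SparkContext.AddArchive("my-archive.zip"); // hypothetical archive path
```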
@@ -839,4 +838,143 @@ private CommandExecutorStat ExecuteDataFrameGroupedMapCommand(
        return stat;
    }
}

internal class ArrowOrDataFrameCoGroupedMapCommandExecutor : ArrowOrDataFrameSqlCommandExecutor
Hi Igor, is it possible to split this PR a little bit? This PR is huge; it would be better for it to contain only Spark 3.5 support and .NET 8 support. We can leave other changes to future PRs.
Sure, Dan. I'll create a separate PR for CoGrouped UDFs and the binary serializer. The vast majority of the other fixes are related to each other, though, so the PR will still be relatively large.
At one point I was unsure whether this would ever get merged, so I ended up including in one place all the improvements I needed to properly test whether the library meets my requirements, so that if someone wants to build a version it's relatively simple to accomplish.
It would help this project if support for polyglot notebooks were included: #1178 (comment)
Is it possible to set up CI/CD so that each PR's usability can be tracked?
@SparkSnail
https://github.com/SparkSnail/spark/actions
Yes, the current test pipeline for the repo is broken; we are working to recover the pipeline for PRs.
Removed unnecessary refactoring and new features from the PR; preserved only .NET 8, Spark 3.5, and a few fixes.
Force-pushed from 333ea16 to 9e30f44.
Hi @grazy27, I got a basic idea of what is changed; overall, it looks good to me. One thing I found is that the content of the Scala files has not changed much. Could you please see if you can move the files instead of adding new ones? That helps highlight what is changed.
@@ -0,0 +1,91 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
Is this file copied from src/scala/microsoft-spark-3-2/pom.xml? Can you try to move it and then modify it? Similar to this: 80c745b
It highlights what is changed.
Sure, it's already done, but as a separate commit: e7eccdf. Please let me know if that's OK.
There are 4 commits in total: one for the copy-paste, one for the 3.5.1 fixes, one for .NET 8, and one for the Databricks fixes.
Looks good.
@wudanzy done, let's see if it passes the e2e tests with 2.4.
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
The build has passed; let's wait for the tests.
The 3.5.x tests failed: the Spark distributions are not downloaded.
We have to change https://github.com/dotnet/spark/blob/main/azure-pipelines-e2e-tests-template.yml. For 3.5.x, it uses hadoop 3 instead of hadoop 2.7.
Hi @grazy27, I also found that we need winutils for 3.5.x. I will take a look today to see how to compile one.
Oh, I mistook the above versions for Spark versions; they are actually hadoop versions, so we already have that. So it would work if you change https://github.com/dotnet/spark/blob/main/azure-pipelines-e2e-tests-template.yml
On it
@SparkSnail Hello! You may have more background on this; what's your opinion? Are there any limitations I'm missing, or is it worth giving it a try? I'm troubleshooting the failing build pipeline, and I can see that only 3.3.3 has a hadoop3 binary, while 3.3.4 has a hadoop2 one. Is it possible to switch to using hadoop 3 for 3.3.0+?
If so, I'd like to try keeping only one if:
@SparkSnail is on vacation; I think we can give it a try first.
Wise decision, thanks for the update :) Committed the change to use hadoop3 for 3.3.0+.
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
The tests passed! Thanks for the investigation and fix! It is too late today; I will review tomorrow.
    StageId = SerDe.ReadInt32(stream),
    PartitionId = SerDe.ReadInt32(stream),
    AttemptNumber = SerDe.ReadInt32(stream),
    AttemptId = SerDe.ReadInt64(stream),
};

// Needed for 3.3.0+
// https://issues.apache.org/jira/browse/SPARK-36173
private static TaskContext ReadTaskContext_3_3(Stream stream)
I am wondering if ReadTaskContext_3_3 can rely on ReadTaskContext_2_x.
Yep, it would be nice to reuse the common logic; let's have a look at it after removing support for the obsolete Spark versions.
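A hypothetical sketch of that reuse, grounded only in the fields visible in this diff; where the extra 3.3+ field sits in the stream is an assumption here:

```csharp
// Shared reader for the fields common to all supported Spark versions.
private static TaskContext ReadTaskContextCommon(Stream stream) =>
    new TaskContext
    {
        StageId = SerDe.ReadInt32(stream),
        PartitionId = SerDe.ReadInt32(stream),
        AttemptNumber = SerDe.ReadInt32(stream),
        AttemptId = SerDe.ReadInt64(stream),
    };

private static TaskContext ReadTaskContext_2_x(Stream stream) =>
    ReadTaskContextCommon(stream);

// SPARK-36173 (3.3.0+) adds the CPU count assigned to the task.
private static TaskContext ReadTaskContext_3_3(Stream stream)
{
    TaskContext taskContext = ReadTaskContextCommon(stream);
    _ = SerDe.ReadInt32(stream); // consume the cpus field (placement assumed)
    return taskContext;
}
```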
LGTM! And let's wait for @SparkSnail for another review.
Thanks, Dan, and happy New Year!
Thanks, happy New Year!
The change to use Hadoop 3 for Spark 3.3.3 was made because they don't provide a hadoop2 binary download on the official website for that version, and there is a hadoop2 binary again from Spark 3.3.4; hadoop3 was only a workaround for Spark 3.3.3, and we otherwise still use the hadoop2 binary to stay aligned with previous versions. I'm fine with upgrading to Hadoop 3 if Spark 3.5 also needs Hadoop 3.
Hi @grazy27, I remembered that you preferred rebase-and-merge, but that is disabled, and I didn't find a way to enable it.
Merged, thanks for your contribution! @grazy27
Changes:
Tested with:
Spark:
Databricks:
Fails on 15.4:
The following error occurs:
Works on 14.3:
Tested on Databricks 14.3, and it works. However, there is missing functionality for Vector UDFs. Since `UseArrow` is always set to `true` on Databricks, Vector UDFs do not function properly and can crash the entire job. This occurs because Spark splits a single expected `RecordBatch` into a collection of smaller batches, while the code assumes a single batch. Relevant Spark settings: `useArrow`, `maxRecordsPerBatch`.
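To illustrate the failure mode: with `spark.sql.execution.arrow.maxRecordsPerBatch` (default 10000), one logical group can arrive as several Arrow `RecordBatch`es, so the reading side must drain the stream rather than read once. A sketch using the Apache.Arrow package (illustrative wiring only, not the actual worker code):

```csharp
using System.Collections.Generic;
using System.IO;
using Apache.Arrow;
using Apache.Arrow.Ipc;

static IReadOnlyList<RecordBatch> ReadAllBatches(Stream input)
{
    // Drain every batch Spark sends for the group instead of assuming
    // exactly one; assuming a single batch is what crashes Vector UDFs
    // when maxRecordsPerBatch splits the group.
    var batches = new List<RecordBatch>();
    using var reader = new ArrowStreamReader(input, leaveOpen: true);
    RecordBatch batch;
    while ((batch = reader.ReadNextRecordBatch()) != null)
    {
        batches.Add(batch);
    }
    return batches;
}
```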
Affected Tickets: