Spark 3.5.3 support #1178
Conversation
@dotnet-policy-service agree
Can you share how many of the unit tests pass? The UDF unit tests have not been updated.
Hello @GeorgeS2019, they do. I saw your issue; my environment probably uses UTF-8 by default.
What is the status of this PR?
It works, the tests pass, and performance-wise it's the best solution I've found for integrating .NET with Spark. The next steps are on Microsoft's side. I'm also working on implementing CoGrouped UDFs, and I plan to push those updates here as well.
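For context, the grouped-map path that already exists in Microsoft.Spark hands each group to a .NET function as an in-memory Microsoft.Data.Analysis.DataFrame; CoGrouped UDFs extend the same idea to pairs of groups from two DataFrames. A minimal sketch of the existing grouped-map API (assuming the standard Microsoft.Spark and Microsoft.Data.Analysis packages; the identity transform is only for illustration):

```csharp
using System.Collections.Generic;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;
using FxDataFrame = Microsoft.Data.Analysis.DataFrame;

SparkSession spark = SparkSession.Builder().GetOrCreate();

DataFrame df = spark.CreateDataFrame(
    new List<GenericRow>
    {
        new GenericRow(new object[] { "a", 1 }),
        new GenericRow(new object[] { "a", 2 }),
        new GenericRow(new object[] { "b", 3 })
    },
    new StructType(new[]
    {
        new StructField("key", new StringType()),
        new StructField("value", new IntegerType())
    }));

// Grouped-map UDF: all rows for one key arrive together as an in-memory
// DataFrame. A CoGrouped UDF would instead receive one such batch from
// each of two cogrouped DataFrames.
DataFrame result = df.GroupBy("key").Apply(
    new StructType(new[]
    {
        new StructField("key", new StringType()),
        new StructField("value", new IntegerType())
    }),
    (FxDataFrame batch) => batch); // identity transform, for illustration
```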
Can you investigate whether your solution works in a polyglot .NET Interactive notebook? Previously we all had problems with UDFs after the migration to .NET 6.
Any idea who is "in charge" of this repo?
I can take a look, but only if a lonely evening with bad weather rolls around :) No promises, as this isn't my primary focus. There are two suggestions from developers that might help: the first is a separate code cell, and the second is a separate environment variable. Have you tried both approaches, and does the issue still persist?
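The environment-variable suggestion may refer to `DOTNET_SPARK_RUNNING_IN_NOTEBOOK` from the dotnet/spark UDF guidance; that this is the variable meant here is an assumption. The idea is to set it in an early cell, before any cell that defines a UDF:

```csharp
using System;

// Assumed variable name from the dotnet/spark notebook/UDF guidance; set it
// before any UDF is defined so the worker can resolve notebook-compiled assemblies.
Environment.SetEnvironmentVariable("DOTNET_SPARK_RUNNING_IN_NOTEBOOK", "true");
```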
Hi Ihor (@grazy27), thanks for the contribution! I recently got write permission for this repo and am happy to move this forward. Due to limited bandwidth on our team and other priorities, we don't have concrete work items for this project, but we can review your code and work together to move this forward!
Hello Dan (@wudanzy), that's fantastic news! I'd be happy to help with a few more issues to get this project back on track. In my opinion, the most important ones are:
Thanks for sharing that!
Can we split this PR a little bit? That would speed up the review.
}
catch (Exception)
{
    // It tries to delete a non-existent file, but other than that it's OK.
Are those exceptions expected? If we expect them, we could add some logic here; if not, we should fail the test in such cases.
My logic here is that since nothing related to this API changed inside Dotnet.Spark (it just calls AddArchive on the JVM SparkContext, and the archive is added successfully), it must be an internal bug in Spark itself.
I tested it with Scala directly, and it fails with the same exception.
I plan to test it more and report it to Spark later.
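A hedged repro sketch of the call path in question (assuming the AddArchive wrapper this PR exposes, which simply forwards to the JVM SparkContext.addArchive introduced in Spark 3.1; the archive path is hypothetical):

```csharp
using Microsoft.Spark.Sql;

SparkSession spark = SparkSession.Builder().GetOrCreate();

// Thin wrapper over the JVM call: the archive registers successfully, and the
// later deletion of a non-existent temp file happens inside Spark's own
// cleanup, which is why this looks like a Spark bug rather than a .NET one.
spark.SparkContext.AddArchive("my-archive.zip"); // hypothetical archive path
```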
@@ -839,4 +838,143 @@ private CommandExecutorStat ExecuteDataFrameGroupedMapCommand(
        return stat;
    }
}

internal class ArrowOrDataFrameCoGroupedMapCommandExecutor : ArrowOrDataFrameSqlCommandExecutor
Hi Igor, is it possible to split this PR a little bit? This PR is huge; it would be better for it to contain only Spark 3.5 support and .NET 8 support. We can leave other changes to future PRs.
Sure, Dan. I'll create a separate PR for CoGrouped UDFs and the binary serializer. The vast majority of the other fixes are related to each other, though, so the PR will still be relatively large.
At one point I was unsure whether this would ever get merged, so I ended up including in one place all the improvements I needed to properly test whether the library meets my requirements, so that if someone wants to build a version it's relatively simple to accomplish.
It would help this project if support for polyglot notebooks were included: #1178 (comment)
Is it possible to set up CI/CD so that each PR's usability can be tracked?
@SparkSnail
https://github.com/SparkSnail/spark/actions
Yes, the current test pipeline for the repo is broken; we are working to recover the pipeline for PRs.
Removed unnecessary refactoring and new features from the PR; preserved only .NET 8, Spark 3.5, and a few fixes.
Force-pushed from 333ea16 to 9e30f44.
Hi @grazy27, I got a basic idea of what is changed; overall, it looks good to me. One thing I found is that the content of the Scala files has not changed much. Could you please see if you can move the files instead of adding new ones? That helps highlight what is changed.
@@ -0,0 +1,91 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
Is this file copied from src/scala/microsoft-spark-3-2/pom.xml? Can you try to move it and then modify it? Similar to this: 80c745b
It highlights what is changed.
Sure, it's already done, but as a separate commit: e7eccdf. Please let me know if that's OK.
There are 4 commits in total: one for the copy-paste, one for the 3.5.1 fixes, one for .NET 8, and one for the Databricks fixes.
Looks good.
@wudanzy done, let's see if it passes the e2e tests with 2.4.
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
The build has passed; let's wait for the tests.
The 3.5.x tests failed: the Spark distributions are not downloaded.
We have to change https://github.com/dotnet/spark/blob/main/azure-pipelines-e2e-tests-template.yml. For 3.5.x, it uses hadoop 3 instead of hadoop 2.7.
Hi @grazy27, I also found that we need winutils for 3.5.x. I will take a look today to see how to compile one.
Oh, I mistook the above versions for Spark versions; they are actually hadoop versions, so we already have that. So it would work if you change https://github.com/dotnet/spark/blob/main/azure-pipelines-e2e-tests-template.yml
On it
@SparkSnail Hello! You may have more background on this; what's your opinion? Are there any limitations I'm missing, or is it worth giving it a try? I'm troubleshooting the failing build pipeline, and I can see that only 3.3.3 has a hadoop3 binary, while 3.3.4 has a hadoop2 one. Is it possible to switch to using hadoop 3 for 3.3.0+?
If so, I'd like to try keeping only one if:
@SparkSnail is on vacation; I think we can give it a try first.
Wise decision, thanks for the update :) Committed the change to use hadoop3 for 3.3.0+.
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
The tests passed! Thanks for the investigation and fix! It is too late today; I will review tomorrow.
    StageId = SerDe.ReadInt32(stream),
    PartitionId = SerDe.ReadInt32(stream),
    AttemptNumber = SerDe.ReadInt32(stream),
    AttemptId = SerDe.ReadInt64(stream),
};

// Needed for 3.3.0+
// https://issues.apache.org/jira/browse/SPARK-36173
private static TaskContext ReadTaskContext_3_3(Stream stream)
I am wondering if ReadTaskContext_3_3 can rely on ReadTaskContext_2_x.
Yep, it would be nice to reuse the common logic; let's have a look at it after removing support for the obsolete Spark versions.
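A hypothetical sketch of that reuse, grounded only in the fields visible in this diff; where the extra 3.3+ field sits in the stream is an assumption here:

```csharp
// Shared reader for the fields common to all supported Spark versions.
private static TaskContext ReadTaskContextCommon(Stream stream) =>
    new TaskContext
    {
        StageId = SerDe.ReadInt32(stream),
        PartitionId = SerDe.ReadInt32(stream),
        AttemptNumber = SerDe.ReadInt32(stream),
        AttemptId = SerDe.ReadInt64(stream),
    };

private static TaskContext ReadTaskContext_2_x(Stream stream) =>
    ReadTaskContextCommon(stream);

// SPARK-36173 (3.3.0+) adds the CPU count assigned to the task.
private static TaskContext ReadTaskContext_3_3(Stream stream)
{
    TaskContext taskContext = ReadTaskContextCommon(stream);
    _ = SerDe.ReadInt32(stream); // consume the cpus field (placement assumed)
    return taskContext;
}
```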
LGTM! And let's wait for @SparkSnail for another review.
Thanks, Dan, and happy New Year!
Thanks, happy New Year!
The change to use Hadoop 3 for Spark 3.3.3 was made because they don't provide a hadoop2 binary download on the official website for that version, and there is a hadoop2 binary again from Spark 3.3.4; hadoop3 was only a workaround for Spark 3.3.3, and we otherwise still use the hadoop2 binary to stay aligned with previous versions. I'm fine with upgrading to Hadoop 3 if Spark 3.5 also needs Hadoop 3.
Hi @grazy27, I remembered that you preferred rebase-and-merge, but that is disabled, and I didn't find a way to enable it.
Merged, thanks for your contribution! @grazy27
Changes:
Tested with:
Spark:
Databricks:
Fails on 15.4:
The following error occurs:
Works on 14.3:
Tested on Databricks 14.3, and it works. However, there is missing functionality for Vector UDFs. Since `UseArrow` is always set to `true` on Databricks, Vector UDFs do not function properly and can crash the entire job. This occurs because Spark splits a single expected `RecordBatch` into a collection of smaller batches, while the code assumes a single batch. Relevant Spark settings: `useArrow`, `maxRecordsPerBatch`.
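To illustrate the failure mode: with `spark.sql.execution.arrow.maxRecordsPerBatch` (default 10000), one logical group can arrive as several Arrow `RecordBatch`es, so the reading side must drain the stream rather than read once. A sketch using the Apache.Arrow package (illustrative wiring only, not the actual worker code):

```csharp
using System.Collections.Generic;
using System.IO;
using Apache.Arrow;
using Apache.Arrow.Ipc;

static IReadOnlyList<RecordBatch> ReadAllBatches(Stream input)
{
    // Drain every batch Spark sends for the group instead of assuming
    // exactly one; assuming a single batch is what crashes Vector UDFs
    // when maxRecordsPerBatch splits the group.
    var batches = new List<RecordBatch>();
    using var reader = new ArrowStreamReader(input, leaveOpen: true);
    RecordBatch batch;
    while ((batch = reader.ReadNextRecordBatch()) != null)
    {
        batches.Add(batch);
    }
    return batches;
}
```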
Affected Tickets: