[SPARK-54223][PYTHON] Add task context and data metrics to Python runner logs #52931
+49
−23
What changes were proposed in this pull request?
Currently, the log messages emitted by PythonRunner and related Python execution classes during Python UDF execution do not include Spark task context information.
This makes it harder to correlate Python worker timing metrics and data processing statistics with the specific Spark tasks that executed the UDFs, especially when debugging performance issues or data skew in production environments.
This improvement adds task context details along with data processing metrics to the log statements in PythonRunner and PythonUDFRunner classes to enhance traceability and debugging of Python UDF execution.
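As a rough illustration of the idea, the sketch below shows how a runner log line could carry task identifiers alongside timing and data metrics. It is not the code or the log format from this patch: `TaskInfo`, `format_runner_log`, the field names, and the message layout are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInfo:
    """Hypothetical stand-in for the identifiers a Spark task context exposes."""
    stage_id: int
    partition_id: int
    task_attempt_id: int


def format_runner_log(task: TaskInfo, total_ms: int, records: int, bytes_read: int) -> str:
    """Build one log line that pairs task identifiers with worker metrics.

    The layout is an assumption made for illustration; the actual format
    produced by the patch is not reproduced here.
    """
    # Prefix with the task identifiers so the line can be matched against
    # the task execution logs written by the same executor.
    prefix = (
        f"[stage={task.stage_id} partition={task.partition_id} "
        f"TID={task.task_attempt_id}]"
    )
    # Append the timing and data metrics reported for the Python worker.
    return f"{prefix} python worker total={total_ms} ms, records={records}, bytes={bytes_read}"


if __name__ == "__main__":
    print(format_runner_log(TaskInfo(3, 7, 42), total_ms=120, records=1000, bytes_read=8192))
```

With identifiers in every line, filtering an executor log by a single task ID would return both the task lifecycle messages and the Python runner messages for that task.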
Current Behaviour
When examining executor logs, there is a disconnect between task execution logs and Python runner logs:
Expected Behaviour
After this enhancement, logs include task context information and data metrics:
Why are the changes needed?
Enable seamless correlation between task execution and Python UDF operations:
Does this PR introduce any user-facing change?
No
How was this patch tested?
Run existing test suite:
./build/mvn -pl core -am test -DwildcardSuites=org.apache.spark.deploy.PythonRunnerSuite
Result:
Was this patch authored or co-authored using generative AI tooling?
No