[SPARK-54153][PYTHON] Support profiling iterator based Python UDFs #52853
base: master
Conversation
The OG profilers via …
python/pyspark/worker.py (Outdated)
```python
return profiling_func

def wrap_iter_perf_profiler(f, result_id):
```
Do we need a new wrap function here? We only need a different `profiling_func`, right? Should we do `wrap_perf_profiler(f, eval_type, result_id)` and put all the logic inside that function?
Yeah, I thought about doing that; I can update.
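For reference, a consolidated wrapper along those lines might look roughly like this. It is only a sketch built on plain `cProfile`: `ITER_UDF_EVAL_TYPES` is a hypothetical placeholder for the iterator eval types, and the real code would accumulate stats under `result_id` instead of discarding them.

```python
import cProfile

ITER_UDF_EVAL_TYPES = frozenset()  # hypothetical: the iterator-based eval types


def wrap_perf_profiler(f, eval_type, result_id):
    pr = cProfile.Profile()

    if eval_type in ITER_UDF_EVAL_TYPES:
        def profiling_func(*args, **kwargs):
            # A single runcall() would only capture generator creation,
            # because the body runs lazily; instead, keep the profiler
            # enabled across each next().
            iterator = iter(f(*args, **kwargs))
            while True:
                pr.enable()
                try:
                    value = next(iterator)
                except StopIteration:
                    break
                finally:
                    pr.disable()
                yield value
    else:
        def profiling_func(*args, **kwargs):
            return pr.runcall(f, *args, **kwargs)

    return profiling_func
```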
```python
return f, args_offsets

def _supports_profiler(eval_type: int) -> bool:
```
cc @ueshin to check whether iterator-based UDFs can be supported in the memory profiler now
There is some code to support the memory profiler: https://github.com/apache/spark/pull/52853/files#diff-e91c3ea5c69ba1ce65bbaae56a7164a927bb013aac053d5167936ce9c9522051R1228?
Also, the description contains the example output.
cc @allisonwang-db too
ueshin left a comment:
This is awesome! Let me try something locally, but basically LGTM.
```diff
   jobArtifactUUID,
-  sessionUUID)
+  sessionUUID,
+  None)
```
Can we also enable it here? IIUC, this will run on executors, so it would help to debug Python data sources. cc @allisonwang-db
Yep, added it and a simple test for it.
| @pandas_udf("long") | ||
| def add1(x): | ||
| return x + 1 | ||
| def add1(iter: Iterator[pd.Series]) -> Iterator[pd.Series]: |
The original code had two UDFs because one of them was not supported; it tested how an unsupported UDF behaves alongside a supported one. Now that the two UDFs are identical, that no longer makes sense, and it's actually a bit confusing.
I would suggest keeping this simple: just test how a single iterator UDF works with the profiler and confirm that we can read the profiling data.
Yeah, I wasn't sure how much to modify; I can update.
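For illustration, a simplified test in that direction could look roughly like this (assuming PySpark's usual test fixtures such as `self.spark` and the `sql_conf` context manager; this is a sketch, not the PR's actual test):

```python
from typing import Iterator

import pandas as pd
from pyspark.sql.functions import pandas_udf


def test_perf_profiler_pandas_udf_iterator(self):
    @pandas_udf("long")
    def add1(it: Iterator[pd.Series]) -> Iterator[pd.Series]:
        for s in it:
            yield s + 1

    with self.sql_conf({"spark.sql.pyspark.udf.profiler": "perf"}):
        self.spark.range(10).select(add1("id")).collect()

    # The real test would assert on the collected results; rendering them
    # is enough here to confirm profiling data exists and is readable.
    self.spark.profile.show(type="perf")
```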
python/pyspark/worker.py (Outdated)
```python
def profiling_func(*args, **kwargs):
    profiler = UDFLineProfilerV2()

    wrapped = profiler(f)
```
I would strongly suggest making the two `profiling_func`s use the same path: either use `add_function` and then `with`, or call the profiler directly. I'm assuming the wrapped approach is not easily applicable to iterators. I think this will eliminate a lot of confusion.
Yeah, I agree; updated to use the same context-manager approach. And yes, the explicit `add_function` is necessary for the iterator approach to work.
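For context, the shared context-manager path with an explicit `add_function` might look roughly like this (a sketch against the standalone `memory_profiler` package, not the PR's actual `UDFLineProfilerV2` code):

```python
from memory_profiler import LineProfiler


def wrap_iter_memory_profiler(f, result_id):
    profiler = LineProfiler()
    # Register f explicitly: wrapping it would only profile generator
    # creation, while add_function also covers the lazily-run body.
    profiler.add_function(f)

    def profiling_func(*args, **kwargs):
        # The context manager enables/disables profiling by count, so it
        # stays active while the generator body executes between yields.
        with profiler:
            for out in f(*args, **kwargs):
                yield out
        # The real code would store the collected results under result_id.

    return profiling_func
```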
python/pyspark/worker.py (Outdated)
```python
if _supports_profiler(eval_type) and has_memory_profiler:
    profiling_func = wrap_memory_profiler(chained_func, result_id)
else:
    if not has_memory_profiler:
```
We can probably move this check into wrap_memory_profiler() and let it handle the logic; I think we should generally handle this kind of logic in one place.
Makes sense, done
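Sketching the shape of that refactor (reusing `_supports_profiler` from the diff above; bodies are stubbed, not the merged code):

```python
try:
    from memory_profiler import LineProfiler  # noqa: F401

    has_memory_profiler = True
except ImportError:
    has_memory_profiler = False


def wrap_memory_profiler(f, eval_type, result_id):
    # Availability and eval-type checks now live here, so callers don't
    # branch; they simply fall back when None is returned.
    if not has_memory_profiler or not _supports_profiler(eval_type):
        return None

    def profiling_func(*args, **kwargs):
        ...  # profile f as in the context-manager sketch above

    return profiling_func
```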
HyukjinKwon left a comment:
Awesome
What changes were proposed in this pull request?
Updates the v2, Spark-session-based Python UDF profiler to support profiling iterator-based UDFs.
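For example, a rough usage sketch (assuming an active `spark` session; result ids vary per run):

```python
from typing import Iterator

import pandas as pd
from pyspark.sql.functions import pandas_udf

# Enable the session-based (v2) perf profiler via SQL conf.
spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")


@pandas_udf("long")
def add1(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for s in batches:
        yield s + 1


spark.range(10).select(add1("id")).collect()

# Render the collected per-UDF stats; previously, iterator-based UDFs
# produced no profiling results.
spark.profile.show(type="perf")
```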
Why are the changes needed?
To extend profiling support to all types of UDFs.
Does this PR introduce any user-facing change?
Yes, iterator-based Python UDFs can now be profiled with the SQL-config-based profiler.
How was this patch tested?
Updated the UTs that previously asserted this was unsupported to verify that it is now supported.
Was this patch authored or co-authored using generative AI tooling?
No