
[SPARK-52594][DOCS] Workaround pandas version string issue in doc generation #51301


Closed
wants to merge 1 commit into from


HyukjinKwon
Member

What changes were proposed in this pull request?

This PR works around the pandas version string issue (pandas-dev/pandas#61579) by manually stripping the part of the version string after the `+`.
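As an illustration of "removing the string after +", here is a minimal sketch. The helper name `strip_local_version` is hypothetical and is not taken from the patch; the input is the version string reported in the failing build.

```python
def strip_local_version(version: str) -> str:
    """Return the part of a version string before the first '+'."""
    # str.partition returns the whole string as the head when '+' is absent,
    # so plain release versions pass through unchanged.
    head, _sep, _local = version.partition("+")
    return head


print(strip_local_version("2.3.0+4.g1dfc98e16a"))  # 2.3.0
print(strip_local_version("2.3.0"))  # 2.3.0
```

The suffix after `+` is what PEP 440 calls a local version label, so dropping it leaves the public release version that the check expects.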

Why are the changes needed?

The Spark release build currently fails: https://github.com/apache/spark/actions/runs/15920389407/job/44905725248

...
2025-06-27T08:41:09.8092197Z Traceback (most recent call last):
2025-06-27T08:41:09.8092911Z   File "/usr/local/lib/python3.9/dist-packages/sphinx/config.py", line 332, in eval_config_file
2025-06-27T08:41:09.8093443Z     exec(code, namespace)
2025-06-27T08:41:09.8093964Z   File "/opt/spark-rm/output/spark/python/docs/source/conf.py", line 33, in <module>
2025-06-27T08:41:09.8094479Z     generate_supported_api(output_rst_file_path)
2025-06-27T08:41:09.8095154Z   File "/opt/spark-rm/output/spark/python/pyspark/pandas/supported_api_gen.py", line 102, in generate_supported_api
2025-06-27T08:41:09.8096186Z     _check_pandas_version()
2025-06-27T08:41:09.8096783Z   File "/opt/spark-rm/output/spark/python/pyspark/pandas/supported_api_gen.py", line 116, in _check_pandas_version
2025-06-27T08:41:09.8097420Z     raise ImportError(msg)
2025-06-27T08:41:09.8097843Z ImportError: Warning: pandas 2.3.0 is required; your version is 2.3.0+4.g1dfc98e16a
...
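The failure mode in the log can be reproduced with a small sketch. This assumes the check is an exact comparison against the pinned version, which the log suggests but does not show; `check_pandas_version` and `try_check` are hypothetical names, and only the message format and the two version strings come from the traceback above.

```python
PANDAS_LATEST_VERSION = "2.3.0"  # the required version reported in the log


def check_pandas_version(installed: str, required: str = PANDAS_LATEST_VERSION) -> None:
    """Raise ImportError unless the installed version equals the required one."""
    # Assumption: an exact-match comparison, so any suffix causes a mismatch.
    if installed != required:
        raise ImportError(
            "Warning: pandas %s is required; your version is %s" % (required, installed)
        )


def try_check(installed: str) -> str:
    """Run the check and return 'ok' or the error message."""
    try:
        check_pandas_version(installed)
    except ImportError as exc:
        return str(exc)
    return "ok"


print(try_check("2.3.0+4.g1dfc98e16a"))
print(try_check("2.3.0"))
```

Under this assumption, the first call yields the same message as the build log, while the plain `2.3.0` string passes.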

Does this PR introduce any user-facing change?

No, dev-only.

How was this patch tested?

Will monitor the build

Was this patch authored or co-authored using generative AI tooling?

No.

@yaooqinn
Member

Just out of curiosity, why do we need to validate against this hard-coded PANDAS_LATEST_VERSION? Looking back at the commit history of this line, increasing PANDAS_LATEST_VERSION has never broken anything API-related or the documentation generation with the previously used pandas versions.

@HyukjinKwon
Member Author

Because the docs need a consistent list of APIs. If the pandas version differs, the PySpark documentation would end up with an inconsistent list of supported APIs.
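To make the consistency concern concrete, here is a rough sketch of how a generated API list can drift when the underlying library version changes. `FrameOld` and `FrameNew` are hypothetical stand-ins for the same class in two pandas releases, and `public_api` is an illustrative helper, not the generator's actual code.

```python
from typing import Set


def public_api(cls: type) -> Set[str]:
    """Collect the public attribute names exposed by a class."""
    return {name for name in dir(cls) if not name.startswith("_")}


class FrameOld:  # hypothetical: the class as shipped in an older release
    def head(self): ...
    def append(self): ...


class FrameNew:  # hypothetical: the same class in a newer release
    def head(self): ...
    def map(self): ...


# Names that would disappear from, or be added to, a generated API table.
removed = sorted(public_api(FrameOld) - public_api(FrameNew))
added = sorted(public_api(FrameNew) - public_api(FrameOld))
print(removed, added)  # ['append'] ['map']
```

A table built from whichever version happens to be installed would differ between builds, which is the inconsistency that pinning avoids.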

@HyukjinKwon
Member Author

In short, we use pandas for PySpark doc generation, so we need to pin the version for the official release.

@HyukjinKwon
Member Author

Merged to master.

@yaooqinn
Member

So can it be removed, now that we have pinned the version in the GA release steps?
