Description
Currently, the read and write functionality in pyspark.sql.readwriter supports paths of the form PathOrPaths = Union[str, List[str]].
pathlib's Path is a widely used way to manage path-like objects in Python and is heavily adopted by the community. Allowing the readers and writers to consume path objects makes PySpark more Python-"native" by accepting a commonly used standard-library data structure.
Supporting os.PathLike objects would reduce friction between Spark and Python.
Motivation
pathlib is part of the Python standard library and is widely adopted across the ecosystem due to its improved readability and safety compared to raw strings. Many Python libraries already accept PathLike objects (via os.PathLike), making this a natural step toward aligning PySpark with modern Python.
Users working with PySpark often need to manually convert Path objects to strings before passing them into Spark APIs. This introduces unnecessary friction and deviates from commonly adopted Python practices.
Supporting PathLike objects would:
- Align PySpark with modern Python standards
- Reduce boilerplate conversions (str(path) or os.fspath(path))
- Make PySpark feel more "native" in Python environments
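To illustrate the friction, a short runnable sketch of what callers must do today. The spark.read.parquet call is shown only in a comment; os.fspath is the standard-library way to obtain the string path of any PathLike object:

```python
import os
from pathlib import Path

p = Path("data") / "input.parquet"

# Today, callers must convert manually before handing the path to PySpark,
# e.g. spark.read.parquet(str(p))
as_str = str(p)

# os.fspath is the canonical conversion for any os.PathLike (and is a
# no-op for plain strings)
assert os.fspath(p) == as_str
assert os.fspath("already/a/string") == "already/a/string"
```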
Proposed Change
Extend the accepted input types for path arguments in pyspark.sql.readwriter from:
PathOrPaths = Union[str, List[str]]
to:
PathOrPaths = Union[str, os.PathLike, List[Union[str, os.PathLike]]]
Internally, PathLike objects would be normalized back to strings before being passed along to the underlying JVM reader (jreader).
This change is fully backward compatible, as it only expands the accepted input types without altering existing behavior.
The proposed change increases the public API's flexibility without breaking the existing contract.
Contribution
I would be happy to pick this up, but would like to air the proposal for others' thoughts on the matter before working on it!