Use arrow package to move data to/from pyspark? #31

nealrichardson · 2023-09-19T20:40:25Z

Thinking along the lines of https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/, there could be some performance gains from using Arrow to move the data rather than going through pandas. The arrow R package also supports ALTREP now, and using Arrow could help prevent lossy type conversions.

As with sparklyr, you could put arrow in Suggests so that it isn't a hard requirement. Or you could use https://github.com/apache/arrow-nanoarrow, which is a much lighter dependency.


edgararuiz · 2023-09-30T15:32:22Z

Per this article, https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html, we will need to add pyarrow to the Python libraries that install_...() installs, and also enable the spark.sql.execution.arrow.pyspark.enabled Spark session configuration.
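For illustration, here is a rough sketch of what those two steps could look like on the Python side. The helper names (`pyarrow_available`, `arrow_conf`, `apply_arrow_conf`) are made up for this example and are not part of any package; only the two configuration keys come from the Spark documentation. Turning on the fallback setting is an assumption about what would be wanted here, not something decided in this thread.

```python
import importlib.util

# Spark session settings described in the PySpark "Arrow in pandas" guide.
ARROW_ENABLED = "spark.sql.execution.arrow.pyspark.enabled"
ARROW_FALLBACK = "spark.sql.execution.arrow.pyspark.fallback.enabled"


def pyarrow_available():
    """Return True if pyarrow can be imported in this Python environment."""
    return importlib.util.find_spec("pyarrow") is not None


def arrow_conf(enabled=True):
    """Hypothetical helper: build the Arrow-related session settings."""
    flag = "true" if enabled else "false"
    return {
        ARROW_ENABLED: flag,
        # Assumption: fall back to the non-Arrow path if a conversion fails.
        ARROW_FALLBACK: "true",
    }


def apply_arrow_conf(builder, conf=None):
    """Apply the settings to anything with a SparkSession.builder-style
    .config(key, value) method and return the resulting builder."""
    if conf is None:
        # Only request Arrow when pyarrow is actually installed.
        conf = arrow_conf(enabled=pyarrow_available())
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Usage with a real session (requires pyspark to be installed):
#   from pyspark.sql import SparkSession
#   spark = apply_arrow_conf(SparkSession.builder).getOrCreate()
#   pdf = spark.range(5).toPandas()
```

Keeping the settings in a plain dictionary means the same values could also be passed through whatever mechanism the R side uses to configure the session.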

edgararuiz added the in-release label Sep 27, 2023

edgararuiz removed the in-release label Sep 30, 2023
