PySpark Arrow Stream Serializer #3

BryanCutler · 2017-05-22T21:37:51Z

Enable UDF evaluation with Arrow, using the stream format to load data as Pandas Series. PythonRDD is modified to support this while maintaining backwards compatibility.


BryanCutler · 2017-05-22T22:10:27Z

@icexelloss, here is what I had so far. Feel free to use what you like and let me know if you have any questions.


icexelloss · 2017-05-24T14:09:44Z

Thanks Bryan! This is quite a big change. I will take a look this week.

BryanCutler · 2017-05-24T17:52:49Z

Sure, no problem!

icexelloss · 2017-05-24T20:44:11Z

Bryan,

You mentioned a ~2.5x speed up comparing this and the original udf methods. How did you run the experiments? I am trying to reproduce your results.

BryanCutler · 2017-05-25T22:50:32Z

I was basically using the code below, manually turning Arrow on and off by commenting out a couple of lines (I left a note in the code as to what needs to be commented out):

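import timeit
import pandas as pd
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType

# Assumes an active SparkSession bound to `spark` (e.g. the pyspark shell).
nrows = 1 << 24
df = spark.range(0, nrows, 1, 4).toDF("a").cache()
# Python UDF under test; returns 1 for odd values and 0 for even ones.
is_odd = udf(lambda n: n % 2, LongType())
odd = df.withColumn("is_odd", is_odd(col("a")))
flt_odd = odd.filter("is_odd == 1")
# Time 10 independent runs of the full job, one count() per measurement.
t = timeit.repeat(lambda: flt_odd.count(), repeat=10, number=1)
time_df = pd.Series(t)
print(time_df.describe())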
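For anyone trying to reproduce the comparison, the timing method above can be separated from Spark entirely. The following is only a minimal sketch of the measure-twice-and-compare pattern, using the standard library alone. The helper names (`time_runs`, `summarize`, `speedup`) and the two stand-in workloads are hypothetical and are not part of this PR; they do not exercise Arrow or PySpark, so they say nothing about the ~2.5x figure itself.

```python
import statistics
import timeit


def time_runs(fn, repeat=10):
    """Time `fn` `repeat` times with one call per measurement,
    mirroring timeit.repeat(..., repeat=10, number=1) in the snippet above."""
    return timeit.repeat(fn, repeat=repeat, number=1)


def summarize(times):
    """Summary statistics similar in spirit to pandas' Series.describe()."""
    return {
        "count": len(times),
        "mean": statistics.mean(times),
        "std": statistics.stdev(times),
        "min": min(times),
        "median": statistics.median(times),
        "max": max(times),
    }


def speedup(baseline_times, candidate_times):
    """Median baseline time divided by median candidate time.
    A value above 1 means the candidate configuration was faster."""
    return statistics.median(baseline_times) / statistics.median(candidate_times)


# Hypothetical stand-ins for the two configurations being compared.
# They only give the harness something to measure.
is_odd = lambda n: n % 2


def per_row_calls(n=200_000):
    # One Python function call per element.
    return sum(is_odd(i) for i in range(n))


def inline_expression(n=200_000):
    # Same result without the per-element function call.
    return sum(i % 2 for i in range(n))


if __name__ == "__main__":
    baseline = time_runs(per_row_calls)
    candidate = time_runs(inline_expression)
    print(summarize(baseline))
    print(summarize(candidate))
    print("speed-up (median):", speedup(baseline, candidate))
```

With the real benchmark, the two lists of timings would come from running the Spark snippet once with the Arrow path enabled and once with it commented out. Comparing medians rather than means is a design choice here, made to reduce the influence of a slow first run.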


BryanCutler force-pushed the pandas-udf-integration branch from f92d865 to e54cd16 on May 22, 2017 21:52

BryanCutler added 2 commits May 22, 2017 15:02

upgrade to use Arrow 0.4

f4a7baf

removed exclusion for log4j-over-slf4j as it had been moved to test scope in Arrow

57a2cde

BryanCutler force-pushed the pandas-udf-integration branch from e54cd16 to 0f294c2 on May 22, 2017 22:05

Enable UDF evaluation with Arrow using stream format to load as Pandas Series, modified PythonRDD to support this and maintain backwards compatibility

45db636

BryanCutler force-pushed the pandas-udf-integration branch from 0f294c2 to 45db636 on May 22, 2017 22:11

icexelloss pushed a commit that referenced this pull request Aug 15, 2019

null values properly returned (#3)

9edb9b3

* null values properly returned
* create joined row object
