
[WIP] Auto infer schema (including fields shape) from the first row #512

Open · wants to merge 6 commits into master
Conversation

@WeichenXu123 WeichenXu123 (Collaborator) commented Mar 23, 2020

What issues does this PR address?

There are two issues in make_batch_reader: one is critical, and the other is less critical but a pain point.

(Critical) Schema inference in make_batch_reader cannot infer fields' shape information

Because there is no shape information, when we make a TensorFlow dataset from the reader and apply dataset operations such as unroll, batch, or reshaping a field, errors may occur: TensorFlow graph operators depend heavily on field shape information.

(Pain point) TransformSpec requires users to specify edited/removed fields manually

We would like the user to provide only a transform function, and have petastorm automatically infer the result schema from the pandas DataFrame that the transform function outputs.
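As a sketch of what such inference could look like (a hypothetical helper, not this PR's actual implementation), the result schema can be derived from the first row of the transform function's output DataFrame:

```python
import numpy as np
import pandas as pd

def infer_fields(df):
    # Hypothetical helper: derive (name, dtype, shape) triples from the
    # first row of a transform function's output pandas DataFrame.
    fields = []
    for name in df.columns:
        value = df[name].iloc[0]
        arr = np.asarray(value)
        fields.append((name, arr.dtype, arr.shape))
    return fields

# Example output of a transform that reshapes an array column
# and shifts an id column.
out = pd.DataFrame({'id': [10000],
                    'v': [np.zeros((2, 5), dtype=np.float32)]})
print(infer_fields(out))
# e.g. [('id', dtype('int64'), ()), ('v', dtype('float32'), (2, 5))]
```

A real implementation would also need to map numpy dtypes back to petastorm/Arrow field types, which this sketch glosses over.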

The approach in this PR

  • Add a method ArrowReaderWorker.infer_schema_from_first_row that first reads one row and infers the schema from it, so we can obtain accurate shape information.
  • Add a param infer_schema_from_first_row to make_batch_reader (default off, so API behavior is unchanged).

Limitations:

  • For all rows (before applying predicates), every value in each field must be non-null and have the same shape.
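The limitation can be illustrated with a small standalone check (hypothetical helper name; not code from this PR):

```python
import numpy as np
import pandas as pd

def check_field(series):
    # Enforce the stated limitation: every value in the column must be
    # non-null and have the same shape as the first value.
    expected = np.asarray(series.iloc[0]).shape
    for value in series:
        if value is None:
            raise ValueError('null values are not supported')
        if np.asarray(value).shape != expected:
            raise ValueError('values with different shapes are not supported')
    return expected

ok = pd.Series([np.zeros((2, 5)), np.ones((2, 5))])
print(check_field(ok))  # (2, 5)

bad = pd.Series([np.zeros((2, 5)), np.zeros(10)])
try:
    check_field(bad)
except ValueError as e:
    print(e)  # values with different shapes are not supported
```

If the first row is not representative (e.g. a later row is null or differently shaped), inference from a single row silently produces a schema the rest of the data violates, which is why the limitation matters.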

Test

Unit tests are still to be added, but the PR is ready for a first review.

Example code

import numpy as np
import pandas as pd
import tensorflow as tf
from pyspark.sql.functions import pandas_udf

from petastorm.spark import make_spark_converter

# Assumes an active SparkSession bound to `spark` (e.g. a PySpark shell
# or Databricks notebook).
spark.conf.set('petastorm.spark.converter.parentCacheDirUrl', 'file:/tmp/converter')

@pandas_udf('array<float>')
def gen_array(v):
    return v.map(lambda x: np.random.rand(10))

df1 = spark.range(10).withColumn('v', gen_array('id')).repartition(2)
cv1 = make_spark_converter(df1)

# We can auto-infer a one-dimensional array shape.
with cv1.make_tf_dataset(batch_size=4, num_epochs=1) as dataset:
    it = dataset.make_one_shot_iterator()
    next_op = it.get_next()
    with tf.Session() as sess:
        for i in range(3):
            batch = sess.run(next_op)
            print(batch)


def preproc_fn(x):
    # Reshape column 'v' to shape (2, 5) and shift column 'id'.
    return pd.DataFrame({'v': x['v'].map(lambda v: v.reshape((2, 5))),
                         'id': x['id'] + 10000})

# Now we can auto-infer a multi-dimensional array shape.
with cv1.make_tf_dataset(batch_size=4, preprocess_fn=preproc_fn, num_epochs=1) as dataset:
    it = dataset.make_one_shot_iterator()
    next_op = it.get_next()
    with tf.Session() as sess:
        for i in range(3):
            batch = sess.run(next_op)
            print(batch)

@WeichenXu123 WeichenXu123 changed the title [WIP] Auto infer schema from first row [WIP] Auto infer schema (including fields shape) from the first row Mar 23, 2020

codecov bot commented Mar 23, 2020

Codecov Report

Merging #512 into master will decrease coverage by 0.16%.
The diff coverage is 72.91%.


@@            Coverage Diff             @@
##           master     #512      +/-   ##
==========================================
- Coverage   86.02%   85.86%   -0.17%     
==========================================
  Files          81       81              
  Lines        4402     4442      +40     
  Branches      704      713       +9     
==========================================
+ Hits         3787     3814      +27     
- Misses        504      511       +7     
- Partials      111      117       +6
Impacted Files Coverage Δ
petastorm/tf_utils.py 80.91% <ø> (ø) ⬆️
petastorm/spark/spark_dataset_converter.py 87.5% <25%> (-3.13%) ⬇️
petastorm/reader.py 90.32% <77.77%> (-0.68%) ⬇️
petastorm/arrow_reader_worker.py 90.34% <83.87%> (-1.66%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0b70510...529cb83.

@liangz1 liangz1 (Collaborator) left a comment


A convenient feature that would simplify the schema issue! I left a couple of questions.

@@ -168,6 +170,38 @@ def process(self, piece_index, worker_predicate, shuffle_row_drop_partition):
if all_cols:
self.publish_func(all_cols)

def infer_schema_from_first_row(self):
Collaborator

nit: I'm not sure whether partition[0] necessarily contains the "first" row — could the partitions be out of order? If so, we may want to call it infer_schema_from_a_row.

Collaborator Author

Here I read the first row of the index-0 row group. But which row group has index 0 may be non-deterministic? Not sure. infer_schema_from_a_row sounds good.


if 'transform_spec' in petastorm_reader_kwargs or \
'infer_schema_from_first_row' in petastorm_reader_kwargs:
raise ValueError('User cannot set transform_spec and infer_schema_from_first_row '
Collaborator

Shall we also allow users to combine transform_spec and infer_schema_from_first_row? Keeping transform_spec would make this consistent with the rest of the petastorm library.

Collaborator Author

I think the param preprocess_fn covers the functionality of transform_spec and is easier to use (it can auto-infer the result schema), so I forbid combining the two params.

liangz1 added a commit to liangz1/petastorm that referenced this pull request Mar 24, 2020
@WeichenXu123 (Collaborator Author)

I created a simple PR to address issue 1: #517
We can merge that one first.
This PR could be long-term work.
