
Fix parquet file loading #16

Open · wants to merge 3 commits into base: master

Conversation

@jm771 jm771 commented Jul 7, 2017

As of Spark 2.0, Spark stopped guaranteeing that Parquet files will be loaded in the order they were saved to disk. This is discussed in https://issues.apache.org/jira/browse/SPARK-20144, which I believe was reported by @icexelloss.
The problem is illustrated by the test "load from parquet" in TimeSeriesRDDSpec.scala, which fails without the rest of this pull request.

However, this issue can be resolved by sorting the partitions by the headers we've already loaded for them. The logic for this sorting lives in HeaderOrdering in RangeDependency.scala. Where two partitions contain identical keys, I've fallen back to the old behaviour of sorting by partition index, because without that fallback a lot of existing unit tests fail.
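
For illustration, here is roughly the shape of that ordering. The Header case class and HeaderOrdering below are simplified stand-ins for this comment, not Flint's actual classes; they just show the firstKey/secondKey comparison with the partition-index fallback:

```scala
// Minimal sketch of the ordering idea; simplified stand-in types, not Flint's real code.
case class Header[K](partitionIndex: Int, firstKey: K, secondKey: Option[K])

class HeaderOrdering[K](implicit keyOrd: Ordering[K]) extends Ordering[Header[K]] {
  private val secondKeyOrd: Ordering[Option[K]] = Ordering.Option(keyOrd)

  override def compare(a: Header[K], b: Header[K]): Int = {
    // Sort primarily by the keys already read from each partition's header.
    val byFirst = keyOrd.compare(a.firstKey, b.firstKey)
    if (byFirst != 0) byFirst
    else {
      val bySecond = secondKeyOrd.compare(a.secondKey, b.secondKey)
      // Fall back to the old partition-index order when the keys are identical,
      // so the existing unit tests keep their expected partition layout.
      if (bySecond != 0) bySecond
      else Integer.compare(a.partitionIndex, b.partitionIndex)
    }
  }
}

// Usage: headers saved out of key order come back sorted by key.
val headers = Seq(Header(0, 100L, Some(150L)), Header(1, 0L, Some(50L)))
val sorted  = headers.sorted(new HeaderOrdering[Long])
// sorted == Seq(Header(1, 0L, Some(50L)), Header(0, 100L, Some(150L)))
```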

This logic is tested in RangeDependencySpec.scala

The rest of the pull request is just swapping from sorting by partition index to sorting with this new ordering, and, in Conversion.scala, preventing the PartitionsIterator from undoing our hard work sorting things. Given that we've now sorted our headers by their keys, the check in RangeDependency.scala that headers are sorted by keys is no longer necessary.

CLA has been emailed in.

@icexelloss icexelloss (Member) commented Jul 10, 2017

@jm771 Thank you for the patch. This makes sense. Internally we have patched Spark to fix this issue, but it makes sense to do it in Flint as well.

This somewhat changes the meaning of "sorted" from "the data is sorted" to "the data inside each partition is sorted". I think this is fine, but I need to think more carefully about any possible undesirable behavior.

@jm771 jm771 (Author) commented Jul 11, 2017

I believe we don't really change that understanding at any level other than the lowest. At the end of fromSortedRDD we wrap our newly normalised RDD in a fresh OrderedRDD, and that RDD has its partitions sorted so that the data values increase with partition index. So for a "sorted RDD" that we are loading from, we no longer assume the partitions will be ordered, but an OrderedRDD definitely will still have sorted partitions.

The other thing worth mentioning is that I only fixed the path that TimeSeriesRDD.fromParquet takes. So if you ever wanted to write a method called, say, "fromNormalizedParquet", then the fromNormalizedSortedRDD method would also need tweaking.

s"the partition ${h2.partition.index} has the first key ${h2.firstKey}.")
}
}
val sortedHeaders = headers.sortBy(x => x).toArray

@icexelloss icexelloss (Member) Jul 12, 2017

I think sorting by only firstKey and secondKey is not sufficient.

For instance, with:

Header(Partition(0), 100, ...), Header(Partition(1), 0, ...)

We would end up with:

Header(Partition(1), 0, ...), Header(Partition(0), 100, ...)

The inconsistency between the partition index and the index in the array could cause bugs.

I think the correct way of handling this is to turn two partitions [100, 200) [0, 100) into:

RangeSplit(OrderedPartition(0), [0, 100)) RangeSplit(OrderedPartition(1), [100, 200))

Instead of:

RangeSplit(OrderedPartition(1), [0, 100)) RangeSplit(OrderedPartition(0), [100, 200))

@jm771 jm771 (Author) Jul 13, 2017

I think we're OK. We start with:
headers = Header(Partition(0), 100, ...), Header(Partition(1), 0, ...)

sortedHeaders = Header(Partition(1), 0, ...), Header(Partition(0), 100, ...)

Further down we get:
val normalizedRanges = normalisationStrategy.normalise(sortedHeaders) = [0, 100) [100, 200)
since, with my change in place, the normalisation strategies don't look at the partition index.

We then return our RangeDependencies from this method:
RangeDependency[K, P](idx, normalizedRange, parents)
where idx is the index from the zipWithIndex.

Which gives us:
RangeDependency(0, [0, 100), (Partition(1))), RangeDependency(1, [100, 200), (Partition(0)))

So at this point we have successfully re-indexed and the rest of the program can continue happily. And indeed, further up the stack in Conversion.scala, we build our RangeSplit off the index of the RangeDependency:
d => RangeSplit(OrderedRDDPartition(d.index).asInstanceOf[Partition], d.range)

which gives us
RangeSplit(OrderedRDDPartition(0), [0, 100)) RangeSplit(OrderedRDDPartition(1), [100, 200))

Just as we want.
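
To make that walkthrough concrete, here is a toy version of it in plain Scala. The case classes below are simplified stand-ins for Header / RangeDependency / RangeSplit, not the real Flint types; the point is only that the new indices come from the sorted position, not from the original partition index:

```scala
// Simplified stand-ins for the Flint types discussed above (not the real classes).
case class Header(partitionIndex: Int, firstKey: Long)
case class Range(begin: Long, end: Long) // represents [begin, end)
case class RangeDependency(index: Int, range: Range, parentPartitionIndex: Int)
case class RangeSplit(orderedPartitionIndex: Int, range: Range)

// Partitions were written to disk out of key order.
val headers = Seq(Header(0, 100L), Header(1, 0L))

// Sort by key (what the new HeaderOrdering does), then re-index via zipWithIndex.
val sortedHeaders = headers.sortBy(_.firstKey)
val normalizedRanges = Seq(Range(0L, 100L), Range(100L, 200L)) // from the normalisation step
val dependencies = sortedHeaders.zip(normalizedRanges).zipWithIndex.map {
  case ((h, range), idx) => RangeDependency(idx, range, h.partitionIndex)
}
// dependencies == Seq(RangeDependency(0, Range(0, 100), 1),
//                     RangeDependency(1, Range(100, 200), 0))

// Conversion.scala builds each split off the *new* index, so the resulting
// OrderedRDD's partition indexes increase with the data:
val splits = dependencies.map(d => RangeSplit(d.index, d.range))
// splits == Seq(RangeSplit(0, Range(0, 100)), RangeSplit(1, Range(100, 200)))
```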

Or is your claim that we need to remap the partitions before we pass them into the normalisation strategy? I feel like the normalisation strategies should rely only on data from the headers and not on information about the partitions themselves.

@jm771 jm771 (Author) Jul 13, 2017

Actually, it's much better to write a test than to try to verify this by code inspection.
I've written:
"fromSortedRDD" should "sort partitions, and have partition indexes increasing"

@dgrnbrg dgrnbrg commented Oct 1, 2019

Hey @icexelloss, if I wanted to revive and finish this PR, I see that you had a comment about which keys are sorted. Is that, as far as you know, the only blocker to this PR landing?
