Use takeWhile method from Range #720

Status: Open · wants to merge 1 commit into main

Conversation

@EnverOsmanov (Collaborator) commented Mar 12, 2023

The symptoms:
I have a file with ~1 million rows and 125 columns. Counting its lines takes ~12 seconds with spark-excel's API V1 and ~2 minutes with API V2.

The issue:

  1. Range does not have its own optimized filter method, so it falls back to the one from TraversableLike, which iterates over every number in the range.
  2. r.getLastCellNum is evaluated for each number in the range.

Here are some rough benchmarks with another file:
filter => 50 seconds
val lastCellNum => 38 seconds
withFilter => 20 seconds
takeWhile => 12 seconds
API V1 => 12 seconds
(File taken from here and manually converted to "xlsx")
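
Roughly, the benchmarked variants correspond to shapes like the following (a sketch against a POI Row; the function names are assumptions for illustration, not the actual spark-excel code):

import org.apache.poi.ss.usermodel.Row

def cellsFilter(r: Row, colInd: Range) =
  colInd.filter(_ < r.getLastCellNum)      // getLastCellNum re-evaluated for every index

def cellsHoisted(r: Row, colInd: Range) = {
  val lastCellNum = r.getLastCellNum       // evaluated once per row
  colInd.filter(_ < lastCellNum)
}

def cellsTakeWhile(r: Row, colInd: Range) = {
  val lastCellNum = r.getLastCellNum
  colInd.takeWhile(_ < lastCellNum)        // stops at the first index >= lastCellNum
}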

PS. API V2 seems great! :)

@nightscape (Collaborator)

Hi @EnverOsmanov, thanks for the PR!
I'm slightly worried that .takeWhile changes the semantics: it stops after the first non-matching index.
At the moment this doesn't matter because we're using a Range where the colInds are sorted, but should someone change this to an unsorted Seq[Int] it would break silently.
Of course the real performance benefit outweighs the maybe-in-some-not-too-likely-future-scenario breakage, but if you find a fast version with identical semantics, that would be nicer.
Could you maybe try the following?

val lastCellNum = r.getLastCellNum
colInd
  .iterator
  .filter(_ < lastCellNum)
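
To make the semantic difference concrete (a toy example, not spark-excel code):

val lastCellNum = 5

(0 to 8).filter(_ < lastCellNum)      // Vector(0, 1, 2, 3, 4)
(0 to 8).takeWhile(_ < lastCellNum)   // same elements, because the Range is sorted

Seq(3, 7, 1, 4).filter(_ < lastCellNum)     // List(3, 1, 4)
Seq(3, 7, 1, 4).takeWhile(_ < lastCellNum)  // List(3), silently drops 1 and 4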

@EnverOsmanov (Collaborator, Author)

If colInd were an unsorted Seq[Int] and some columns were missing, it should break the test cases.

Benchmarks:
.iterator.filter => 31 seconds
.view.filter => 2 minutes

Here is the code I use to read the data.

Btw, I just checked the content of colInd for the file, and it is "1 to 16383" while lastCellNum is 23. The reason for the big range is that I specified only the starting cell in "dataAddress".
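
For context, the read looks roughly like this (a sketch assuming a SparkSession named spark; the sheet name and file path are placeholders, and the actual options are in the linked code):

val df = spark.read
  .format("excel")                        // spark-excel API V2 data source
  .option("dataAddress", "'Sheet1'!A2")   // only a starting cell, so colInd spans the full sheet width
  .option("header", "false")
  .load("/path/to/big-file.xlsx")

df.count()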

@EnverOsmanov (Collaborator, Author)

An alternative approach that avoids iterating over the full colInd is the one from API V1:

.map(_.cellIterator().asScala.filter(c => colInd.contains(c.getColumnIndex)).toVector)

But I'm not exactly sure what the idea behind the change in V2 was.

@nightscape (Collaborator)

Hmm, maybe it is to be able to do the following:

r.getCell(_, MissingCellPolicy.CREATE_NULL_AS_BLANK)
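
For illustration, the practical difference between the two styles on a POI Row (a sketch with assumed helper names, not the actual spark-excel code; converters assume Scala 2.13):

import scala.jdk.CollectionConverters._
import org.apache.poi.ss.usermodel.{Cell, Row}

// cellIterator only visits cells that physically exist in the row,
// so missing cells are silently skipped:
def viaCellIterator(r: Row, colInd: Seq[Int]): Vector[Cell] =
  r.cellIterator().asScala.filter(c => colInd.contains(c.getColumnIndex)).toVector

// getCell with CREATE_NULL_AS_BLANK returns a blank Cell for every requested
// index, so every row yields exactly one value per requested column:
def viaGetCell(r: Row, colInd: Seq[Int]): Seq[Cell] =
  colInd.map(r.getCell(_, Row.MissingCellPolicy.CREATE_NULL_AS_BLANK))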

@quanghgx could you chime in here?

@EnverOsmanov (Collaborator, Author) commented Mar 16, 2023

If colInd were an unsorted Seq[Int], we should sort it once. Otherwise we would be filtering the full collection for each row.
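
For example (a hedged sketch; extractCells, rows, and colInd are hypothetical names, not the actual spark-excel code):

import org.apache.poi.ss.usermodel.{Cell, Row}

// Sort colInd once, outside the per-row loop, then reuse it for every row.
def extractCells(rows: Iterator[Row], colInd: Seq[Int]): Iterator[Seq[Cell]] = {
  val sortedColInd = colInd.sorted          // effectively free when colInd is already a sorted Range
  rows.map { r =>
    val lastCellNum = r.getLastCellNum
    sortedColInd
      .takeWhile(_ < lastCellNum)           // safe because the indices are sorted
      .map(r.getCell(_, Row.MissingCellPolicy.CREATE_NULL_AS_BLANK))
  }
}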
