Use takeWhile method from Range #720

Status: Open · wants to merge 1 commit into main

Conversation

@EnverOsmanov (Collaborator) commented Mar 12, 2023

The symptoms:
I have a file with ~1 million rows and 125 columns. Counting its lines takes ~12 seconds with spark-excel's API V1 and ~2 minutes with API V2.

The issue:

  1. Range does not have its own optimized filter method, so it falls back to the one from TraversableLike, which iterates over every number in the range.
  2. r.getLastCellNum is evaluated for each number in the range.

Here are some rough benchmarks with another file:
filter => 50 seconds
val lastCellNum => 38 seconds
withFilter => 20 seconds
takeWhile => 12 seconds
API V1 => 12 seconds
(File taken from here and manually converted to "xlsx")
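
Roughly, the benchmarked variants correspond to shapes like the following (a sketch against a POI Row; the function names are assumptions for illustration, not the actual spark-excel code):

import org.apache.poi.ss.usermodel.Row

def cellsFilter(r: Row, colInd: Range) =
  colInd.filter(_ < r.getLastCellNum)      // getLastCellNum re-evaluated for every index

def cellsHoisted(r: Row, colInd: Range) = {
  val lastCellNum = r.getLastCellNum       // evaluated once per row
  colInd.filter(_ < lastCellNum)
}

def cellsTakeWhile(r: Row, colInd: Range) = {
  val lastCellNum = r.getLastCellNum
  colInd.takeWhile(_ < lastCellNum)        // stops at the first index >= lastCellNum
}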

PS. API V2 seems great! :)

@nightscape (Collaborator)

Hi @EnverOsmanov, thanks for the PR!
I'm slightly worried that .takeWhile changes the semantics: it stops after the first non-matching index.
At the moment this doesn't matter because we're using a Range where the colInds are sorted, but should someone change this to an unsorted Seq[Int] it would break silently.
Of course the real performance benefit outweighs the maybe-in-some-not-too-likely-future-scenario breakage, but if you find a fast version with identical semantics, that would be nicer.
Could you maybe try the following?

val lastCellNum = r.getLastCellNum
colInd
  .iterator
  .filter(_ < lastCellNum)
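
To make the semantic difference concrete (a toy example, not spark-excel code):

val lastCellNum = 5

(0 to 8).filter(_ < lastCellNum)      // Vector(0, 1, 2, 3, 4)
(0 to 8).takeWhile(_ < lastCellNum)   // same elements, because the Range is sorted

Seq(3, 7, 1, 4).filter(_ < lastCellNum)     // List(3, 1, 4)
Seq(3, 7, 1, 4).takeWhile(_ < lastCellNum)  // List(3), silently drops 1 and 4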

@EnverOsmanov (Collaborator, Author)

If colInd were an unsorted Seq[Int] and some columns were missing, it should break the test cases.

Benchmarks:
.iterator.filter => 31 seconds
.view.filter => 2 minutes

Here is the code I use to read the data.

Btw, I just checked the content of colInd for the file, and it is "1 to 16383" while lastCellNum is 23. The reason for the big range is that I specified only the starting cell in "dataAddress".
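
For context, the read looks roughly like this (a sketch assuming a SparkSession named spark; the sheet name and file path are placeholders, and the actual options are in the linked code):

val df = spark.read
  .format("excel")                        // spark-excel API V2 data source
  .option("dataAddress", "'Sheet1'!A2")   // only a starting cell, so colInd spans the full sheet width
  .option("header", "false")
  .load("/path/to/big-file.xlsx")

df.count()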

@EnverOsmanov (Collaborator, Author)

An alternative approach that avoids iterating over the full colInd is the one from API V1:

.map(_.cellIterator().asScala.filter(c => colInd.contains(c.getColumnIndex)).toVector)

But I'm not exactly sure what the idea behind the change in V2 was.

@nightscape (Collaborator)

Hmm, maybe it is to be able to do the following:

r.getCell(_, MissingCellPolicy.CREATE_NULL_AS_BLANK)
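
For illustration, the practical difference between the two styles on a POI Row (a sketch with assumed helper names, not the actual spark-excel code; converters assume Scala 2.13):

import scala.jdk.CollectionConverters._
import org.apache.poi.ss.usermodel.{Cell, Row}

// cellIterator only visits cells that physically exist in the row,
// so missing cells are silently skipped:
def viaCellIterator(r: Row, colInd: Seq[Int]): Vector[Cell] =
  r.cellIterator().asScala.filter(c => colInd.contains(c.getColumnIndex)).toVector

// getCell with CREATE_NULL_AS_BLANK returns a blank Cell for every requested
// index, so every row yields exactly one value per requested column:
def viaGetCell(r: Row, colInd: Seq[Int]): Seq[Cell] =
  colInd.map(r.getCell(_, Row.MissingCellPolicy.CREATE_NULL_AS_BLANK))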

@quanghgx could you chime in here?

@EnverOsmanov (Collaborator, Author) commented Mar 16, 2023

If colInd were an unsorted Seq[Int], we should sort it once. Otherwise we would be filtering the full collection for each row.
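
For example (a hedged sketch; extractCells, rows, and colInd are hypothetical names, not the actual spark-excel code):

import org.apache.poi.ss.usermodel.{Cell, Row}

// Sort colInd once, outside the per-row loop, then reuse it for every row.
def extractCells(rows: Iterator[Row], colInd: Seq[Int]): Iterator[Seq[Cell]] = {
  val sortedColInd = colInd.sorted          // effectively free when colInd is already a sorted Range
  rows.map { r =>
    val lastCellNum = r.getLastCellNum
    sortedColInd
      .takeWhile(_ < lastCellNum)           // safe because the indices are sorted
      .map(r.getCell(_, Row.MissingCellPolicy.CREATE_NULL_AS_BLANK))
  }
}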
