Performance in load.corpus #36

Open
adunmore opened this issue Feb 26, 2020 · 2 comments

@adunmore
Collaborator

I am investigating performance problems in load.corpus. I think that performance could be improved significantly by replacing scan with another approach to loading files.

This flame graph from profiling load.corpus shows that most of the run time is accounted for by scan:

[Image: flame graph from profiling load.corpus]

I ran a benchmark comparing the call to scan in load.corpus with two other functions for reading text files, readChar and readLines.
[Image: benchmark results for scan, readChar, and readLines]
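
For anyone who wants to reproduce a comparison like this, here is a minimal sketch using only base R. The test file, the helper names (read_scan, read_lines, read_char, time_reader), and the exact arguments passed to scan are assumptions for illustration; the actual call in load.corpus may differ, and timings will vary by machine and corpus.

```r
# Hypothetical test file: 20,000 short lines written to a temporary path.
path <- tempfile(fileext = ".txt")
writeLines(rep("the quick brown fox jumps over the lazy dog", 20000), path)

# Assumed form of the scan call; the real arguments in load.corpus may differ.
read_scan <- function(p) scan(p, what = "char", sep = "\n", quiet = TRUE)
read_lines <- function(p) readLines(p, warn = FALSE)
# readChar needs the number of bytes to read, taken here from file.info().
read_char <- function(p) readChar(p, nchars = file.info(p)$size, useBytes = TRUE)

# Time n repetitions of a reader; "elapsed" is wall-clock seconds.
time_reader <- function(f, n = 20) {
  system.time(for (i in seq_len(n)) f(path))[["elapsed"]]
}

timings <- c(scan = time_reader(read_scan),
             readLines = time_reader(read_lines),
             readChar = time_reader(read_char))
print(timings)
```

Note that scan and readLines return one element per line, while readChar returns a single string, so the three readers are not doing identical work.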

readChar reads the same file in ~10% of the time scan takes. However, whereas the current approach returns each text split on '\n', readChar returns each file's contents as a single string.
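
To illustrate the difference in return shape, here is a small sketch, again with a hypothetical temporary file and an assumed form of the scan call. It also shows that the line-split form can be recovered from the single string with strsplit if a downstream function turns out to need it.

```r
# Hypothetical three-line file for illustration.
path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), path)

# Whole file as one string; nchars is given in bytes via file.info().
whole <- readChar(path, nchars = file.info(path)$size, useBytes = TRUE)

# Assumed form of the current approach: one element per line.
lines_from_scan <- scan(path, what = "char", sep = "\n", quiet = TRUE)

# Recover the per-line vector from the single string.
# The optional "\r" also covers Windows-style line endings.
lines_from_char <- strsplit(whole, "\r?\n")[[1]]

print(length(whole))            # one string for the whole file
print(length(lines_from_scan))  # one element per line
print(identical(lines_from_char, lines_from_scan))
```

One caveat: scan skips blank lines by default (blank.lines.skip = TRUE), whereas strsplit keeps them as empty strings, so the two results can differ for files that contain empty lines.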

Do the downstream text processing functions (make.samples, txt.to.words.ext, delete.markup, etc.) require each text to be split into lines? If they do, maybe we could modify txt.to.words.ext or another downstream function to handle that step in one of the tokenization loops that already occur.

@adunmore
Collaborator Author

Replaced scan with readChar in a8bf057 on /experimental. load.corpus now runs almost instantly on my 1000-item corpus, and load.corpus.and.parse completes without errors. This still warrants more investigation to make sure the approach is compatible with the downstream functions.

@adunmore
Collaborator Author

adunmore commented Mar 4, 2020

Both delete.markup and txt.to.words can accept individual texts as whole strings, so the existing code is compatible with my approach in a8bf057.
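
As a rough illustration of why a whole-string input can give the same tokens as a line-split one, here is a sketch with a hypothetical stand-in tokenizer. simple_tokenize is not stylo's txt.to.words and only approximates the idea of splitting on non-letter characters, where a newline acts as just another separator.

```r
# Hypothetical stand-in tokenizer; NOT the implementation of txt.to.words.
simple_tokenize <- function(text) {
  # Lowercase, split on runs of non-letter characters, drop empty tokens.
  words <- unlist(strsplit(tolower(text), "[^[:alpha:]]+"))
  words[nchar(words) > 0]
}

# The same text as a vector of lines and as one newline-joined string.
as_lines <- c("The quick brown fox", "jumps over the lazy dog")
as_string <- paste(as_lines, collapse = "\n")

words_from_lines <- simple_tokenize(as_lines)
words_from_string <- simple_tokenize(as_string)
print(identical(words_from_lines, words_from_string))
```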

I think this code is ready to be merged with the main branch.

@adunmore adunmore mentioned this issue Mar 4, 2020