Improve performance of index cleanup: use readdir(3), not access(2) #2819
This change makes index cleanup ~4x faster by changing how we determine whether a file mentioned by the database still exists on disk. Previously, we'd call access(2) for each file the database mentioned. Doing so produced a lot of system call overhead. Now, we read the directory entries of the directories containing the files whose existence we're checking, build a hash table from what we find, then do the existence check against this hash table instead of entering the kernel.
The semantics of the cleanup check do change subtly, however. Previously, we checked whether the mentioned file was readable. Now we check merely that it exists, so a file that exists but is unreadable now stays in the database instead of being cleaned up. Extant but unreadable files in maildirs should be rare.
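For readers who want to see the shape of the idea, here is a minimal sketch of a directory-listing existence check. It is not the code in this change: the `DirCache` class and its `exists` and `entries` members are hypothetical names made up for illustration, and it assumes a POSIX system with `opendir(3)` and `readdir(3)`.

```cpp
#include <dirent.h>

#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>

// Hypothetical helper, not mu's actual code: remembers the entry names of
// each directory it has listed, so repeated existence checks in the same
// directory do not need one system call per file.
class DirCache {
public:
    // True if the last component of `path` is listed in its parent directory.
    // This tests presence only; it says nothing about readability.
    bool exists(const std::string& path) {
        const auto slash = path.rfind('/');
        if (slash == std::string::npos)
            return entries(".").count(path) != 0;
        const std::string dir = slash == 0 ? "/" : path.substr(0, slash);
        return entries(dir).count(path.substr(slash + 1)) != 0;
    }

private:
    using NameSet = std::unordered_set<std::string>;

    // Lists `dir` on first use and caches the result. A directory that
    // cannot be opened is cached as empty, so its files count as missing.
    const NameSet& entries(const std::string& dir) {
        auto it = cache_.find(dir);
        if (it != cache_.end())
            return it->second;
        NameSet names;
        if (DIR* d = opendir(dir.c_str())) {
            while (const dirent* e = readdir(d))
                names.insert(e->d_name);
            closedir(d);
        }
        return cache_.emplace(dir, std::move(names)).first->second;
    }

    std::unordered_map<std::string, NameSet> cache_;
};
```

With a cache like this, checking N files spread over D directories costs roughly D directory listings instead of N calls to access(2). A maildir holds many messages per directory, so D is much smaller than N.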
BEFORE:
$ time mu index --lazy-check
lazily indexing maildir /home/dancol/Mail -> store /home/dancol/.cache/mu/xapian / indexing messages; checked: 0; updated/new: 0; cleaned-up: 0
real 0m19.310s
user 0m1.803s
sys 0m12.999s
AFTER:
$ time mu --debug index --lazy-check
lazily indexing maildir /home/dancol/Mail -> store /home/dancol/.cache/mu/xapian
real 0m4.584s
user 0m2.433s
sys 0m2.133s