Improve performance of index cleanup: use readdir(3), not access(2) #2819

Closed
wants to merge 1 commit into from


dcolascione
Contributor

This change makes index cleanup ~4x faster by changing how we determine whether a file mentioned by the database still exists on disk. Previously, we'd call access(2) for each file the database mentioned. Doing so produced a lot of system call overhead. Now, we read the directory entries of the directories containing the files whose existence we're checking, build a hash table from what we find, then do the existence check against this hash table instead of entering the kernel.
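
For readers who want the idea in miniature, here is a rough Python sketch of the approach, not the code in this PR. The names `DirListingCache` and `find_stale` are made up for illustration. It assumes `os.scandir` stands in for readdir(3) (on POSIX, CPython implements it on top of the directory-reading calls) and that a `frozenset` plays the role of the hash table.

```python
import os


class DirListingCache:
    """Answers "does this path exist?" from cached directory listings.

    Hypothetical helper for illustration only: each directory is listed
    once, and later checks are set lookups that never enter the kernel.
    """

    def __init__(self):
        # Maps a directory path to the frozenset of entry names found in it.
        self._entries = {}

    def _names(self, directory):
        if directory not in self._entries:
            try:
                with os.scandir(directory) as it:
                    self._entries[directory] = frozenset(e.name for e in it)
            except OSError:
                # A missing or unlistable directory contributes no entries.
                self._entries[directory] = frozenset()
        return self._entries[directory]

    def exists(self, path):
        directory, name = os.path.split(path)
        return name in self._names(directory or ".")


def find_stale(db_paths):
    """Return the paths from db_paths that no longer appear on disk."""
    cache = DirListingCache()
    return [p for p in db_paths if not cache.exists(p)]
```

The cost moves from one system call per message to a handful per directory, which is roughly where the drop in `sys` time below would come from. The listing is a snapshot, so a file created or removed after its directory was read is not noticed within the same run.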

The semantics of the cleanup check do change subtly, however. Previously, we checked whether the mentioned file was readable. Now we check merely that it exists. Extant but unreadable files in maildirs should be rare.
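
A small Python sketch of that difference, under the assumption that the old check amounted to `access(path, R_OK)`. The helper names `readable`, `listed`, and `compare_checks` are hypothetical.

```python
import os
import tempfile


def readable(path):
    # Old-style check (assumed): one access(2) call asking for read permission.
    return os.access(path, os.R_OK)


def listed(path):
    # New-style check: the name merely appears in its directory's listing.
    directory, name = os.path.split(path)
    return name in os.listdir(directory or ".")


def compare_checks():
    """Create a file with no permission bits and run both checks on it."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "msg")
        open(path, "w").close()
        os.chmod(path, 0)  # the file still exists but grants no access
        # Note: when run as root, access(2) with R_OK still succeeds.
        return readable(path), listed(path)
```

For an unprivileged user the two checks disagree on such a file: it is listed but not readable, so the old code would have cleaned it up and the new code keeps it.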

BEFORE:

$ time mu index --lazy-check
lazily indexing maildir /home/dancol/Mail -> store /home/dancol/.cache/mu/xapian
/ indexing messages; checked: 0; updated/new: 0; cleaned-up: 0

real 0m19.310s
user 0m1.803s
sys 0m12.999s

AFTER:

$ time mu --debug index --lazy-check
lazily indexing maildir /home/dancol/Mail -> store /home/dancol/.cache/mu/xapian
- indexing messages; checked: 0; updated/new: 0; cleaned-up: 0

real 0m4.584s
user 0m2.433s
sys 0m2.133s

Owner

djcb commented Feb 23, 2025

Thanks! Merged locally / pushed.

@djcb djcb closed this Feb 23, 2025