Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sourceDirectoryDeep False could be much faster on Linux for large directories #459

Open
nh2 opened this issue Feb 19, 2021 · 0 comments
Open

Comments

@nh2
Copy link

nh2 commented Feb 19, 2021

Just writing this down somewhere, with a suggestion for the solution:

sourceDirectoryDeep unnecessarily runs the stat() syscall on every file, thus making it slow on large directories or directory trees. For example, it takes 30 minutes on my server to run it on a directory tree with 8M files. find does it in 30 seconds. sourceDirectory does not have this problem, but cannot traverse into subdirs. On Linux, the problem can be observed with strace.

If it didn't stat(), only using getdents64() to list the directory contents (which retrieves multiple file names in a batch), it would be at least 60 times faster.

The stat() happens in the call to getFileType here:

ft <- liftIO $ F.getFileType fp

It does that to distinguish normal files from symlinks to implement the followSymlinks :: Bool feature.

However the find UNIX utility also doesn't stat() (unless the -L option is given to follow symlinks); yet it can traverse into subdirectories. How does it do it?

First the call graph:

Now man 3 readdir shows that for one readdir() call, the following information about a single directory entry dirent is returned (behind the scenes, the libc usually requests multiple -- usually 32 KiB -- of those in a batch via the getdents() syscall to be efficient):

struct dirent {
    ...
    unsigned char  d_type;      /* Type of file; not supported
                                   by all filesystem types */
    ...
};

The char d_type already tells us if the directory entry is a subdirectory or a normal file. We do not need to call stat() (via getFileType) to determine that. We only need to do that when we want to know if something is a symlink.

That means sourceDirectoryDeep could be optimised to work like find: Only if followSymlinks = True do we need to getFileType.

Unfortunately, streaming-commons's readDirStream is too simplistic for this purpose: It only returns Maybe FilePath, not propagating whether that FilePath is a directory.

This means that either sourceDirectoryDeep cannot use streaming-commons for that purpose, or streaming-commons needs to add a function that forwards that information when present.

CC @Gabriel439 as you may also have callers to readDirStream. However, I checked some pipes packages, and found that pipes-files by @jwiegley has functionality to skip the stat() mentioned in the Performance section, and pipes-filesystem by @jonathanknowles also has a Performance section that addresses the exact problem I'm mentioning here.

Other new IO packages also have addressed this issue: stdio here and Z-IO here.

Those also have the benefit of not using type FilePath = String, but ByteString based/equivalent IO, which is naturally much lighter on the CPU.

Another small complication is that this works only on Linux; a POSIX dirent does not have the file type, and some more exotic Linux file systems may also not retrieve this information. This can be addressed as the man 3 readdir explains for d_type, checking for DT_UNKNOWN and doing a stat() if encountered:

DT_UNKNOWN  The file type could not be determined.

Currently,  only some filesystems (among them: Btrfs, ext2, ext3, and ext4) have full support for returning the file type in d_type.
All applications must properly handle a return of DT_UNKNOWN.

An implementation of that can be seen in the posix-paths package that I wrote with @JohnLato in 2013:

        isDir <- case () of
            () | typ == dtDir     -> return True
               | typ == dtUnknown -> isDirectory <$> getFileStatus path
               | otherwise        -> return False

@snoyberg Do you have a preference if/how this should be done? Should we make streaming-commons smarter by adding a function there?


Side note: One can be even faster by:

  • not using the default 32 KiB buffer that readdir() gives to getdents(), but using a larger, or configurable buffer. That is described in the old post http://be-n.com/spw/you-can-list-a-million-files-in-a-directory-but-not-with-ls.html. Note that post is slightly confusing because it talks about 32 K entries, when it's really Bytes. This optimisation is mainly useful for networked file systems though, where larger batching reduces roundtrips further. (For medium-long paths one 32 KiB getdents() fits >= 500 files, by the way.)
  • using ByteStrings to save CPU.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant