Just writing this down somewhere, with a suggestion for the solution:
`sourceDirectoryDeep` unnecessarily runs the `stat()` syscall on every file, making it slow on large directories or directory trees. For example, it takes 30 minutes on my server to run it on a directory tree with 8M files; `find` does it in 30 seconds. `sourceDirectory` does not have this problem, but cannot traverse into subdirectories. On Linux, the problem can be observed with `strace`.
If it didn't `stat()`, only using `getdents64()` to list the directory contents (which retrieves multiple file names in a batch), it would be at least 60 times faster.
The `stat()` happens in the call to `getFileType` here: `conduit/src/Data/Conduit/Combinators.hs`, line 737 at commit `0bf72ae`.
It does that to distinguish normal files from symlinks, to implement the `followSymlinks :: Bool` feature.
However, the `find` UNIX utility also doesn't `stat()` (unless the `-L` option is given to follow symlinks), yet it can traverse into subdirectories. How does it do it?

First, the call graph: `sourceDirectory` calls `openDirStream` + `readDirStream` from `streaming-commons`, which call `System.Posix.Directory.openDirStream` + `readDirStream` from `unix`, which call `c_opendir` / `c_readdir`, which are `foreign import`s of `__hscore_opendir` + `__hscore_readdir`, which call the `opendir()` libc function and the `unix` package's wrapper around the libc `readdir()` function.
Now `man 3 readdir` shows that for one `readdir()` call, the following information about a single directory entry `dirent` is returned (behind the scenes, the libc usually requests many of those in a batch -- usually 32 KiB worth -- via the `getdents()` syscall to be efficient):
```c
struct dirent {
    ...
    unsigned char  d_type;  /* Type of file; not supported
                               by all filesystem types */
    ...
};
```
The `unsigned char d_type` field already tells us whether the directory entry is a subdirectory or a normal file. We do not need to call `stat()` (via `getFileType`) to determine that; we only need to do that when we want to know if something is a symlink.
That means `sourceDirectoryDeep` could be optimised to work like `find`: only if `followSymlinks = True` do we need `getFileType`.
Unfortunately, `streaming-commons`'s `readDirStream` is too simplistic for this purpose: it only returns `Maybe FilePath`, not propagating whether that `FilePath` is a directory.
This means that either `sourceDirectoryDeep` cannot use `streaming-commons` for that purpose, or `streaming-commons` needs to add a function that forwards that information when present.
CC @Gabriel439, as you may also have callers of `readDirStream`. I also checked some `pipes` packages: `pipes-files` by @jwiegley has functionality to skip the `stat()`, mentioned in its Performance section, and `pipes-filesystem` by @jonathanknowles also has a Performance section that addresses the exact problem I'm describing here.
Other new IO packages have also addressed this issue: `stdio` and `Z-IO`.
Those also have the benefit of not using `type FilePath = String`, but `ByteString`-based/equivalent IO, which is naturally much lighter on the CPU.
Another small complication is that this works only on Linux: a POSIX `dirent` does not have the file type field, and some more exotic Linux file systems may also not provide this information. This can be addressed as `man 3 readdir` explains for `d_type`: check for `DT_UNKNOWN` and do a `stat()` when it is encountered:
> `DT_UNKNOWN` The file type could not be determined.
>
> Currently, only some filesystems (among them: Btrfs, ext2, ext3, and ext4) have full support for returning the file type in `d_type`. All applications must properly handle a return of `DT_UNKNOWN`.
An implementation of that can be seen in the `posix-paths` package that I wrote with @JohnLato in 2013:

```haskell
isDir <- case () of
  () | typ == dtDir     -> return True
     | typ == dtUnknown -> isDirectory <$> getFileStatus path
     | otherwise        -> return False
```
@snoyberg Do you have a preference if/how this should be done? Should we make `streaming-commons` smarter by adding a function there?
Side note: one can be even faster by:

- not using the default 32 KiB buffer that `readdir()` gives to `getdents()`, but a larger or configurable buffer. That is described in the old post http://be-n.com/spw/you-can-list-a-million-files-in-a-directory-but-not-with-ls.html (note the post is slightly confusing because it talks about 32 K entries, when it's really bytes). This optimisation is mainly useful for networked file systems, where larger batching reduces roundtrips further. (For medium-long paths, one 32 KiB `getdents()` fits >= 500 files, by the way.)
- using `ByteString`s to save CPU.