mmap files when possible to improve CLI parse performance #1274
Conversation
+1
+1
I just have some questions for discussion. As is, the change is certainly an improvement.
// the BucketingInputSource. Larger files cannot be mapped so we cannot avoid it
val path = Paths.get(file)
val size = Files.size(path)
if (size <= Int.MaxValue) {
The doc for the map method says: "For most operating systems, mapping a file into memory is more expensive than reading or writing a few tens of kilobytes of data via the usual read..."

So should there also be a floor check, e.g., below some size we just read the file into a byte buffer and avoid the map?
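The floor check suggested above could look something like the following sketch. The `MmapFloor` value is a hypothetical placeholder, not a measured crossover point; the actual threshold where mmap setup cost exceeds a plain read is system dependent and would need benchmarking:

```scala
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.{Files, Path, StandardOpenOption}

// Hypothetical threshold: below this, a plain read is assumed cheaper than mmap.
val MmapFloor: Long = 64 * 1024

def fileToByteBuffer(path: Path): ByteBuffer = {
  val size = Files.size(path)
  require(size <= Int.MaxValue, "file too large for a single ByteBuffer")
  if (size < MmapFloor) {
    // Small file: one read into a heap buffer avoids the mmap setup cost
    ByteBuffer.wrap(Files.readAllBytes(path))
  } else {
    // Large file: map it; the mapping stays valid after the channel is closed
    val fc = FileChannel.open(path, StandardOpenOption.READ)
    try fc.map(FileChannel.MapMode.READ_ONLY, 0, size)
    finally fc.close()
  }
}
```

Either branch yields a read-only view of the whole file, so the caller does not need to know which path was taken.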
Possibly. Though I imagine if you're parsing a small file with the CLI, the overhead of mmap is going to be relatively small compared to the overhead of starting up a JVM, so maybe it won't make a difference? I'm not sure. We can do some experiments to see if there's a benefit for smaller files.
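A rough sketch of such an experiment, comparing a plain read against mapping for a single file. This is only a crude timing harness (no warmup or JIT control; a tool like JMH would give more trustworthy numbers):

```scala
import java.nio.channels.FileChannel
import java.nio.file.{Files, Path, StandardOpenOption}

// Time a single evaluation of `body` in nanoseconds
def timeNanos[A](body: => A): Long = {
  val t0 = System.nanoTime(); body; System.nanoTime() - t0
}

// Returns (plain-read nanos, mmap nanos) for one file; run many times
// over files of varying sizes to estimate where the crossover lies
def compareReadVsMap(path: Path): (Long, Long) = {
  val readNs = timeNanos(Files.readAllBytes(path))
  val mapNs = timeNanos {
    val fc = FileChannel.open(path, StandardOpenOption.READ)
    try fc.map(FileChannel.MapMode.READ_ONLY, 0, Files.size(path))
    finally fc.close()
  }
  (readNs, mapNs)
}
```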
Could leave this as you have it, and we can look at nightly performance runs to see if it slows down noticeably. Lots of those use files that are small.
The nightlies don't use the parse command so won't see any change. They use the performance command, which reads test files into a byte array before testing to avoid overhead related to disk I/O.
We could create some patches that run on the nightlies: one changing the performance command to use a FileInputStream and one to use a MappedByteBuffer, which would give us an idea of mmap vs. file input stream. But that feels like a decent amount of work just to figure out an optimal size where mmap overhead > bucketing overhead. Also, based on my bucketing vs. non-bucketing tests, I suspect bucketing overhead is probably more than mmap overhead even with small files, so we should always avoid bucketing when possible.
OK. Then I suggest we merge as is, and we worry (or not) about this minor issue later if it comes up.
val size = Files.size(path)
if (size <= Int.MaxValue) {
  val fc = FileChannel.open(path, StandardOpenOption.READ)
  val bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, size)
This is in the CLI. Could this be done inside the API so that all applications benefit from it?
E.g., InputSourceDataInputStream(is) analyzes the input stream to see if it is a file of the needed size?
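The detection suggested here could be sketched as below. The helper name `chooseBuffer` is hypothetical and the wiring into InputSourceDataInputStream is not shown; the idea is just that a FileInputStream exposes its FileChannel, which can be sized and mapped:

```scala
import java.io.{FileInputStream, InputStream}
import java.nio.ByteBuffer
import java.nio.channels.FileChannel

// Hypothetical helper: if the stream is a FileInputStream whose file fits
// in a single ByteBuffer, map it; otherwise signal that streaming is needed.
def chooseBuffer(is: InputStream): Option[ByteBuffer] = is match {
  case fis: FileInputStream =>
    val fc = fis.getChannel
    val size = fc.size
    if (size <= Int.MaxValue)
      Some(fc.map(FileChannel.MapMode.READ_ONLY, 0, size))
    else
      None // > 2GB: cannot be mapped into a single ByteBuffer
  case _ =>
    None // not file-backed (e.g. stdin): must fall back to streaming
}
```

On a `Some`, the caller could construct the ByteBuffer-backed input source; on a `None`, it would fall back to the streaming one.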
We could, but I'm a little hesitant to force something on API users if we can't say for sure it will be faster in 100% of cases, especially if there are cases where it could be slower (e.g., the small files you mentioned).
An alternative might be to instead just provide better API documentation, something like:
The InputStream variant has potential overhead due to streaming capabilities and support for unlimited data sizes. In some cases, better performance might come from using the ByteBuffer variant instead. For example, if your data is already in a byte array, one should use the Array[Byte] or ByteBuffer variants instead of wrapping it in a ByteArrayInputStream. As another example, instead of using a FileInputStream one could consider mapping the File to a MappedByteBuffer, keeping in mind that MappedByteBuffers might have different performance characteristics depending on the file size and system.
And then we leave it up to the API users to figure out what works best for their system/environment?
Add to that comment a link to the code lines in the CLI as an illustrative example of how to do it, or just put an example in the javadoc, and I agree that would be sufficient.
Done.
Daffodil currently supports two different input sources: a BucketingInputSource backed by an InputStream, and a ByteBufferInputSource backed by a ByteBuffer. The CLI currently always uses the BucketingInputSource because the ByteBufferInputSource does not support stdin or files larger than 2GB. Although the gap is closing due to other optimizations, the BucketingInputSource still has overhead compared to the ByteBufferInputSource due to added complexity.

This changes the CLI logic to use a ByteBufferInputSource where possible (parsing files <= 2GB), using mmap and a MappedByteBuffer to efficiently create a ByteBuffer. Basic testing shows about a 5% performance improvement over the BucketingInputSource for a large file with many small reads.

Also adds Java/Scala API documentation explaining the performance characteristics of the different input source constructors, with example code for using mmap vs. FileInputStream.

DAFFODIL-2921
Force-pushed from d6ef2ca to 6ffcdb8