Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mmap files when possible to improve CLI parse performance #1274

Merged

Conversation

stevedlawrence
Copy link
Member

Daffodil currently supports two different input sources: a BucketingInputSource backed by an InputStream and ByteBufferInputSource backed by a ByteBuffer. The CLI currntly always uses the BucketingInputSource because the ByteBufferInputSource does not support stdin or files larger than 2GB. Although the gap is closing due to other optimizations, the BucketingInputSource still has overhead compared to ByteBufferInputSource due to added complexity.

This changes the CLI logic to use a ByteBufferInputSource where possible (parsing files <= 2GB) using mmap and a MappedByteBuffer to efficiently create a ByteBuffer.

Basic testing shows about a 5% increase over the BucketingInputSource for a large file with many small reads.

DAFFODIL-2921

Copy link
Contributor

@pkatlic pkatlic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

@mbeckerle mbeckerle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

I just have some questions for discussion. As is the change is an improvement certainly.

// the BucketingInputSource. Larger files cannot be mapped so we cannot avoid it
val path = Paths.get(file)
val size = Files.size(path)
if (size <= Int.MaxValue) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc for the map method says:

For most operating systems, mapping a file into memory is more expensive than reading or writing a 
few tens of kilobytes of data via the usual read

So should there be a floor check also e.g., below some size we just read it into a byte buffer and avoid the map?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly. Though, I imagine if you're parsing a small file with the CLI then the overhead of mmap is going to be relatively small compared to the overhead of starting up a JVM and maybe the it won't make a difference? I'm not sure. We can do some experiments to see if there's a benefit for smaller files.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could leave this as you have it, and we can look at nightly performance stuff to see if it slows down noticably. Lots of those will use files that are small.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nightlies don't use the parse command so won't see any change. They use the performance command which reads test files into a byte array before testing to avoid overhead related to disk I/O.

We could create some patches that run on the nightlies, one patch change the performance command to use FileInputStream and one to use a MappedByteBuffer, which would give us an idea of mmap vs file input stream. But that's feels like a decent amount of work just to figure out an optimal size where mmap overhead > bucketing overhead. Also, based on my bucketing vs non-bucketing tests, I feel like bucketing overhead is probably more than mmap-overhead, even with small files and so we should always avoid bucketing when possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. Then I suggest merge as is, and we worry (or not) about this minor issue later if it comes up.

val size = Files.size(path)
if (size <= Int.MaxValue) {
val fc = FileChannel.open(path, StandardOpenOption.READ)
val bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in the CLI. Could this be done inside the API so that all applications benefit from it?

E.g, InputSourceDataInputStream(is) analyzes the input stream to see if it is a file and of the needed size?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but I'm a little hesitant to force something on a API user if we can't say for sure it will be faster in 100% of cases, especially if there are cases where it could be slower (e.g. like with small files you mentioned).

Maybe an alternative might be to instead just provide better API documentation, maybe something like:

The InputStream variant has potential overhead due to streaming capabilities and support for unlimited data sizes. In some cases, better performance might come from using the ByteBuffer variant instead. For example, if your data is already in a byte array, one should use the Array[Byte] or ByteBuffer variants instead of wrapping it in a ByteArrayInputStream. As another example, instead of using a FileInputStream one could consider mapping the File to a MappedByteBuffer, keeping in mind that MappedByteBuffers might have different performance characteristics depending on the file size and system.

And then we leave it up to the API users to figure out what works best for their system/environment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add to that comment a link to the code lines in the CLI as an illustrative example of how to do it, or just put an example in the javadoc, and I agree that would be sufficient.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Daffodil currently supports two different input sources: a
BucketingInputSource backed by an InputStream and ByteBufferInputSource
backed by a ByteBuffer. The CLI currntly always uses the
BucketingInputSource because the ByteBufferInputSource does not support
stdin or files larger than 2GB. Although the gap is closing due to other
optimizations, the BucketingInputSource still has overhead compared to
ByteBufferInputSource due to added complexity.

This changes the CLI logic to use a ByteBufferInputSource where possible
(parsing files <= 2GB) using mmap and a MappedByteBuffer to efficiently
create a ByteBuffer.

Basic testing shows about a 5% increase over the BucketingInputSource
for a large file with many small reads.

Also add Java/Scala API documentation explaining performance
characterisics of the different input source construtors and example
code for using mmap vs FileInputStream.

DAFFODIL-2921
@stevedlawrence stevedlawrence force-pushed the daffodil-2921-cli-mmap-bytebuffer branch from d6ef2ca to 6ffcdb8 Compare August 21, 2024 16:15
@stevedlawrence stevedlawrence merged commit 67eef7e into apache:main Aug 21, 2024
11 checks passed
@stevedlawrence stevedlawrence deleted the daffodil-2921-cli-mmap-bytebuffer branch August 21, 2024 16:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants