mmap files when possible to improve CLI parse performance #1274

stevedlawrence · 2024-07-31T17:19:47Z

Daffodil currently supports two different input sources: a BucketingInputSource backed by an InputStream and a ByteBufferInputSource backed by a ByteBuffer. The CLI currently always uses the BucketingInputSource because the ByteBufferInputSource does not support stdin or files larger than 2GB. Although the gap is closing due to other optimizations, the BucketingInputSource still has overhead compared to the ByteBufferInputSource due to its added complexity.

This changes the CLI logic to use a ByteBufferInputSource where possible (parsing files <= 2GB) using mmap and a MappedByteBuffer to efficiently create a ByteBuffer.
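As a rough illustration of the selection logic described above, here is a minimal sketch that uses only java.nio. The `InputChoice` object and its `open` method are hypothetical names, not the code in this PR, and the `Either` result stands in for whichever input source the caller would construct from the stream or the buffer.

```scala
import java.io.InputStream
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.{Files, Path, StandardOpenOption}

object InputChoice {
  // Hypothetical helper, not the actual CLI code.
  // Right(buffer): the file fits in a single mapping, so it is memory-mapped.
  // Left(stream):  stdin or a file too large to map, so it must be streamed.
  def open(file: Option[Path]): Either[InputStream, ByteBuffer] = file match {
    case None =>
      // stdin has no known size and cannot be mapped
      Left(System.in)
    case Some(path) =>
      val size = Files.size(path)
      if (size <= Int.MaxValue) {
        val fc = FileChannel.open(path, StandardOpenOption.READ)
        try {
          // The mapping remains valid after the channel is closed
          Right(fc.map(FileChannel.MapMode.READ_ONLY, 0, size))
        } finally {
          fc.close()
        }
      } else {
        Left(Files.newInputStream(path))
      }
  }
}
```

A caller would pattern match on the result and pass the stream or the buffer to the matching input source constructor.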

Basic testing shows about a 5% performance improvement over the BucketingInputSource for a large file with many small reads.

DAFFODIL-2921

pkatlic

+1

mbeckerle

+1

I just have some questions for discussion. As is, the change is certainly an improvement.

mbeckerle · 2024-08-21T14:21:01Z

daffodil-cli/src/main/scala/org/apache/daffodil/cli/Main.scala

+                // the BucketingInputSource. Larger files cannot be mapped so we cannot avoid it
+                val path = Paths.get(file)
+                val size = Files.size(path)
+                if (size <= Int.MaxValue) {


The doc for the map method says:

For most operating systems, mapping a file into memory is more expensive than reading or writing a few tens of kilobytes of data via the usual read and write methods.

So should there also be a floor check, e.g., below some size we just read the file into a byte buffer and avoid the map?
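To make the floor-check idea concrete, here is a minimal sketch. The `FloorCheck` object, the `load` method, and the 64 KiB default are all assumptions chosen for illustration; the thread does not settle on a cutoff, and nothing like this was merged.

```scala
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.{Files, Path, StandardOpenOption}

object FloorCheck {
  // Hypothetical cutoff below which mapping is assumed not to pay off
  val MapFloorBytes: Long = 64L * 1024

  // Returns a buffer for files that fit in one ByteBuffer, or None when the
  // file is too large and the caller has to fall back to streaming.
  def load(path: Path, floor: Long = MapFloorBytes): Option[ByteBuffer] = {
    val size = Files.size(path)
    if (size < floor) {
      // Small file: a plain read into a heap buffer avoids the mmap setup cost
      Some(ByteBuffer.wrap(Files.readAllBytes(path)))
    } else if (size <= Int.MaxValue) {
      val fc = FileChannel.open(path, StandardOpenOption.READ)
      try Some(fc.map(FileChannel.MapMode.READ_ONLY, 0, size))
      finally fc.close()
    } else {
      None
    }
  }
}
```

Whether any floor helps would need measuring, since both branches below the 2GB limit already avoid the bucketing overhead.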

Possibly. Though I imagine that if you're parsing a small file with the CLI, the overhead of mmap is going to be relatively small compared to the overhead of starting up a JVM, so maybe it won't make a difference? I'm not sure. We can do some experiments to see if there's a benefit for smaller files.

Could leave this as you have it, and we can look at the nightly performance results to see if it slows down noticeably. Lots of those will use files that are small.

The nightlies don't use the parse command so won't see any change. They use the performance command which reads test files into a byte array before testing to avoid overhead related to disk I/O.

We could create some patches that run on the nightlies, one that changes the performance command to use a FileInputStream and one that uses a MappedByteBuffer, which would give us an idea of mmap vs. file input stream. But that feels like a decent amount of work just to figure out the file size at which mmap overhead exceeds bucketing overhead. Also, based on my bucketing vs. non-bucketing tests, I feel like bucketing overhead is probably greater than mmap overhead even with small files, so we should always avoid bucketing when possible.

OK. Then I suggest merging as is, and we worry (or not) about this minor issue later if it comes up.

mbeckerle · 2024-08-21T14:25:04Z

daffodil-cli/src/main/scala/org/apache/daffodil/cli/Main.scala

+                val size = Files.size(path)
+                if (size <= Int.MaxValue) {
+                  val fc = FileChannel.open(path, StandardOpenOption.READ)
+                  val bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, size)


This is in the CLI. Could this be done inside the API so that all applications benefit from it?

E.g., InputSourceDataInputStream(is) could analyze the input stream to see if it is a file of a suitable size?

We could, but I'm a little hesitant to force something on an API user if we can't say for sure it will be faster in 100% of cases, especially if there are cases where it could be slower (e.g., with small files as you mentioned).

Maybe an alternative might be to instead just provide better API documentation, maybe something like:

The InputStream variant has potential overhead due to streaming capabilities and support for unlimited data sizes. In some cases, better performance might come from using the ByteBuffer variant instead. For example, if your data is already in a byte array, one should use the Array[Byte] or ByteBuffer variants instead of wrapping it in a ByteArrayInputStream. As another example, instead of using a FileInputStream one could consider mapping the File to a MappedByteBuffer, keeping in mind that MappedByteBuffers might have different performance characteristics depending on the file size and system.

And then we leave it up to the API users to figure out what works best for their system/environment?

Add to that comment a link to the code lines in the CLI as an illustrative example of how to do it, or just put an example in the javadoc, and I agree that would be sufficient.
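A sketch of what such an API example might look like, using only java.nio so that it runs without Daffodil on the classpath. The `ByteBufferSources` object and its method names are hypothetical; the resulting ByteBuffer would then be handed to the ByteBuffer variant of the input source constructor, a call left out here.

```scala
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.{Files, Path, StandardOpenOption}

object ByteBufferSources {
  // Data already in memory: wrap the array directly instead of going through
  // a ByteArrayInputStream. wrap() shares the array and does not copy it.
  def fromArray(data: Array[Byte]): ByteBuffer = ByteBuffer.wrap(data)

  // Data in a file: map it read-only instead of opening a FileInputStream.
  // A single mapping is limited to Int.MaxValue bytes.
  def fromFile(path: Path): ByteBuffer = {
    val size = Files.size(path)
    require(size <= Int.MaxValue, s"$path is too large to map; use an InputStream")
    val fc = FileChannel.open(path, StandardOpenOption.READ)
    try fc.map(FileChannel.MapMode.READ_ONLY, 0, size)
    finally fc.close()
  }
}
```

As the proposed documentation notes, whether the mapped variant is actually faster depends on the file size and the system, so it is worth measuring in the target environment.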

Daffodil currently supports two different input sources: a BucketingInputSource backed by an InputStream and a ByteBufferInputSource backed by a ByteBuffer. The CLI currently always uses the BucketingInputSource because the ByteBufferInputSource does not support stdin or files larger than 2GB. Although the gap is closing due to other optimizations, the BucketingInputSource still has overhead compared to the ByteBufferInputSource due to its added complexity.

This changes the CLI logic to use a ByteBufferInputSource where possible (parsing files <= 2GB) using mmap and a MappedByteBuffer to efficiently create a ByteBuffer.

Basic testing shows about a 5% performance improvement over the BucketingInputSource for a large file with many small reads.

Also add Java/Scala API documentation explaining the performance characteristics of the different input source constructors, with example code for using mmap vs. FileInputStream.

DAFFODIL-2921

pkatlic approved these changes Jul 31, 2024

mbeckerle approved these changes Aug 21, 2024

stevedlawrence force-pushed the daffodil-2921-cli-mmap-bytebuffer branch from d6ef2ca to 6ffcdb8 Compare August 21, 2024 16:15

stevedlawrence merged commit 67eef7e into apache:main Aug 21, 2024
11 checks passed

stevedlawrence deleted the daffodil-2921-cli-mmap-bytebuffer branch August 21, 2024 16:44
