Guide on optimizing Parquet file reading to reduce memory usage #594
base: master
Conversation
adamreeve left a comment
Nice start, thanks @haianhng31!
Can you also please add a link to the new guide, as well as the visitor pattern one, to the list at the bottom of index.md?
> APIs for reading Parquet files:
> 1. **LogicalColumnReader API** - Column-oriented reading with type-safe access
> 2. **Arrow API (FileReader)** - Row-oriented reading using Apache Arrow's in-memory format
The Arrow format is still column-oriented
Suggested change:
```diff
-2. **Arrow API (FileReader)** - Row-oriented reading using Apache Arrow's in-memory format
+2. **Arrow API (FileReader)** - Reading using Apache Arrow's in-memory format
```
> Each API offers different memory management options that impact memory usage.
> ## Memory Configuration Parameters
This should include a section on the buffered stream parameter (ReaderProperties.EnableBufferedStream). Maybe this could be combined with the Buffer Size section as the buffer size is only used when the buffered stream is enabled? It would be helpful to also link to the documentation for the relevant methods for setting each parameter.
> ### 1. Buffer Size
> Controls the size of I/O buffers used when reading from disk or streams.
>
> **Default**: 8 MB (8,388,608 bytes) when using default file reading
I think the default is actually 16384, where did you get 8 MB from?
> **Impact**: Larger buffers reduce I/O operations but increase memory usage. Smaller buffers are more memory-efficient but may decrease throughput.
>
> ### 2. Chunked Reading
> Instead of loading entire columns into memory, read data in smaller chunks.
This could do with some clarification. Is this referring to using the LogicalColumnReader API and controlling buffer/chunk sizes yourself?
> **Impact**: Pre-buffering can significantly increase memory usage as it loads data from future row groups before they're needed. This is the primary cause of memory usage scaling with file size reported in Apache Arrow [issue #46935](https://github.com/apache/arrow/issues/46935).
>
> ### 4. Cache (Arrow API Only)
> The Arrow API uses an internal `ReadRangeCache` that stores buffers for column chunks.
I think this can be merged into number 3, as the cache options only apply when using pre-buffering and are used to configure the pre-buffering behaviour.
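For context, pre-buffering (and with it the cache behaviour) is configured through the Arrow reader properties. A rough sketch of disabling it, assuming an `ArrowReaderProperties.GetDefault()` factory and a `FileReader` overload accepting `ArrowReaderProperties` (worth double-checking against the current API), with a hypothetical `data.parquet` file:

```csharp
using ParquetSharp.Arrow;

// Disable pre-buffering so column chunks are read on demand rather than
// being fetched ahead of time into the internal ReadRangeCache.
var arrowProperties = ArrowReaderProperties.GetDefault();
arrowProperties.PreBuffer = false;

using var fileReader = new FileReader("data.parquet", arrowProperties: arrowProperties);
```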
> ```csharp
> for (int col = 0; col < metadata.NumColumns; col++)
> {
>     using var columnReader = rowGroupReader.Column(col);
>     using var logicalReader = columnReader.LogicalReader<float>();
> ```
I think it's worth pointing out that ParquetSharp has its own buffering in the LogicalReader API, and this can be configured with the bufferLength parameter of this method.
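A sketch of what configuring that could look like, reading a single column with a smaller buffer (the default `bufferLength` is 4096 values, if I remember correctly; `data.parquet` is a hypothetical file):

```csharp
using ParquetSharp;

using var fileReader = new ParquetFileReader("data.parquet");
using var rowGroupReader = fileReader.RowGroup(0);
using var columnReader = rowGroupReader.Column(0);

// A smaller bufferLength shrinks ParquetSharp's internal read buffer,
// trading some throughput for lower memory usage.
using var logicalReader = columnReader.LogicalReader<float>(bufferLength: 1024);
var values = logicalReader.ReadAll(checked((int) rowGroupReader.MetaData.NumRows));
```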
> ```csharp
> {
>     // Use a buffered stream with custom buffer size (1 MB in this example)
>     using var fileStream = File.OpenRead(filePath);
>     using var bufferedStream = new BufferedStream(fileStream, bufferSize);
> ```
By using a buffered stream, I actually meant enabling it in the ReaderProperties with ReaderProperties.EnableBufferedStream.
In previous investigations I've found that this can significantly reduce memory usage.
I don't think using a .NET System.IO.BufferedStream will change memory usage characteristics much.
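Something along these lines is what I had in mind (a sketch from memory, so please verify the method names against the `ReaderProperties` documentation):

```csharp
using ParquetSharp;

// Enable buffered stream reading in the native reader itself,
// rather than wrapping the file in a System.IO.BufferedStream.
var readerProperties = ReaderProperties.GetDefaultReaderProperties();
readerProperties.EnableBufferedStream();
readerProperties.BufferSize = 1024 * 1024; // 1 MB, matching the example's buffer size

using var fileReader = new ParquetFileReader("data.parquet", readerProperties);
```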
> - **Columns**: 10 float columns
> - **Rows**: 100 million (1 million per row group)
> - **Compression**: Snappy
> - **Test System**: MacBook (*Note: real-world performance may vary depending on your operating system and environment*)
👍