Conversation

@haianhng31 (Contributor)

No description provided.

@adamreeve (Contributor) left a comment

Nice start, thanks @haianhng31

Can you also please add a link to the new guide, as well as the visitor pattern one, to the list at the bottom of index.md?


APIs for reading Parquet files:
1. **LogicalColumnReader API** - Column-oriented reading with type-safe access
2. **Arrow API (FileReader)** - Row-oriented reading using Apache Arrow's in-memory format
@adamreeve:

The Arrow format is still column-oriented

Suggested change:

```diff
-2. **Arrow API (FileReader)** - Row-oriented reading using Apache Arrow's in-memory format
+2. **Arrow API (FileReader)** - Reading using Apache Arrow's in-memory format
```


Each API offers different memory management options, described in the sections below.

## Memory Configuration Parameters
@adamreeve:

This should include a section on the buffered stream parameter (ReaderProperties.EnableBufferedStream). Maybe this could be combined with the Buffer Size section as the buffer size is only used when the buffered stream is enabled? It would be helpful to also link to the documentation for the relevant methods for setting each parameter.
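
For illustration, enabling the buffered stream could look like this minimal sketch (the `BufferSize` property and the `ParquetFileReader` overload taking `ReaderProperties` are assumptions about the current ParquetSharp API; `EnableBufferedStream` is the method named above):

```csharp
using ParquetSharp;

// Minimal sketch: enable ParquetSharp's buffered stream reading.
// BufferSize and the constructor overload taking ReaderProperties
// are assumptions, not confirmed API.
var properties = ReaderProperties.GetDefaultReaderProperties();
properties.EnableBufferedStream();
properties.BufferSize = 1024 * 1024; // 1 MB I/O buffer

using var fileReader = new ParquetFileReader("data.parquet", properties);
```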

### 1. Buffer Size
Controls the size of I/O buffers used when reading from disk or streams.

**Default**: 8 MB (8,388,608 bytes) when using default file reading

**Impact**: Larger buffers reduce I/O operations but increase memory usage. Smaller buffers are more memory-efficient but may decrease throughput.

### 2. Chunked Reading
Instead of loading entire columns into memory, read data in smaller chunks.
@adamreeve:

This could do with some clarification. Is this referring to using the LogicalColumnReader API and controlling buffer/chunk sizes yourself?
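
If this does mean manual chunking with the LogicalColumnReader API, a short sketch could make it concrete (file name, column index, and chunk size are illustrative):

```csharp
using ParquetSharp;

// Sketch: read one float column in fixed-size chunks instead of
// materialising the whole column at once.
using var fileReader = new ParquetFileReader("data.parquet");
using var rowGroupReader = fileReader.RowGroup(0);
using var columnReader = rowGroupReader.Column(0);
using var logicalReader = columnReader.LogicalReader<float>();

var chunk = new float[64 * 1024]; // 64K values per read
while (logicalReader.HasNext)
{
    int valuesRead = logicalReader.ReadBatch(chunk);
    // Process chunk[0..valuesRead] before reading the next chunk.
}
```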

**Impact**: Pre-buffering can significantly increase memory usage as it loads data from future row groups before they're needed. This is the primary cause of memory usage scaling with file size reported in Apache Arrow [issue #46935](https://github.com/apache/arrow/issues/46935).
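
For readers hitting that issue, turning pre-buffering off with the Arrow API might look like the following sketch (the `ArrowReaderProperties.PreBuffer` flag and the `FileReader` constructor shape are assumptions about ParquetSharp.Arrow):

```csharp
using ParquetSharp.Arrow;

// Sketch: disable pre-buffering so memory use does not grow with the
// number of row groups read ahead. PreBuffer and the constructor
// parameters are assumptions, not confirmed API.
var arrowProperties = ArrowReaderProperties.GetDefault();
arrowProperties.PreBuffer = false;

using var fileReader = new FileReader("data.parquet", arrowProperties: arrowProperties);
```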

### 4. Cache (Arrow API Only)
The Arrow API uses an internal `ReadRangeCache` that stores buffers for column chunks.
@adamreeve:

I think this can be merged into number 3, as the cache options only apply when using pre-buffering and are used to configure the pre-buffering behaviour.

```csharp
for (int col = 0; col < metadata.NumColumns; col++)
{
    using var columnReader = rowGroupReader.Column(col);
    using var logicalReader = columnReader.LogicalReader<float>();
```
@adamreeve:

I think it's worth pointing out that ParquetSharp has its own buffering in the LogicalReader API, and this can be configured with the bufferLength parameter of this method.
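
Continuing the snippet above, that could look like the following (`bufferLength` is the parameter named in the comment; the value shown is illustrative):

```csharp
// Sketch: use a smaller internal buffer for ParquetSharp's logical reader.
using var logicalReader = columnReader.LogicalReader<float>(bufferLength: 1024);
```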

```csharp
{
    // Use a buffered stream with custom buffer size (1 MB in this example)
    using var fileStream = File.OpenRead(filePath);
    using var bufferedStream = new BufferedStream(fileStream, bufferSize);
```
@adamreeve:

By using a buffered stream, I actually meant enabling it in the ReaderProperties with ReaderProperties.EnableBufferedStream.

In previous investigations I've found that this can significantly reduce memory usage.

I don't think using a .NET System.IO.BufferedStream will change memory usage characteristics much.
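
Applied to the example above, the suggestion might look like this sketch (reusing `filePath` and `bufferSize` from the snippet; as before, the `BufferSize` property and constructor overload are assumptions):

```csharp
// Sketch: let ParquetSharp buffer reads natively instead of wrapping
// a System.IO.BufferedStream around the file stream.
var properties = ReaderProperties.GetDefaultReaderProperties();
properties.EnableBufferedStream();
properties.BufferSize = bufferSize;

using var fileReader = new ParquetFileReader(filePath, properties);
```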

- **Columns**: 10 float columns
- **Rows**: 100 million (1 million per row group)
- **Compression**: Snappy
- **Test System**: MacBook (*note: real-world performance may vary depending on your operating system and environment*)
@adamreeve:

👍
