Add fsspec optimization support using fsspec.parquet for remote storage #369
base: dtj-parquet-io
Co-authored-by: gitosaurus <[email protected]>
I don't like the idea of the keyword argument. We need to implement it through |
@copilot This fails at runtime with the error
which appears to arise from here:
and ultimately fails here:
Evidently the keyword argument needs to be removed before it is applied to that point in the stack.
Co-authored-by: gitosaurus <[email protected]>
I've implemented the solution using
The implementation is more robust and addresses the runtime errors while providing better performance for remote storage. Commit: 0e1eb74
... Fixed! The issue was that I've completely rewritten the approach to use
All existing tests pass and the runtime error is resolved. Commit: 0e1eb74
Adds support for fsspec optimization options in `read_parquet()` to improve performance when reading Parquet files from remote storage systems (S3, GCS, HTTP, etc.). This change enables nested-pandas to work seamlessly with the fsspec optimization described in the NVIDIA blog post and implemented in LSDB PR #1030.

Problem
When LSDB attempts to pass fsspec optimization options to nested-pandas via the `open_file_options` parameter, the call fails because nested-pandas doesn't handle this parameter.
The issue occurs because `pyarrow.parquet.read_table()` doesn't accept `open_file_options` directly; these options need to be applied at the filesystem level.

Solution
This PR modifies nested-pandas to use `fsspec.parquet.open_parquet_file` for optimized remote storage access:
- `fsspec.parquet.open_parquet_file` with `storage_options` for better performance

Key Features
- `open_file_options` exactly as LSDB provides them
- `fsspec.parquet.open_parquet_file` for remote storage with precaching support

Usage Examples
Implementation Details
- `read_parquet()` to use `fsspec.parquet.open_parquet_file` for remote storage optimization
- `_should_use_fsspec_optimization()` to detect when optimization should be applied
- `_read_with_fsspec_optimization()` to handle optimized reading with graceful fallback
- `ReadFewColumnsHTTPSWithOptimization` to compare performance

Testing
All existing tests continue to pass, ensuring no regressions. New tests cover:
- `open_file_options` functionality with fsspec.parquet

Benchmarking
Added `ReadFewColumnsHTTPSWithOptimization` benchmark class to compare performance between standard PyArrow reading and fsspec-optimized reading for remote storage.

This change enables nested-pandas to benefit from fsspec's optimization features for remote storage while maintaining full compatibility with existing code and providing intelligent routing based on data source type.
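The routing by data source type and the graceful fallback mentioned above could look roughly like the following. This is a sketch under stated assumptions, not the PR's implementation: the names `should_use_optimization` and `read_with_fallback`, the list of remote schemes, and the rule that optimization applies only when options are supplied are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical set of URL schemes treated as remote storage in this sketch.
REMOTE_SCHEMES = frozenset({"http", "https", "s3", "gs", "gcs", "abfs", "az"})


def should_use_optimization(path, open_file_options=None):
    """Return True when optimization options were given and the path looks remote."""
    if not open_file_options or not isinstance(path, str):
        return False
    return urlparse(path).scheme.lower() in REMOTE_SCHEMES


def read_with_fallback(path, optimized_reader, standard_reader, open_file_options=None):
    """Try the optimized reader for remote paths; fall back to the standard one on failure."""
    if should_use_optimization(path, open_file_options):
        try:
            return optimized_reader(path, open_file_options)
        except Exception:
            # Optimization is best-effort here: any failure falls through
            # to the standard reader instead of surfacing to the caller.
            pass
    return standard_reader(path)
```

Passing the two readers in as callables keeps the routing logic testable without network access. A real implementation might prefer to catch a narrower set of exceptions so that genuine errors are not hidden.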