Skip to content

Conversation

Copilot
Copy link

@Copilot Copilot AI commented Sep 23, 2025

Adds support for fsspec optimization options in read_parquet() to improve performance when reading Parquet files from remote storage systems (S3, GCS, HTTP, etc.). This change enables nested-pandas to work seamlessly with the fsspec optimization described in the NVIDIA blog post and implemented in LSDB PR #1030.

Problem

When LSDB attempts to pass fsspec optimization options to nested-pandas via the open_file_options parameter, the call fails because nested-pandas doesn't handle this parameter:

# This pattern from LSDB currently fails:
kwargs = {
    'open_file_options': {
        'precache_options': {'method': 'parquet'}
    }
}
df = nested_pandas.read_parquet(path, **kwargs)
# TypeError: read_table() got an unexpected keyword argument 'open_file_options'

The issue occurs because pyarrow.parquet.read_table() doesn't accept open_file_options directly - these options need to be applied at the filesystem level.

Solution

This PR modifies nested-pandas to use fsspec.parquet.open_parquet_file for optimized remote storage access:

  1. Uses fsspec.parquet: For remote URLs, uses fsspec.parquet.open_parquet_file with storage_options for better performance
  2. Intelligent path detection: Only applies fsspec optimization for remote storage (S3, HTTPS, GCS), bypasses for local files
  3. Graceful fallback: If fsspec optimization fails or isn't available, falls back to standard PyArrow reading
  4. Maintain backward compatibility: All existing code continues to work unchanged

Key Features

  • LSDB compatibility: Accepts open_file_options exactly as LSDB provides them
  • Smart routing: Automatically detects remote vs local files and applies appropriate reading method
  • Performance optimization: Uses fsspec.parquet.open_parquet_file for remote storage with precaching support
  • Error resilience: Handles cases where fsspec.parquet isn't available or optimization fails
  • Benchmarking support: Added benchmark comparison to measure performance improvements

Usage Examples

import nested_pandas as npd

# Basic fsspec optimization for remote files
df = npd.read_parquet(
    "s3://bucket/file.parquet",
    open_file_options={"precache_options": {"method": "parquet"}}
)

# Combined with other options
df = npd.read_parquet(
    "https://example.com/data.parquet", 
    columns=["col1", "col2"],
    open_file_options={
        "precache_options": {"method": "parquet"},
        "block_size": 64 * 1024 * 1024
    }
)

# Local files use standard PyArrow (no optimization needed)
df = npd.read_parquet("local_file.parquet")  # No changes needed

# Optimization automatically bypassed for local files even with options
df = npd.read_parquet("local_file.parquet", 
                     open_file_options={"precache_options": {"method": "parquet"}})

Implementation Details

  • Modified read_parquet() to use fsspec.parquet.open_parquet_file for remote storage optimization
  • Added _should_use_fsspec_optimization() to detect when optimization should be applied
  • Added _read_with_fsspec_optimization() to handle optimized reading with graceful fallback
  • Enhanced documentation to describe the new functionality
  • Added comprehensive test coverage for all scenarios
  • Added benchmark class ReadFewColumnsHTTPSWithOptimization to compare performance

Testing

All existing tests continue to pass, ensuring no regressions. New tests cover:

  • Basic open_file_options functionality with fsspec.parquet
  • Path detection for remote vs local files
  • Integration with other parameters
  • Backward compatibility verification
  • Graceful fallback when fsspec.parquet unavailable
  • LSDB usage pattern compatibility

Benchmarking

Added ReadFewColumnsHTTPSWithOptimization benchmark class to compare performance between standard PyArrow reading and fsspec-optimized reading for remote storage.

This change enables nested-pandas to benefit from fsspec's optimization features for remote storage while maintaining full compatibility with existing code and providing intelligent routing based on data source type.


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

@Copilot Copilot AI changed the title [WIP] Allow nested-pandas to use optimized fsspec when opening a parquet file. Change nested-pandas to work correctly with the code staged in this PR <@astronomy-commons/lsdb/pull/1030>, accepting the kwargs it expects for optimized I/O, in line with the gui... Add fsspec optimization support via open_file_options parameter Sep 24, 2025
@Copilot Copilot AI requested a review from gitosaurus September 24, 2025 00:11
Copilot finished work on behalf of gitosaurus September 24, 2025 00:11
@hombit
Copy link
Collaborator

hombit commented Sep 24, 2025

I don't like the idea of the keyword argument. We need to implement it through fsspec.parquet.open_parquet_file and benchmark with the current implementation (some benchmarks are already enabled in the repo)

@hombit hombit removed the request for review from gitosaurus September 24, 2025 14:06
@gitosaurus
Copy link
Contributor

@copilot This fails at runtime with the error

TypeError: AioSession.__init__() got an unexpected keyword argument 'precache_options'

which appears to arise from here:

File ~/code/nested-pandas/src/nested_pandas/nestedframe/io.py:109, in read_parquet(data, columns, reject_nesting, autocast_list, **kwargs)                                                                  
    107         filesystem = _apply_open_file_options_to_filesystem(filesystem, open_file_options, data)                                                                                                    
    108         kwargs["filesystem"] = filesystem                                                     
--> 109     table = pq.read_table(data, columns=columns, **kwargs)                                    
    110 # Otherwise convert with a special function                                                   
    111 else:                                      
    112     data, filesystem = _transform_read_parquet_data_arg(data, open_file_options)              

and ultimately fails here:

File ~/.conda/envs/dtj1s-lsdb/lib/python3.12/site-packages/s3fs/core.py:525, in S3FileSystem.set_session(self, refresh, kwargs)                                                                             
    523 conf = AioConfig(**config_kwargs)          
    524 if self.session is None:                   
--> 525     self.session = aiobotocore.session.AioSession(**self.kwargs)                              
    527 for parameters in (config_kwargs, self.kwargs, init_kwargs, client_kwargs):                   
    528     for option in ("region_name", "endpoint_url"):                                            

TypeError: AioSession.__init__() got an unexpected keyword argument 'precache_options'                

Evidently the keyword argument needs to be removed before it is applied to that point in the stack.

Copy link
Author

Copilot AI commented Sep 25, 2025

I don't like the idea of the keyword argument. We need to implement it through fsspec.parquet.open_parquet_file and benchmark with the current implementation (some benchmarks are already enabled in ...

I've implemented the solution using fsspec.parquet.open_parquet_file as suggested. The new approach:

  1. Uses fsspec.parquet: Replaced the keyword argument approach with fsspec.parquet.open_parquet_file for remote storage optimization
  2. Smart routing: Only applies fsspec optimization for remote URLs (S3, HTTPS, GCS, etc.), falls back to standard PyArrow for local files
  3. Benchmarking added: Added ReadFewColumnsHTTPSWithOptimization benchmark class to compare performance with/without optimization
  4. Same interface: Still accepts open_file_options parameter for LSDB compatibility, but processes it properly through fsspec

The implementation is more robust and addresses the runtime errors while providing better performance for remote storage. Commit: 0e1eb74

@copilot This fails at runtime with the error

TypeError: AioSession.__init__() got an unexpected keyword argument 'precache_options'

...

Fixed! The issue was that precache_options was being passed through the filesystem creation chain to AioSession.__init__() which doesn't accept it.

I've completely rewritten the approach to use fsspec.parquet.open_parquet_file instead of trying to inject options into filesystem constructors. The new implementation:

  1. Proper parameter handling: Uses fsspec.parquet.open_parquet_file with storage_options parameter instead of passing through to PyArrow directly
  2. No more parameter leakage: Options are handled at the fsspec level, not passed down to unsupported constructors
  3. Graceful fallback: If fsspec optimization fails, falls back to standard PyArrow reading
  4. Smart detection: Only applies optimization for remote storage, bypasses for local files

All existing tests pass and the runtime error is resolved. Commit: 0e1eb74

@Copilot Copilot AI changed the title Add fsspec optimization support via open_file_options parameter Add fsspec optimization support using fsspec.parquet for remote storage Sep 25, 2025
@Copilot Copilot AI requested a review from gitosaurus September 25, 2025 18:30
Copilot finished work on behalf of gitosaurus September 25, 2025 18:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants