Inconsistent handling of headers in storage_options for pandas read methods #295

Closed
Schwarzam opened this issue Jun 17, 2024 · 0 comments · Fixed by #296
Labels
bug Something isn't working

Comments

@Schwarzam
Contributor

I encountered an issue when trying to use JWT authentication with pandas file read methods, such as read_parquet. The problem arises due to the different ways headers need to be specified for HTTP(S) URLs when using pandas and fsspec.

Typically, the header for a request with JWT authentication looks like this:

{
    "headers": {"Authorization": "Token XXXXXXX"}
}

When accessing files, storage_options is used to send these headers. However, pandas and fsspec handle them inconsistently: fsspec expects storage_options to include the "headers" key as shown above, whereas pandas expects the key-value pairs to be passed directly as header options, without the "headers" key.

According to the pandas.read_parquet documentation (the same applies to all read methods):
For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options.

Thus, for HTTP connections, the storage options should be formatted as follows:

{
    "Authorization": "Token XXXXXXX"
}

This discrepancy causes errors in the pandas read methods in file_io.py, such as read_parquet_file_to_pandas and ``.
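
The difference can be seen without any network access by building the request object directly. This is a minimal sketch, assuming pandas passes storage_options as the headers argument of urllib.request.Request, as the quoted documentation describes. The URL is a placeholder, not a real endpoint.

```python
from urllib.request import Request

# Placeholder URL (an assumption for illustration); no request is actually sent.
url = "https://example.com/data.parquet"

fsspec_style = {"headers": {"Authorization": "Token XXXXXXX"}}
pandas_style = {"Authorization": "Token XXXXXXX"}

# Flat key-value pairs become real HTTP headers on the request.
flat_request = Request(url, headers=pandas_style)

# The nested form creates a single header literally named "Headers",
# so no Authorization header is set on the request.
nested_request = Request(url, headers=fsspec_style)

print(flat_request.get_header("Authorization"))
print(nested_request.get_header("Authorization"))
```

With the nested form the token is never sent as an Authorization header, which would explain why authenticated reads fail when fsspec-style options are handed to pandas unchanged.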

Suggested Solution

To resolve this issue, I suggest adding the following lines before each method that reads with pandas, so that only the headers in storage_options are corrected. Here's the suggested code snippet:

if storage_options is not None and "headers" in storage_options:
    headers = storage_options.pop("headers")
    storage_options = {**storage_options, **headers}

I tested this locally and it works.
I don't know whether it would be better to create a function so the pattern isn't repeated.
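
If a shared function is preferred, one possible shape is sketched below. The name flatten_headers is hypothetical and does not exist in the project. Unlike the snippet above, this version builds a new dict instead of calling pop on the input, so the caller's original storage_options keeps its "headers" key and can still be passed to fsspec afterwards.

```python
from typing import Any, Optional


def flatten_headers(
    storage_options: Optional[dict[str, Any]],
) -> Optional[dict[str, Any]]:
    """Return storage_options with any nested "headers" merged to the top level.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    if storage_options is None or "headers" not in storage_options:
        return storage_options
    # Copy everything except "headers" so the caller's dict is left untouched.
    flattened = {k: v for k, v in storage_options.items() if k != "headers"}
    # Header entries win on key collisions, matching {**storage_options, **headers}.
    flattened.update(storage_options["headers"])
    return flattened


fsspec_style = {"headers": {"Authorization": "Token XXXXXXX"}}
pandas_style = flatten_headers(fsspec_style)
print(pandas_style)
```

Each pandas-based reader could then call the helper once before passing storage_options on, rather than repeating the three-line pattern.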

@Schwarzam Schwarzam added the bug Something isn't working label Jun 17, 2024
Projects
Status: Done
2 participants