Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix get_dataset_path() in fs_utils.py #663

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

dongpohezui
Copy link

@dongpohezui dongpohezui commented Apr 10, 2021

version:
Windows 10
Python 3.7.0
petastorm 0.9.8
pyarrow 3.0.0

from petastorm import make_reader

reader = make_reader('file:///D:/test/test.parquet') 
print(reader)

The following error occurred when I ran the above code under Windows.

picture

The image above shows that there is an extra '/' at the beginning of the file path.
So I deleted the extra '/' .

@CLAassistant
Copy link

CLAassistant commented Apr 10, 2021

CLA assistant check
All committers have signed the CLA.

@codecov
Copy link

codecov bot commented Apr 10, 2021

Codecov Report

Patch coverage: 33.33% and project coverage change: -0.04% ⚠️

Comparison is base (10e0fc8) 85.22% compared to head (271da38) 85.19%.
Report is 70 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #663      +/-   ##
==========================================
- Coverage   85.22%   85.19%   -0.04%     
==========================================
  Files          85       85              
  Lines        4981     4984       +3     
  Branches      791      792       +1     
==========================================
+ Hits         4245     4246       +1     
- Misses        596      597       +1     
- Partials      140      141       +1     
Files Changed Coverage Δ
petastorm/fs_utils.py 90.00% <33.33%> (-1.76%) ⬇️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@selitvin
Copy link
Collaborator

Did you manage to track down what is the origin of this extra '/'? Which part of code added it? Asking this to make sure the same issue wouldn't pop up in some other scenarios with different code paths.

Also, can you please add a unittest covering this case?

@dongpohezui
Copy link
Author

The URL comes from user input.
Under Linux, it works just fine.
Under Windows, it makes mistakes.

In get_filesystem_and_path_or_paths(), URLparse () is used to handle URLs.

Under Windows, the file cannot be accessed via the file path "/D:/test/test.parquet ". Instead, it should be "D:/test/test.parquet".

In my opinion, you can only test this code under Windows.

@@ -33,6 +34,9 @@ def get_dataset_path(parsed_url):
# s3/gs/gcs filesystem expects paths of the form `bucket/path`
return parsed_url.netloc + parsed_url.path

if parsed_url.scheme.lower() in ['file'] and "Windows" in platform.system():
return parsed_url.path[1:]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a chance that this fix would break for file urls on network location (as per this article). Consider the example of the following WIndows url: file://hostname/path/to/the%20file.txt

@@ -33,6 +34,9 @@ def get_dataset_path(parsed_url):
# s3/gs/gcs filesystem expects paths of the form `bucket/path`
return parsed_url.netloc + parsed_url.path

if parsed_url.scheme.lower() in ['file'] and "Windows" in platform.system():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if you could add a unit test verifying correctness of this function in petastorm/tests/test_fs_utils.py. You would need to mock the value returned by platform.system() and then I think the test logic should be straight forward.

If for some reason you can not do this, we'd be able to land this and I can help with the test later.

Copy link
Collaborator

@selitvin selitvin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking the PR with "Request Changes" due to incorrect behavior in the case of windows urls that contain network and not local paths.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants