Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement URL Validation for Improved Data Access #47

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

manjirigunaji
Copy link
Collaborator

@manjirigunaji manjirigunaji commented Jan 22, 2024

This pull request introduces a new module, validation_util.py, to handle URL validation within the National Water Model (NWM) utility scripts. The validation_util module includes a check_valid_urls function that validates URLs and filters out any invalid ones, reducing dependencies on tqdm. Additionally, URL validation functionality is to enhance the reliability of accessing data files in the project.

Changes Made:

  • Added a new module 'validation.py' containing functions for checking the validity of URLs.
  • Integrated URL validation into the 'urlgen.py' module to ensure that only valid URLs are generated for data access.
  • Implemented asynchronous URL validation using the 'gevent' library to improve performance and efficiency.

Tested various scenarios including valid URLs, invalid URLs, and edge cases to verify the robustness of the validation.

@jameshalgren
Copy link
Member

Please post a few results from tests demonstrating the new function.

Copy link
Member

@jameshalgren jameshalgren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR!
Please

  • correct the accidental deletions.
  • add the string to the test readme to obtain the test output indicated (also, please put this same information in the PR thread to facilitate review.)
  • consider including a standard unit test for the validation script.

Thanks!

@karnesh Please watch for this updated PR.

@@ -150,75 +189,15 @@ def select_lead_time(lead_time=None, default=None):
9: "https://ciroh-nwm-zarr-copy.s3.amazonaws.com/national-water-model/",
}

retrospective_var_types = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These deletions look accidental.

@@ -0,0 +1,25 @@
Here's an example of the test case output:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add the command line string to obtain this test?

Copy link
Collaborator

@karnesh karnesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for updating the PR! @manjirigunaji

  • Please clean up the code and remove same functions in urlgennwm.py and validation_util.py.
  • Please post few results demonstrating the new function.
  • It would be better to have a unit test instead of hard coded test - this can be either done now or in an another PR.

Thanks!

valid_file_list = [gevent.spawn(check_url_part, file_name) for file_name in file_list]
gevent.joinall(valid_file_list)
return [file.get() for file in valid_file_list if file.get() is not None]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is also in validation_util.py and imported here as well. Can you please clean up the code and have the function at one place?

t.set_description(f"Not Found: {filename}")
t.update(1)
t.refresh()
return None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to check_valid_urls function, can you please clean up the code and have the function at one place?

generate_urls(start_date, end_date, fcst_cycle, lead_time, varinput, geoinput, runinput, urlbaseinput, meminput)
# Example usage
file_list = create_file_list(runinput, varinput, geoinput, meminput, start_date, end_date, fcst_cycle, urlbaseinput, lead_time)
valid_files = check_valid_urls(file_list)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please post about the hard-coded test in the main PR message with some kind of ### Test heading?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test
I have included a testing demo image that could be beneficial.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above is an image that is the filenamelist.txt.
image

@manjirigunaji manjirigunaji changed the title Add validation_util module for URL validation and update urlgennwm.py #43 Implement URL Validation for Improved Data Access Feb 15, 2024
Copy link
Collaborator

@karnesh karnesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating the PR @manjirigunaji. Please change the hard coded test to a unit test in another PR. @sepehrkrz Please feel free to merge the PR.

Copy link
Collaborator

@karnesh karnesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good and this PR can be merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants