Python 3.x related changes #4
base: master
Conversation
Hi Andrew, many thanks for this. I will try and test tomorrow to ensure it still works on 2.7 (yes, yes, yes, I need to move on...) ;-) Cheers!
Ah, I didn't realise it added the latest commits from my fork to the pull request. I forgot to pull your upstream version before making more edits, then got into all sorts of problems, hence the revert to your version in 46c89bc. So at the moment the script only works with Py27. I'm going to re-implement the Py35 stuff now, so wait for that before doing anything yourself...
…. This allows the script to work with Py 3.x.
As per the commit - see http://docs.python-requests.org/en/master/user/quickstart/#response-content for more information - with this small edit the script works with Py35 again.
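(For context, a minimal sketch of the kind of change that page describes, assuming the script writes a downloaded granule to disk with requests; the URL, filename, and chunk size below are purely illustrative.)

```python
import requests

# Hypothetical illustration: in Python 3, r.text is decoded text (str) while
# r.content / iter_content() yield bytes, so a binary file such as an HDF
# granule must be written from the byte interface with the file opened in
# binary mode.
r = requests.get("https://example.com/granule.hdf", stream=True, timeout=60)
r.raise_for_status()
with open("granule.hdf", "wb") as fp:
    for chunk in r.iter_content(chunk_size=65536):
        if chunk:  # skip keep-alive chunks
            fp.write(chunk)
```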
Still having problems with timeouts despite commit 16ccf82. If I manage to fix them I'll post a commit here.
…ent the checking of local vs remote file sizes, which was lost during the transition to HTTP authentication. In theory this checking should not be needed if the session reconnection and timeout logic works properly, but I have implemented it as I have lots of gaps in the data downloaded to date.
…ile size. Instead, the script now downloads files to filenames suffixed .part, then removes the suffix once the download has completed. The -quick/ruff option has been removed; --check_sizes has been added but defaults to False. It should only be needed on legacy directories (i.e. where there might be corrupt files not named .part).
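(A minimal sketch of that download-then-rename pattern, assuming an existing requests session; the helper name and arguments are hypothetical, not taken from the script.)

```python
import os
import requests

def download_atomic(session, url, local_path, chunk_size=65536):
    """Download to a .part file, then rename once the transfer completes.

    Hypothetical helper: an interrupted transfer only ever leaves a .part
    file behind, so any file without the suffix can be assumed complete.
    """
    part_path = local_path + ".part"
    r = session.get(url, stream=True, timeout=60)
    r.raise_for_status()
    with open(part_path, "wb") as fp:
        for chunk in r.iter_content(chunk_size=chunk_size):
            fp.write(chunk)
    os.rename(part_path, local_path)  # only reached on a complete download
```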
1) Removed excess logic (enabling removal of two indent levels). 2) Implemented direct login to the EarthData auth service at the start of the session, which then remains persistent. This has fixed the previous problem where lots of 401 (access denied) responses caused the script to stall for tens of session attempts before eventually connecting. 3) Disabled the flush and fsync calls as these were slowing down processing.
I've pushed all my recent changes onto my master branch; have a look at whether you want to merge. All of these changes have been aimed at getting to the bottom of why the script has been so slow at downloading for me. As of this morning's commits I appear to have solved the speed problems: (1) the existing authentication process was revealed to be very flaky once I started checking the response headers; (2) flush and fsync don't appear to be needed when using the requests module.
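(For reference, a minimal sketch of the persistent-session idea, assuming requests; the base URL, credentials, and retry count are placeholders rather than values from the script.)

```python
import requests
from requests.adapters import HTTPAdapter

# Hypothetical sketch: create one session at the start of the run, mount a
# retrying adapter on the data provider's base URL, and reuse that session
# for every subsequent listing and download request so the authenticated
# state persists.
base_url = "https://data-provider.example/"        # placeholder base URL
with requests.Session() as s:
    s.mount(base_url, HTTPAdapter(max_retries=5))
    s.auth = ("username", "password")              # placeholder EarthData credentials
    r = s.get(base_url, timeout=60)
    if r.status_code != 200:                       # check the response instead of assuming success
        raise RuntimeError("Initial request failed: %d" % r.status_code)
    # ... all further listing and download requests reuse `s` ...
```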
…ine (enables download from e.g. NSIDC), 2. convert return_url to use the authenticated session to obtain the list of dates (required for NSIDC MODIS data), and now parse these dates with BeautifulSoup.
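(As an illustration of that parsing step, a minimal sketch assuming an authenticated requests session and a directory-style listing page; the function and variable names are hypothetical.)

```python
from bs4 import BeautifulSoup

def list_dates(session, base_url):
    """Scrape date sub-directories (e.g. '2018.01.15/') from a listing page.

    Hypothetical sketch: providers such as USGS and NSIDC expose one folder
    per acquisition date, so hrefs ending in '/' give the available dates.
    """
    r = session.get(base_url, timeout=60)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "lxml")
    return [a.get("href").rstrip("/")
            for a in soup.find_all("a")
            if a.get("href", "").endswith("/")]
```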
…mplement ability to choose data provider as a simplified way of setting baseurl.
Many thanks for your efforts, I have written a few comments below...
with requests.Session() as s:
    s.mount(base_url, requests.adapters.HTTPAdapter(max_retries=5))

    ## Login to the EarthData Service for this session
Is this authenticity token needed? I think I just used a session with no issues not that long ago... Or is it needed for the download to work with NSIDC (I seem to remember it wasn't necessary either...)?
I wrote this code about three years ago so my memory is a bit hazy, but my assumption is that it must have been necessary then, otherwise I wouldn't have gone to the trouble of GETing then POSTing the auth token.
(the login process is the same whether USGS or NSIDC - they both go through the EarthData login service)
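(For readers following along, a minimal sketch of the GET-then-POST token flow being discussed, assuming the EarthData login form exposes a hidden authenticity_token field; the URL, field names, and credentials below are illustrative, not copied from the script.)

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical sketch of the token dance: fetch the login page, pull the
# hidden authenticity token out of the form, then POST it back with the
# credentials so the session carries the login cookies for later downloads.
login_url = "https://urs.earthdata.nasa.gov/home"   # illustrative endpoint
with requests.Session() as s:
    page = s.get(login_url, timeout=60)
    soup = BeautifulSoup(page.text, "lxml")
    token_field = soup.find("input", attrs={"name": "authenticity_token"})
    payload = {
        "username": "user",                          # placeholder credentials
        "password": "pass",
        "authenticity_token": token_field["value"] if token_field else "",
    }
    s.post(login_url, data=payload, timeout=60)
    # `s` is now authenticated for subsequent requests in this session
```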
@@ -10,7 +10,7 @@ Description

This repository contains a Python script (and executable) that allows one to download MODIS data granules for different products and periods.

-The code is quite simple and generic, and should work with most standard Python installations.
+The code is quite simple and generic but requires at least a couple of specific libraries not usually available by default: Beautiful Soup (conda-forge: `bs4`) and lxml (conda-forge: `lxml`).
I'm usually wary of adding dependencies for simple tasks such as scanning a few links. I see it simplifies some of the code somewhat, but... I'm already wary of using requests, as I couldn't get urllib to work with redirection.
The returned HTML is relatively complicated, especially for doing the EarthData login, so this seems a reasonable compromise to me.
# along with brdf_filter. If not, see <http://www.gnu.org/licenses/>.
-import optparse

+import argparse
Well spotted!
fp.write(chunk)
fp.flush()
os.fsync(fp)
dates_available = None
I have another script that scans folders in parallel, gathers all the URLs and then downloads them in parallel (you're only allowed a couple of slots for download, but scanning is much faster). Maybe worth considering...
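(A minimal sketch of the approach being suggested, under clear assumptions: list_granule_urls is a hypothetical listing helper, download_atomic is the hypothetical download helper sketched earlier, and the worker counts are illustrative only.)

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch: scan many date folders concurrently (fast and cheap),
# then download with only a couple of workers to respect the server's limit
# on simultaneous download slots.
def gather_urls(session, date_urls, scan_workers=10):
    with ThreadPoolExecutor(max_workers=scan_workers) as pool:
        url_lists = pool.map(lambda u: list_granule_urls(session, u), date_urls)
    return [url for urls in url_lists for url in urls]

def download_all(session, urls, download_workers=2):
    with ThreadPoolExecutor(max_workers=download_workers) as pool:
        list(pool.map(lambda u: download_atomic(session, u, u.split("/")[-1]), urls))
```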
Thanks for the info - could be useful. At the moment I'm working primarily with MOD10A1. The product sizes are small enough that I don't really need to worry about parallelising the download.
I've diverged quite a long way from your implementation now ... so I offer my changes should you want them, but no problem if not! I'm using them all at my end so they are serving their purpose.
Hi Jose
Having used the script for the last few days I've come across a couple more oddities related to Python 3. There are two commits (plus some merging, as I didn't have your remote linked to the repo so didn't catch your latest changes) - one deals with handling the timeout errors that I've been having, the other deals with the different handling of content-length in Python 3.
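(A minimal sketch of the two fixes being described, assuming a requests session; the helper name, retry count, and use of a HEAD request are illustrative assumptions, not code from these commits.)

```python
import requests

def remote_size(session, url, retries=3):
    """Return the remote file size, retrying on timeouts (hypothetical helper)."""
    for attempt in range(retries):
        try:
            r = session.head(url, timeout=30, allow_redirects=True)
            r.raise_for_status()
            # In Python 3 the header value is a str, so cast to int before
            # comparing with os.path.getsize() on the local file.
            return int(r.headers.get("Content-Length", 0))
        except requests.exceptions.Timeout:
            if attempt == retries - 1:
                raise  # give up after the last retry
```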
I had a quick go on Py 2.7, seems ok.
cheers
Andrew