
Python 3.x related changes #4

Open
wants to merge 22 commits into master

Conversation

@atedstone (Contributor)

Hi Jose

Having used the script for the last few days, I've come across a couple more oddities related to Python 3. There are two commits (plus some merging, as I didn't have your remote linked to the repo and so didn't catch your latest changes): one deals with the timeout errors I've been having, the other with the different way content-length is handled in Python 3.
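Roughly, the timeout handling amounts to wrapping the GET in a small retry loop (a simplified sketch along these lines — the function and argument names are illustrative, not the literal commit):

import time
import requests

def get_with_retries(session, url, n_tries=5, timeout=60):
    # Retry the GET a few times if the server times out, backing off between attempts.
    for attempt in range(n_tries):
        try:
            return session.get(url, timeout=timeout, stream=True)
        except requests.exceptions.Timeout:
            time.sleep(2 ** attempt)
    raise IOError("gave up after %d timeouts fetching %s" % (n_tries, url))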

I had a quick go on Py 2.7, seems ok.

cheers
Andrew

@jgomezdans (Owner)

Hi Andrew,

Many thanks for this. Will try and test tomorrow to ensure it still works on 2.7 (yes, yes, yes, I need to move on...) ;-)

Cheers!

@atedstone (Contributor Author)

Ah, I didn't realise it added the latest commits from my fork to the pull request. I forgot to pull your upstream version before making more edits, then got into all sorts of problems, hence the revert to your version in 46c89bc. So at the moment the script is only working with Py27. I'm going to re-implement the Py35 stuff now, so wait for that before doing anything yourself...

…. This allows the script to work with Py 3.x.
@atedstone (Contributor Author)

As per commit - see http://docs.python-requests.org/en/master/user/quickstart/#response-content for more information - with this small edit the script works with Py35 again.
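In short: under Python 3 r.text is a decoded str while r.content is the raw bytes, so binary granules have to be written from r.content to a file opened in binary mode. A minimal sketch of the idea (the names are placeholders, not the script's own):

import requests

def fetch_granule(granule_url, local_fname):
    # r.text would be decoded text; r.content gives the raw bytes needed for HDF files.
    r = requests.get(granule_url, stream=True)
    r.raise_for_status()
    with open(local_fname, "wb") as fp:
        fp.write(r.content)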

@atedstone (Contributor Author)

Still having problems with timeouts despite commit 16ccf82. If I manage to fix them I'll post a commit here.

…ent the checking of local vs remote file sizes, which was lost during the transition to HTTP authentication. In theory this checking should not be needed if the session reconnection and timeout logic works properly, but I have implemented it as I have lots of gaps in the data downloaded to date.
…ile size. Instead, the script now downloads files to filenames suffixed .part, then removes the suffix once the download has completed. The -quick/ruff option is removed and --check_sizes added, defaulting to False; it should only be needed on legacy directories (i.e. where there might be corrupt files not named .part).
1) Removed excess logic (enabling loss of two indent levels)
2) Implemented direct login to the earthdata auth service at the start of the session, which remains persistent. This has fixed previous problems with lots of 401: access denied responses, which were causing the script to stall for tens of session attempts before eventually connecting.
3) Disabled flush and fsync commands as these were slowing down processing.
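Taken together, these commits make the download path look roughly like this (a sketch with illustrative names, not the literal code):

import os

def download_granule(session, url, local_fname, chunk_size=65536):
    # Stream to a temporary .part name and rename on completion, so a partial
    # download can never be mistaken for a finished file.
    part_fname = local_fname + ".part"
    r = session.get(url, stream=True, timeout=60)
    r.raise_for_status()
    with open(part_fname, "wb") as fp:
        for chunk in r.iter_content(chunk_size=chunk_size):
            fp.write(chunk)  # no explicit flush()/fsync() -- let the OS buffer
    os.rename(part_fname, local_fname)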
@atedstone (Contributor Author)

I've pushed all my recent changes onto my master branch; have a look at whether you want to merge them. All of these changes have been aimed at getting to the bottom of why the script has been so slow at downloading for me. As of this morning's commits I appear to have solved the speed problems: (1) the existing authentication process turned out to be very flaky once I started checking the response headers; (2) flush and fsync don't appear to be needed when using the requests stream=True option. I'm now downloading a tile a minute, much better than the previous worst case of roughly a tile per 30 minutes.
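The response-header check I mention is essentially comparing the local file size against the server's Content-Length, something like this (a sketch only — the header value is a str under Python 3 and needs casting, and whether a bare HEAD request works against the EarthData-protected URLs is an assumption):

import os

def needs_download(session, url, local_fname):
    # Re-download if the file is missing or its size differs from the remote size.
    if not os.path.exists(local_fname):
        return True
    r = session.head(url, allow_redirects=True, timeout=60)
    remote_size = int(r.headers.get("content-length", -1))
    return os.path.getsize(local_fname) != remote_size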

atedstone and others added 7 commits November 15, 2016 17:47
…ine (enables download from e.g. NSIDC), 2. convert return_url to use the authenticated session to obtain the list of dates (required for NSIDC MODIS data), and parse these dates with beautifulsoup.
…mplement ability to choose data provider as a simplified way of setting baseurl.
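The date-listing change in these commits boils down to fetching the product index page through the authenticated session and pulling the folder links out with Beautiful Soup, roughly as below (the YYYY.MM.DD folder-name test is an assumption about the data pool layout, not the literal code):

from bs4 import BeautifulSoup

def list_available_dates(session, base_url):
    # Scrape the product index page for folder links that look like dates.
    r = session.get(base_url, timeout=60)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "lxml")
    dates = []
    for link in soup.find_all("a"):
        href = (link.get("href") or "").strip("/")
        if len(href) == 10 and href[4] == "." and href[7] == ".":
            dates.append(href)  # e.g. "2016.11.15"
    return dates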
@jgomezdans (Owner) left a comment

Many thanks for your efforts, I have written a few comments below...

with requests.Session() as s:
    s.mount(base_url, requests.adapters.HTTPAdapter(max_retries=5))

    ## Login to the EarthData Service for this session
@jgomezdans (Owner)

Is this authenticity token needed? I think I just used a session with no issues not that long ago... Or is it needed for the download to work with NSIDC (I seem to remember it wasn't necessary either...)?

@atedstone (Contributor Author)

I wrote this code about three years ago so my memory is a bit hazy, but my assumption is that it must have been necessary then, otherwise I wouldn't have gone to the trouble of GETing and then POSTing the auth token.
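From memory the GET-then-POST dance has roughly this shape (a sketch only — the URLs and form field names are my assumptions from the time, so the actual commit may differ):

from bs4 import BeautifulSoup

def earthdata_login(session, username, password):
    # GET the login form to pick up its hidden token, then POST the credentials with it.
    r = session.get("https://urs.earthdata.nasa.gov/home", timeout=60)
    soup = BeautifulSoup(r.text, "lxml")
    token = soup.find("input", {"name": "authenticity_token"})["value"]
    payload = {"username": username,
               "password": password,
               "authenticity_token": token}
    r = session.post("https://urs.earthdata.nasa.gov/login", data=payload, timeout=60)
    r.raise_for_status()
    return session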

@atedstone (Contributor Author)

(the login process is the same whether USGS or NSIDC - they both go through the EarthData login service)

@@ -10,7 +10,7 @@ Description

This repository contains a Python script (and executable) that allows one to download MODIS data granules for different products and periods.

The code is quite simple and generic, and should work with most standard Python installations.
The code is quite simple and generic but requires at least a couple of specific libraries not usually available by default: Beautiful Soup (conda-forge: `bs4`) and lxml (conda-forge: `lxml`).
@jgomezdans (Owner)

I'm usually wary of adding dependencies for simple tasks such as scanning a few links. I see it simplifies some of the code somewhat, but... I'm already wary of using requests as I couldn't get urllib to work with redirection.

@atedstone (Contributor Author)

The returned HTML is relatively complicated, especially for the earthdata login, so this seems a reasonable compromise to me.

# along with brdf_filter. If not, see <http://www.gnu.org/licenses/>.
import optparse

import argparse
@jgomezdans (Owner)

Well spotted!

fp.write(chunk)
fp.flush()
os.fsync(fp)
dates_available = None
@jgomezdans (Owner)

I have another script that scans folders in parallel, gathers all the URLs and then downloads them in parallel (you're only allowed a couple of slots for download, but scanning is much faster). Maybe worth considering...
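The shape of that script is roughly the following (a sketch only — list_folder_urls and download_one are placeholder callables, and the worker counts are just indicative):

from concurrent.futures import ThreadPoolExecutor

def fetch_all(session, date_folders, list_folder_urls, download_one):
    # Scan the date folders with a wide pool (cheap HTML requests), but keep the
    # download pool small, since only a couple of simultaneous transfers are allowed.
    with ThreadPoolExecutor(max_workers=10) as scanner:
        url_lists = list(scanner.map(lambda d: list_folder_urls(session, d), date_folders))
    all_urls = [u for urls in url_lists for u in urls]
    with ThreadPoolExecutor(max_workers=2) as downloader:
        list(downloader.map(lambda u: download_one(session, u), all_urls))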

@atedstone (Contributor Author)

Thanks for the info - could be useful. At the moment I'm working primarily with MOD10A1; the product sizes are small enough that I don't really need to worry about parallelising the download.

@atedstone (Contributor Author)

I've diverged quite a long way from your implementation now... so I offer my changes should you want them, but no problem if not! I'm using them all at my end, so they are serving their purpose.
