Python 3.x related changes #4

atedstone · 2016-06-17T09:05:35Z

Hi Jose

Having used the script for the last few days I've come across a couple more oddities related to Python 3. There are two commits (plus some merging as I didn't have your remote linked to the repo so didn't catch your latest changes) - one deals with handling timeout errors that I've been having, the other deals with the different method of handling content-length in Python 3.

I had a quick go on Py 2.7, seems ok.

cheers
Andrew

jgomezdans · 2016-10-11T22:33:03Z

Hi Andrew,

Many thanks for this. Will try and test tomorrow to ensure it still works on 2.7 (yes, yes, yes, I need to move on...) ;-)

Cheers!

atedstone · 2016-10-12T08:27:21Z

Ah didn't realise it added the latest commits from my fork to the pull request. I forgot to pull your upstream version before making more edits, then got into all sorts of problems, hence the revert to your version in 46c89bc. So at the moment the script is only working with Py27. I'm going to re-implement the Py35 stuff now so wait for that before doing anything yourself...

…. This allows the script to work with Py 3.x.

atedstone · 2016-10-12T08:58:11Z

As per commit - see http://docs.python-requests.org/en/master/user/quickstart/#response-content for more information - with this small edit the script works with Py35 again.

atedstone · 2016-10-12T11:59:42Z

still having problems with timeouts despite commit 16ccf82. If I manage to fix I'll post a commit here

…ent the checking of local vs remote file sizes which was lost during transition to HTTP authentication. In theory this checking should not be needed if the session reconnection and timeout logic works properly, but I have implemented as I have lots of missing gaps in data downloaded to date

…ile size. Instead, script now downloads files to filenames suffixed .part, then removes the suffix once the download has completed. -quick/ruff option removed, --check_sizes added but runs with False as default. Should only need to be used on legacy directories (i.e. where there might be corrupt files not named .part).

1) Removed excess logic (enabling loss of two indent levels) 2) Implemented direct login to earthdata auth service at start of session, remains persistent. This has fixed previous problems associated with lots of 401: access denied requests which were resulting in script stalling for tens of session attempts before eventually connecting. 3) Disabled flush and fsync commands as these were slowing down processing.

atedstone · 2016-11-15T14:54:35Z

I've pushed all my recent changes onto my master branch, have a look at whether you want to merge. All of these changes have been to try to get to the bottom of why the script has been so slow at downloading for me. As of this morning's commits I appear to have solved the speed problems: (1) the existing authentication process was revealed to be very flaky once I started checking the response headers; (2) flush and fpsync don't appear to be needed when using the requests module stream=True option. I'm now downloading a tile a minute, much better than a tile per 30 mins or so as was worst case previously.

… files

…ine (enables download from e.g. NSIDC), 2. convert return_url to use authenticated session to obtain list of dates (required for NSIDC MODIS data) and now parse these dates with beautifulsoup.

…mplement ability to choose data provider as a simplified way of setting baseurl.

jgomezdans

Many thanks for your efforts, I have written a few comments below...

jgomezdans · 2020-03-19T15:48:41Z

get_modis.py

+            with requests.Session() as s:
+                s.mount(base_url, requests.adapters.HTTPAdapter(max_retries=5))
+
+                ## Login to the EarthData Service for this session


Is this authenticity token needed? I think I just used a session with no issues not that long ago... Or is it needed for the download to work with NSIDC (I seem to remember it wasn't necessary either...)?

I wrote this code about three years ago so memory is a bit hazy, but my assumption is that it must have been necessary then otherwise I wouldn't have gone to the trouble of GETing then POSTing the auth token.

(the login process is the same whether USGS or NSIDC - they both go through the EarthData login service)

jgomezdans · 2020-03-19T15:50:34Z

README.rst

@@ -10,7 +10,7 @@ Description

 This repository contains a Python script (and executable) that allows one to download MODIS data granules for different products and periods. 

-The code is quite simple and generic, and should work with most standard Python installations.
+The code is quite simple and generic but requires at least a couple of specific libraries not usually available by default: Beautiful Soup (conda-forge: `bs4`) and lxml (conda-forge: `lxml`).


I'm usually wary of adding dependencies for simple tasks such as scanning a few links. I see it simplifies some of the code somewhat, but... I'm already wary of using requests as I couldn't get urllib to work with redirection.

The returned html is relatively complicated, especially for doing the earthdata login, so this seems a reasonable compromise to me.

jgomezdans · 2020-03-19T15:50:49Z

get_modis.py

-# along with brdf_filter.  If not, see <http://www.gnu.org/licenses/>.
-import optparse
+
+import argparse


Well spotted!

jgomezdans · 2020-03-19T15:54:52Z

get_modis.py

-                        fp.write(chunk)
-                fp.flush()
-                os.fsync(fp)
+    dates_available = None


I have another script that scans folders in parallel, gathers all the URLs and then downloads them in parallel (you're only allowed a couple of slots for download, but scanning is much faster). Maybe worth considering...

thanks for the info - could be useful. At the moment I'm working primarily with MOD10A1. The product sizes are small enough that I don't really need to worry about parallelising download.

atedstone · 2020-03-23T15:04:07Z

I've diverged quite a long way from your implementation now ... so I offer my changes should you want them, but no problem if not! I'm using them all my end so they are serving their purpose.

atedstone added 9 commits June 17, 2016 09:35

add HTTPError, URLError and socket timeout handling to HDF file request

a314ffd

update content-length detection to work with Python 3.x

96d3286

merge upstream commits about sleeping with atedstone local commits

d80af6d

not all of merge was fixed properly, corrected

e077929

define timeout duration (only for Python 3.x)

4860512

merge jgmoezdans upstream changes

e9837ce

add timeout argument to Session get

2253ab4

revert to latest jgomezdans version with addition of timeouts

46c89bc

add max retries

16ccf82

Requests automatically decodes content (, so remove decode() function…

d7a7a10

…. This allows the script to work with Py 3.x.

atedstone added 5 commits October 12, 2016 13:39

first version of multi reconnection

0fd2688

update documentation, add copyright info

0f609c2

atedstone and others added 7 commits November 15, 2016 17:47

bugfix - tile criterion was missing from the check for existing local…

e31e268

… files

Two main changes in this commit: 1. allow choice of base URL at cmd l…

e3407bd

…ine (enables download from e.g. NSIDC), 2. convert return_url to use authenticated session to obtain list of dates (required for NSIDC MODIS data) and now parse these dates with beautifulsoup.

update readme with baseurl option

6686f1f

convert optparse (deprecated) to argparse. Also took opportunity to i…

ec17bc9

…mplement ability to choose data provider as a simplified way of setting baseurl.

correct USGS to https address; remove print statement used in debugging

4d21774

improve readme for revised version with argparse and multiple providers.

a70e728

fix readme typos

8328fbd

jgomezdans reviewed Mar 19, 2020

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Python 3.x related changes #4

Python 3.x related changes #4

atedstone commented Jun 17, 2016

jgomezdans commented Oct 11, 2016

atedstone commented Oct 12, 2016

atedstone commented Oct 12, 2016

atedstone commented Oct 12, 2016

atedstone commented Nov 15, 2016

jgomezdans left a comment

jgomezdans Mar 19, 2020

atedstone Mar 23, 2020

atedstone Mar 23, 2020

jgomezdans Mar 19, 2020

atedstone Mar 23, 2020

jgomezdans Mar 19, 2020

jgomezdans Mar 19, 2020

atedstone Mar 23, 2020

atedstone commented Mar 23, 2020

Python 3.x related changes #4

Are you sure you want to change the base?

Python 3.x related changes #4

Conversation

atedstone commented Jun 17, 2016

jgomezdans commented Oct 11, 2016

atedstone commented Oct 12, 2016

atedstone commented Oct 12, 2016

atedstone commented Oct 12, 2016

atedstone commented Nov 15, 2016

jgomezdans left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

atedstone commented Mar 23, 2020