Reduce number of unwanted very large downloads #1931

MattBlissett · 2024-04-18T14:07:15Z

Many of the very largest downloads (>100GB) are requested, but then never downloaded by the user. This is a significant waste of our resources and the user's time, and is probably frustrating for the user when they realize they cannot use GBIF data as they had hoped.

We have popup banners for large and most/all occurrence downloads, but since those were implemented, cookie banners have spread all over the web, so users are even less likely to read them. I suggest instead changing the download page itself, providing different options and in some cases removing the existing DWCA and Simple options.

Ideas (more ideas were added later):

- If the download is extremely large, e.g. more than half the total dataset, direct the user to the existing monthly downloads and the cloud-hosted monthly snapshots, along with an encouragement to register a derived dataset later.
- For filters matching 500 million rows or more, remove or disable the DWCA and Simple buttons, and instead provide the predicate for creating the download through the API.
  - We already have a UI form for entering predicates and downloading through the UI, so this suggestion might not help much, as that form is quite easy to use and requires no technical skills.
- In either case, advise the user to add additional filters, perhaps directly: "You might add a filter for a taxon, location or date."
- Always include information about creating a derived dataset if you do post-filtering.
- Avoid large blocks of text, which are now common in web popups (cookies etc.), and format with some icons for Excel, R etc.
- Do not mint DOIs unless the file is downloaded.
- Force users to select between TEST_DOWNLOAD | STORE_COPY_FOR_CITATION.
- Allow users to get a random sample within a filter. The idea is that perhaps they are mostly testing whether they can work with the data.
- Stop serving live data; always use a snapshot that is at most 1 month old. All downloads are then simply a filter plus a snapshot version reference (publishers can test data in UAT).
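A minimal sketch of how the size-based ideas above could combine on the download page. The function and field names are hypothetical and not taken from any GBIF codebase; the only values that come from this issue are the 500 million-row cutoff and the "more than half the total dataset" rule. The request body assumes the general shape of an occurrence download request (a `format` plus a `predicate`); a real request needs further fields such as the creator, so check the API documentation before relying on it.

```python
from dataclasses import dataclass

# Cutoff taken from the idea list above; everything else is illustrative.
LARGE_FILTER_ROWS = 500_000_000


@dataclass
class DownloadOptions:
    """What the download page would offer for a given filter (hypothetical)."""
    show_dwca_and_simple: bool
    show_api_predicate: bool
    suggest_snapshots: bool
    suggest_more_filters: bool


def choose_options(matching_rows: int, total_rows: int) -> DownloadOptions:
    """Pick page options from the filter's row count and the full dataset size."""
    more_than_half = matching_rows * 2 > total_rows
    very_large = matching_rows >= LARGE_FILTER_ROWS
    return DownloadOptions(
        # Hide the DWCA and Simple buttons only for very large filters.
        show_dwca_and_simple=not very_large,
        # Offer the predicate for use with the API instead.
        show_api_predicate=very_large,
        # Point to monthly downloads / cloud snapshots for most-of-dataset requests.
        suggest_snapshots=more_than_half,
        # "In either case", nudge the user towards narrower filters.
        suggest_more_filters=more_than_half or very_large,
    )


def api_request_body(predicate: dict, download_format: str = "SIMPLE_CSV") -> dict:
    """Partial request body the page could display (assumed shape, incomplete)."""
    return {"format": download_format, "predicate": predicate}
```

For example, `choose_options(600_000_000, 2_000_000_000)` would hide the DWCA and Simple buttons and show the predicate, without suggesting the snapshots, because the filter matches less than half of the assumed total.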

CecSve · 2024-05-06T11:21:36Z

I think the following options are good suggestions:

Force users to select between TEST_DOWNLOAD | STORE_COPY_FOR_CITATION

However, the difference between the two options should somehow be made clear to the user.

Do not mint DOIs unless the file is downloaded. And delete the file if not downloaded within 6 months.

Although it is a good idea, I am not sure how it could be coupled with the above suggestion.
How about we define a threshold and make a policy that we do not store data above this threshold for more than XX days/months unless the user actively requests us to?
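One way such a policy could be expressed, as a sketch only. The function name and parameters are hypothetical; the size threshold and retention period are deliberately left as arguments because no values have been agreed in this thread.

```python
from datetime import datetime, timedelta


def should_delete(size_bytes: int, created: datetime, now: datetime,
                  size_threshold_bytes: int, retention: timedelta,
                  keep_requested: bool) -> bool:
    """Return True if a stored download file is due for deletion under the policy."""
    if keep_requested:
        return False  # the user actively asked us to keep the file
    if size_bytes <= size_threshold_bytes:
        return False  # the policy only applies above the size threshold
    return now - created > retention


# Usage: the 100 GB and 30 days here are arbitrary placeholders, not proposed values.
now = datetime(2024, 5, 6)
created = now - timedelta(days=40)
expired = should_delete(200 * 10**9, created, now,
                        size_threshold_bytes=100 * 10**9,
                        retention=timedelta(days=30),
                        keep_requested=False)
```

Keeping the "user asked us to keep it" check first means an explicit request always wins over size and age.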

If the download is extremely large, e.g. more than half the total dataset, direct the user to the existing monthly downloads and the cloud-hosted monthly snapshots. Along with an encouragement to register a derived dataset later

In either case advise the user to add additional filters, perhaps directly. "You might add a filter for a taxon, location or date."

Maybe we could again set a threshold, e.g. show this message unless at least 3-5 filters are applied?
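A rough sketch of such a check, assuming the filter is available as a nested predicate in which `and`/`or` nodes carry a `predicates` list and `not` nodes carry a single `predicate`. The function names are hypothetical, and the minimum is a parameter because the exact number is still open.

```python
def count_filters(predicate: dict) -> int:
    """Count the leaf conditions in a nested predicate (assumed and/or/not tree)."""
    kind = predicate.get("type")
    if kind in ("and", "or"):
        return sum(count_filters(p) for p in predicate.get("predicates", []))
    if kind == "not":
        return count_filters(predicate["predicate"])
    return 1  # any other node is treated as a single filter


def should_suggest_filters(predicate: dict, min_filters: int) -> bool:
    """True if the 'add more filters' message should be shown."""
    return count_filters(predicate) < min_filters


# Usage: a filter on one taxon and one country counts as two filters.
example = {
    "type": "and",
    "predicates": [
        {"type": "equals", "key": "TAXON_KEY", "value": "212"},
        {"type": "equals", "key": "COUNTRY", "value": "DK"},
    ],
}
```

Counting filters is a crude proxy: a single taxon filter can be far more selective than several broad ones, so a check on the matching row count may be a useful complement.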

Always include information about creating a derived dataset if you do post-filtering

Yes, this would be helpful from a helpdesk perspective - it is one of the more common questions that pop up and it would be great if more users could be made aware of the option.

@ahahn-gbif and @timrobertson100 what do you think about the suggestions?

MattBlissett commented Apr 18, 2024 • edited by MortenHofft

CecSve commented May 6, 2024 • edited
