feature: download only if content-type matches #9

A common problem when scraping scientific journal articles arises when access to a PDF or other downloadable file is not granted, but the server still returns a `200 OK` status and sends an HTML page telling the user they don't have access. In this case, a scraperJSON client will simply download the HTML page and may rename it to the user's specified filename. The result is confusing: an HTML document ends up mislabelled as some other file type, such as a PDF.

A solution is to allow a `download` to specify one or more permitted content-types, or perhaps a regex that the content-type must match. If the content-type does not match, the download is skipped.
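To make the proposal concrete, here is a minimal Python sketch of what the matching rule might look like. The `content_type` and `content_type_regex` keys, the element name `fulltext_pdf`, and the function `content_type_allowed` are all hypothetical; none of them exist in scraperJSON today, and the selector is only an illustration.

```python
import re

# Hypothetical element definition, written as a Python dict for illustration.
# The "content_type" key is an assumption of this sketch, not an existing
# scraperJSON feature.
fulltext_pdf = {
    "selector": "//meta[@name='citation_pdf_url']",
    "attribute": "content",
    "download": {
        "rename": "fulltext.pdf",
        "content_type": ["application/pdf"],
    },
}


def content_type_allowed(header_value, types=(), pattern=None):
    """Return True if a Content-Type header value is permitted.

    types:   exact media types that are allowed (compared case-insensitively)
    pattern: optional regex that must match the whole media type
    """
    if not types and pattern is None:
        return True  # no restriction configured, so anything is allowed
    if not header_value:
        return False  # a restriction exists but the server sent no type
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    if media_type in (t.lower() for t in types):
        return True
    return (
        pattern is not None
        and re.fullmatch(pattern, media_type, re.IGNORECASE) is not None
    )
```

Exact types and the regex are kept as separate options here because media types can contain regex metacharacters (for example the `+` in `application/xhtml+xml`), so treating every allowed value as a regex could silently fail to match.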

The client would implement this by first sending a `HEAD` request to the download URL, then checking the `Content-Type` HTTP header in the response, and only then deciding whether to proceed with the full download.
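The `HEAD`-then-download flow could be sketched roughly as follows, using only the Python standard library. This is an illustration under assumptions rather than a client implementation: the function names `fetch_content_type` and `download_if_type_matches` are made up for this example, and only exact media types are checked to keep it short.

```python
import urllib.request


def fetch_content_type(url, timeout=10):
    """Send a HEAD request and return the Content-Type header, or None."""
    request = urllib.request.Request(url, method="HEAD")
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")


def download_if_type_matches(url, dest, allowed_types,
                             head=fetch_content_type,
                             fetch=urllib.request.urlretrieve):
    """Download url to dest only if its Content-Type is in allowed_types.

    head and fetch are injectable so the logic can be exercised
    without touching the network. Returns True if the file was
    downloaded and False if the download was skipped.
    """
    header_value = head(url) or ""
    # Ignore parameters such as "; charset=utf-8" and normalise case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    if media_type not in {t.lower() for t in allowed_types}:
        return False  # e.g. an HTML "access denied" page served with 200 OK
    fetch(url, dest)
    return True
```

A caller might use it as `download_if_type_matches(pdf_url, "fulltext.pdf", ["application/pdf"])`, where `pdf_url` is whatever URL the scraper extracted, and skip the rename step whenever it returns `False`.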

