Description
A common problem when scraping scientific journal articles is that access to the PDF or other file download is not granted, yet the server returns a 200 OK status and sends an HTML document telling the user they don't have access. In this case, a scraperJSON client will simply download the HTML page and may rename it to the user's specified filename, leaving an HTML document mislabelled as some other filetype.
A solution is to allow a download to specify one or more permitted content-types, or perhaps a regex that the content-type must match. If the content-type does not match, the download is skipped.
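As a rough illustration of the matching rule, here is a minimal Python sketch. The function name `content_type_allowed` and its `allowed` and `pattern` parameters are hypothetical and not part of scraperJSON; the sketch also assumes that a download with no restrictions configured should always be permitted, and that parameters such as `charset` are ignored when comparing.

```python
import re


def content_type_allowed(header_value, allowed=(), pattern=None):
    """Return True if a Content-Type header satisfies the download's rules.

    `allowed` is a collection of permitted media types and `pattern` is an
    optional regex; both names are assumptions made for this sketch.
    """
    # No restriction configured: behave as before and allow the download.
    if not allowed and pattern is None:
        return True

    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = (header_value or "").split(";", 1)[0].strip().lower()
    if not media_type:
        return False

    # Exact match against the list of permitted types.
    if media_type in {item.lower() for item in allowed}:
        return True

    # Otherwise the regex, if given, must match the whole media type.
    if pattern is not None and re.fullmatch(pattern, media_type, re.IGNORECASE):
        return True

    return False
```

With this rule, a paywall page served as `text/html; charset=utf-8` would be rejected for a download that only permits `application/pdf`.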
The client would implement this by first performing a HEAD request to the download URL, then evaluating the Content-Type HTTP header, and then deciding whether to proceed with the full download.
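The HEAD-then-decide flow could look roughly like the following Python sketch, which uses only the standard library. The names `head_content_type` and `should_download`, and the `fetch_header` parameter, are hypothetical and chosen for illustration; a real client would likely also need to handle servers that reject or mishandle HEAD requests, which this sketch does not attempt.

```python
import urllib.request


def head_content_type(url, timeout=10):
    """Send a HEAD request and return the raw Content-Type header ('' if absent)."""
    request = urllib.request.Request(url, method="HEAD")
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type", "")


def should_download(url, permitted_types, fetch_header=head_content_type):
    """Decide whether to proceed with the full download of `url`.

    `fetch_header` is injectable so the decision can be exercised
    without touching the network.
    """
    header = fetch_header(url)
    # Compare only the media type, ignoring parameters like charset.
    media_type = header.split(";", 1)[0].strip().lower()
    return media_type in {t.lower() for t in permitted_types}
```

A client would call `should_download(url, {"application/pdf"})` before fetching the file, and skip the download when it returns `False`.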