HTTPS via a Proxy #64

Open
PsypherPunk opened this issue Aug 24, 2016 · 1 comment

Comments

@PsypherPunk

I've been trying to crawl an HTTPS site through a Squid proxy and keep seeing errors like these:

java.io.IOException: RIS already open for ToeThread #12: https://XXX/robots.txt
   at org.archive.io.RecordingInputStream.open(RecordingInputStream.java:84)
   at org.archive.util.Recorder.inputWrap(Recorder.java:185)
   at org.archive.modules.fetcher.FetchHTTPRequest$RecordingHttpClientConnection.getSocketInputStream(FetchHTTPRequest.java:649)
   at org.apache.http.impl.BHttpConnectionBase.ensureOpen(BHttpConnectionBase.java:131)

HTTP sites are fine, but HTTPS just doesn't seem to work. The problem seems to come down to RecordingInputStream and RecordingOutputStream, both of which throw an IOException if their underlying stream is not null (that is, if they are still open from an earlier use).

If, however, I comment out those checks, the HTTPS crawl works perfectly (as far as I can tell...). I'm not sure whether this is the webarchive-commons library being overly cautious or heritrix3 failing to do something for HTTPS sites.

@kris-sigur
Member

My first thought is that, when crawling HTTPS via a proxy, Heritrix fails to properly close the RecordingInputStream (these are thread-local), so the next attempt to open it on the same ToeThread finds it still open.
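
To make the failure mode discussed above concrete, here is a minimal sketch of an "already open" guard and of how a missing close between two opens would trip it. The class `GuardedStreamSketch` and its method `reopenFails` are hypothetical and are not the actual webarchive-commons or Heritrix code. The idea that the socket gets wrapped a second time after the proxy tunnel is set up is an assumption used for illustration, not something confirmed in this thread.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical, simplified stand-in for a recording stream.
// This is NOT the real org.archive.io.RecordingInputStream.
class GuardedStreamSketch {
    private InputStream in; // null means "not open"

    // Mirrors the guard described in the issue: refuse to open twice.
    void open(InputStream wrapped) throws IOException {
        if (in != null) {
            throw new IOException("RIS already open for "
                    + Thread.currentThread().getName());
        }
        in = wrapped;
    }

    // Closes the wrapped stream and resets the guard.
    void close() throws IOException {
        if (in != null) {
            in.close();
            in = null;
        }
    }

    // Opens the same recorder twice, optionally closing in between.
    // Returns true if the second open() was rejected by the guard.
    static boolean reopenFails(boolean closeBetween) {
        GuardedStreamSketch ris = new GuardedStreamSketch();
        try {
            // Assumed first wrap: e.g. the plain connection to the proxy.
            ris.open(new ByteArrayInputStream(new byte[0]));
            if (closeBetween) {
                ris.close();
            }
            // Assumed second wrap: e.g. the same connection after the tunnel
            // to the HTTPS origin has been established.
            ris.open(new ByteArrayInputStream(new byte[0]));
            return false;
        } catch (IOException e) {
            return true;
        }
    }
}
```

In this sketch, `GuardedStreamSketch.reopenFails(false)` hits the guard while `GuardedStreamSketch.reopenFails(true)` does not. That would be consistent with the crawl succeeding once the checks are commented out, although removing the guard would hide the missing close rather than fix it.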
