"RIS already open for ToeThread..." exception during https pages crawl over proxy #191
Comments
I can confirm this. The exception is thrown only for HTTPS hosts; plain HTTP works fine with a proxy. What's worse, as soon as Heritrix encounters an HTTPS URL it runs into a -404 "Empty HTTP response interpreted as a 404" error. (This may be a coincidence, but the correlation looks suspicious enough to me.) This could be related to iipc/webarchive-commons#64, where @kris-sigur hinted at a possible cause:
Looking at the source code, I have to admit that I have no idea where this happens (or whether it is in fact the cause of this behaviour), so I cannot offer a bugfix. It would be great if someone else can! 😄 Thanks,
Any update on this? I am facing the same problem. I noticed several problems here:

CONNECT command problem
Heritrix/HttpClient sends the CONNECT command incorrectly, and some proxies don't accept it. I tried with Warcprox and Charles proxy, and both complain about it. Changing the ROUTE_PLANNER in FetchHTTPRequest to specify the HttpHost port explicitly instead of passing -1 solves this problem; the CONNECT command is then sent correctly.

The "RIS already open" problem
What I concluded is that while opening a tunnel for HTTPS, HttpClient calls getSocketInputStream() twice: the first call wraps a java.net.SocketInputStream and the second wraps a sun.security.ssl.AppInputStream. Heritrix has no way of knowing about this behaviour, since it delegates the connection operations to HttpClient. Also, if I try to properly close the java.net.SocketInputStream before wrapping the sun.security.ssl.AppInputStream, it then complains that the socket is closed when it tries to write.
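A minimal sketch of the idea behind the port change, using only the JDK: resolve an explicit port for the target instead of leaving it at -1, so that the CONNECT request line carries a full host:port authority. The class and method names here are hypothetical and this is not the actual Heritrix change. The comment above does not say exactly how the CONNECT line was malformed, so treating the missing port as the culprit is an assumption.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of Heritrix or HttpClient.
final class ConnectTarget {

    // Returns the port given in the URI, or the scheme default when the URI omits it
    // (URI.getPort() returns -1 in that case).
    static int effectivePort(URI uri) {
        if (uri.getPort() != -1) {
            return uri.getPort();
        }
        String scheme = uri.getScheme();
        if ("https".equalsIgnoreCase(scheme)) {
            return 443;
        }
        if ("http".equalsIgnoreCase(scheme)) {
            return 80;
        }
        throw new IllegalArgumentException("No default port for scheme: " + scheme);
    }

    // Builds a CONNECT request line whose target is an explicit host:port pair.
    static String connectRequestLine(URI uri) {
        return "CONNECT " + uri.getHost() + ":" + effectivePort(uri) + " HTTP/1.1";
    }

    public static void main(String[] args) {
        // prints "CONNECT www.example.org:443 HTTP/1.1"
        System.out.println(connectRequestLine(URI.create("https://www.example.org/robots.txt")));
    }
}
```

The same resolution step could be applied wherever the target host object is constructed, so that the proxy never receives a target without a usable port.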
When I try to crawl https pages over a proxy with Heritrix 3, I get the following exception:
java.io.IOException: RIS already open for ToeThread #5: https://www.XXX/robots.txt
    at org.archive.io.RecordingInputStream.open(RecordingInputStream.java:84)
    at org.archive.util.Recorder.inputWrap(Recorder.java:185)
    at org.archive.modules.fetcher.FetchHTTPRequest$RecordingHttpClientConnection.getSocketInputStream(FetchHTTPRequest.java:648)
    at org.apache.http.impl.BHttpConnectionBase.ensureOpen(BHttpConnectionBase.java:131)
    at org.apache.http.impl.DefaultBHttpClientConnection.sendRequestHeader(DefaultBHttpClientConnection.java:140)
    at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:203)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:121)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:254)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:195)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:86)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:72)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
    at org.archive.modules.fetcher.FetchHTTPRequest.execute(FetchHTTPRequest.java:751)
    at org.archive.modules.fetcher.FetchHTTP.innerProcess(FetchHTTP.java:658)
    at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
    at org.archive.modules.Processor.process(Processor.java:142)
    at org.archive.modules.ProcessorChain.process(ProcessorChain.java:138)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)
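The trace shows the exception being raised from RecordingInputStream.open, reached through Recorder.inputWrap and getSocketInputStream. As a rough illustration of that failure pattern only, here is a self-contained stand-in that refuses a second wrap while the first is still open. OneShotRecorder and its methods are hypothetical names invented for this sketch; it is not the Heritrix implementation, and the real classes may track state differently.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in: a recorder that permits only one wrapped stream at a time.
final class OneShotRecorder {
    private InputStream wrapped; // non-null while a stream is considered open

    // Wraps a stream for recording; fails if a previous wrap has not been closed.
    InputStream inputWrap(InputStream in) throws IOException {
        if (wrapped != null) {
            throw new IOException("recorder already open");
        }
        wrapped = in;
        return in;
    }

    // Marks the recorder as free for the next stream.
    void close() {
        wrapped = null;
    }

    public static void main(String[] args) throws IOException {
        OneShotRecorder recorder = new OneShotRecorder();
        // First wrap: stands in for the plain socket stream.
        recorder.inputWrap(new ByteArrayInputStream(new byte[0]));
        try {
            // Second wrap without close(): stands in for the TLS stream on the same connection.
            recorder.inputWrap(new ByteArrayInputStream(new byte[0]));
        } catch (IOException e) {
            System.out.println("second wrap rejected: " + e.getMessage());
        }
    }
}
```

Any fix along these lines would need to let the recorder tolerate or replace the first wrap when the connection is upgraded, rather than closing the underlying socket stream.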