Since requests.get loads the whole response into RAM by default, capture_http is memory-inefficient for large files:
from warcio.capture_http import capture_http
import requests  # requests must be imported after capture_http

# this uses 10GB of RAM:
with capture_http('example.warc.gz'):
    requests.get('https://example.com/#some_10GB_file')
Simply calling requests.get with stream=True doesn't result in archiving the file, only the headers:
# this doesn't work:
with capture_http('example.warc.gz'):
    requests.get('https://example.com/#some_10GB_file', stream=True)
Fetching and throwing away the data does work:
with capture_http('example.warc.gz'):
    response = requests.get('https://example.com/#some_10GB_file', stream=True)
    for _ in response.iter_content(chunk_size=2**16):
        pass
I haven't dug into why it's necessary to consume the stream -- is that inherent or accidental?
Either way, it might be nice to add one of these examples to the docs, since I think it's correct to do this any time you're not actually using the requests.get() response.
Yes, I think this is mostly a documentation issue. capture_http() intercepts and writes what is actually loaded over the HTTP connection in the background.
with capture_http('example.warc.gz'):
    response = requests.get('https://example.com/#some_10GB_file', stream=True)
    for _ in response.iter_content(chunk_size=2**16):
        pass
Yes, I believe this is the correct way to capture a large file that you're not trying to use / shouldn't load into memory.
I haven't dug into why it's necessary to consume the stream -- is that inherent or accidental?
This is inherent: capture_http() is essentially a low-level interception of the network traffic, so it can only record what is actually transferred over the connection. With stream=True, nothing beyond the headers is transferred until you consume the stream (and you could also consume it only partially, etc.).
Either way, it might be nice to add one of these examples to the docs, since I think it's correct to do this any time you're not actually using the requests.get() response.
Yeah, that's fair. I suppose a wrapper could be added that does the above.
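For reference, here's a rough sketch of what such a wrapper might look like. capture_get is a hypothetical name, not part of warcio's current API, and the sketch assumes the usual requirement that requests be imported after capture_http so the interception takes effect:

# hypothetical convenience wrapper, not part of warcio's current API
from warcio.capture_http import capture_http
import requests  # imported after capture_http so connections are intercepted

def capture_get(warc_path, url, chunk_size=2**16, **kwargs):
    # Fetch url with streaming, drain the body so the full response is
    # written to warc_path, and return the already-consumed response.
    with capture_http(warc_path):
        response = requests.get(url, stream=True, **kwargs)
        for _ in response.iter_content(chunk_size=chunk_size):
            pass  # discard chunks; capture_http records them as they arrive
    return response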