Document memory-efficient use of capture_http #187

Open
jcushman opened this issue Jan 16, 2025 · 1 comment

@jcushman
Since requests.get loads the whole response into RAM by default, using capture_http with it is memory-inefficient for large files:

from warcio.capture_http import capture_http  # note: import before requests
import requests

# this uses 10GB of RAM:
with capture_http('example.warc.gz'):
    requests.get('https://example.com/#some_10GB_file')

Simply calling requests.get with stream=True doesn't archive the file, only the headers:

# this doesn't work:
with capture_http('example.warc.gz'):
    requests.get('https://example.com/#some_10GB_file', stream=True)

Fetching and throwing away the data does work:

with capture_http('example.warc.gz'):
    response = requests.get('https://example.com/#some_10GB_file', stream=True)
    for _ in response.iter_content(chunk_size=2**16):
        pass

I haven't dug into why it's necessary to consume the stream -- is that inherent or accidental?

Either way, it might be nice to add one of these examples to the docs, since I think it's correct to do this any time you're not actually using the requests.get() response.

@ikreymer
Member

Yes, I think this is mostly a documentation issue. capture_http() intercepts and writes what is actually loaded over the HTTP connection in the background.

with capture_http('example.warc.gz'):
    response = requests.get('https://example.com/#some_10GB_file', stream=True)
    for _ in response.iter_content(chunk_size=2**16):
        pass

Yes, I believe this is the correct way to capture a large file that you're not trying to use / shouldn't load into memory.

I haven't dug into why it's necessary to consume the stream -- is that inherent or accidental?

This is inherent: if you don't consume the stream, the body is never fetched, and you could also consume it only partially, and so on. capture_http() is essentially a low-level interception of the network traffic.

Either way, it might be nice to add one of these examples to the docs, since I think it's correct to do this any time you're not actually using the requests.get() response.

Yeah, that's fair. I suppose a wrapper could be added that does the above.
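
A minimal sketch of what such a wrapper might look like, assuming a hypothetical helper named fetch_and_discard (not part of warcio's API) that simply packages the consume-and-discard pattern above:

from warcio.capture_http import capture_http  # note: import before requests
import requests

def fetch_and_discard(url, chunk_size=2**16, **kwargs):
    # Hypothetical helper: stream the response so capture_http records the
    # full body, but discard each chunk instead of holding it in memory.
    response = requests.get(url, stream=True, **kwargs)
    for _ in response.iter_content(chunk_size=chunk_size):
        pass
    return response

# usage: capture a large file without loading it into RAM
with capture_http('example.warc.gz'):
    fetch_and_discard('https://example.com/#some_10GB_file')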
