Document memory-efficient use of capture_http #187

Open
jcushman opened this issue Jan 16, 2025 · 1 comment

@jcushman
Since requests.get loads the whole response into RAM by default, using capture_http with it is memory-inefficient for large files:

from warcio.capture_http import capture_http  # note: import before requests
import requests

# this uses 10GB of RAM:
with capture_http('example.warc.gz'):
    requests.get('https://example.com/#some_10GB_file')

Simply calling requests.get with stream=True doesn't archive the file, only the headers:

# this doesn't work:
with capture_http('example.warc.gz'):
    requests.get('https://example.com/#some_10GB_file', stream=True)

Fetching and throwing away the data does work:

with capture_http('example.warc.gz'):
    response = requests.get('https://example.com/#some_10GB_file', stream=True)
    for _ in response.iter_content(chunk_size=2**16):
        pass

I haven't dug into why it's necessary to consume the stream -- is that inherent or accidental?

Either way, it might be nice to add one of these examples to the docs, since I think it's correct to do this any time you're not actually using the requests.get() response.

@ikreymer
Member

Yes, I think this is mostly a documentation issue. capture_http() intercepts and writes what is actually loaded over the HTTP connection in the background.

with capture_http('example.warc.gz'):
    response = requests.get('https://example.com/#some_10GB_file', stream=True)
    for _ in response.iter_content(chunk_size=2**16):
        pass

Yes, I believe this is the correct way to capture a large file that you're not trying to use / shouldn't load into memory.

I haven't dug into why it's necessary to consume the stream -- is that inherent or accidental?

This is inherent: if you don't consume the stream, the body is never fetched, and you could also consume it only partially, and so on. capture_http() is essentially a low-level interception of the network traffic.

Either way, it might be nice to add one of these examples to the docs, since I think it's correct to do this any time you're not actually using the requests.get() response.

Yeah, that's fair. I suppose a wrapper could be added that does the above.
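
A minimal sketch of what such a wrapper might look like, assuming a hypothetical helper named fetch_and_discard (not part of warcio's API) that simply packages the consume-and-discard pattern above:

from warcio.capture_http import capture_http  # note: import before requests
import requests

def fetch_and_discard(url, chunk_size=2**16, **kwargs):
    # Hypothetical helper: stream the response so capture_http records the
    # full body, but discard each chunk instead of holding it in memory.
    response = requests.get(url, stream=True, **kwargs)
    for _ in response.iter_content(chunk_size=chunk_size):
        pass
    return response

# usage: capture a large file without loading it into RAM
with capture_http('example.warc.gz'):
    fetch_and_discard('https://example.com/#some_10GB_file')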
