How to aggressively optimize for speed (Target < 5s) even with content loss? #1578
Your [FETCH] bottleneck is likely the page load wait. Here's the most aggressive speed config:

```python
browser_config = BrowserConfig(
    headless=True,
    text_mode=True,   # disables images at browser level
    light_mode=True,  # disables background features
    avoid_ads=True,   # blocks ad/tracker requests
    avoid_css=True,   # blocks CSS loading
)

config = CrawlerRunConfig(
    wait_until="commit",         # fastest: don't wait for the DOM to finish
    page_timeout=5000,           # hard 5 s cutoff for navigation
    delay_before_return_html=0,  # no extra delay
    simulate_user=False,
    remove_overlay_elements=False,
    # Don't use magic=True: it adds stealth overhead
)
```

The key changes from your config: dropping `magic` and `enable_stealth`, using `wait_until="commit"`, and cutting `page_timeout` to 5000 ms.
If you still need dynamic (JS-rendered) content but faster, consider …
Hello! I am trying to optimize scraping speed as much as possible for my use case.
My Problem
Currently, my [FETCH] time for a page is around 10-11 seconds, even with aggressive resource blocking. My [SCRAPE] time is fast (1-2 seconds).
I am crawling only one link at a time, but I still use `async with AsyncWebCrawler`.
Here is a log example:

```
[FETCH]... ↓ https://www.localeclectic.com/products/example | ✓ | ⏱: 10.39s
[SCRAPE].. ◆ https://www.localeclectic.com/products/example | ✓ | ⏱: 1.12s
[COMPLETE] ● https://www.localeclectic.com/products/example | ✓ | ⏱: 11.52s
```
My Goal
My hard requirement is to get a response under 5 seconds.
The most important point: I am willing to sacrifice content accuracy, but the scraper must still work with dynamic pages.
My Configuration
Here is the configuration I am using:
```json
{
  "browser": {
    "headers": { "Accept-Language": "en-US,en;q=0.9" },
    "user_agent_mode": "random",
    "enable_stealth": true,
    "headless": true,
    "browser_mode": "dedicated"
  },
  "run": {
    "magic": true,
    "simulate_user": false,
    "override_navigator": true,
    "remove_overlay_elements": false,
    "page_timeout": 10000,
    "delay_before_return_html": 0.1,
    "exclude_all_images": true,
    "markdown_generator": { "content_source": "raw_html", "options": { "ignore_links": true } }
  },
  "block_resource_types": ["image", "font", "media"],
  "block_hosts": []
}
```
My Questions
Given my goal is speed above all, what is the recommended "fastest possible" configuration?
Is there a way to set a hard global timeout of 5 seconds for the entire [FETCH] operation? I want to interrupt the whole process after 5 seconds. Any tips or life hacks would be very helpful!
Thank you!