How to Scrape Pages That Require Scrolling? #991
Replies: 1 comment
The issue is that JS-heavy sites like recess.studio load content dynamically on scroll, so the HTML is empty at initial load. Here are a few approaches, from simplest to most flexible:

**Option 1: Full-page scan**

```python
config = CrawlerRunConfig(
    scan_full_page=True,
    scroll_delay=0.5,              # wait between scroll steps
    delay_before_return_html=2.0,  # wait after scrolling finishes
    wait_until="networkidle",      # wait for network to settle
)
```

**Option 2: Targeted JS scrolling (more control)**

```python
config = CrawlerRunConfig(
    js_code="""
    await new Promise(r => {
        let totalHeight = 0;
        const distance = 300;
        const timer = setInterval(() => {
            window.scrollBy(0, distance);
            totalHeight += distance;
            if (totalHeight >= document.body.scrollHeight) {
                clearInterval(timer);
                r();
            }
        }, 200);
    });
    """,
    delay_before_return_html=2.0,
    wait_until="domcontentloaded",
)
```

**Option 3: Per-URL configs**

```python
configs = []
for url in urls:
    if needs_scrolling(url):  # your logic
        configs.append(CrawlerRunConfig(scan_full_page=True, scroll_delay=0.5))
    else:
        configs.append(CrawlerRunConfig())

results = await crawler.arun_many(urls=urls, config=configs)
```

Also note: the param is `scan_full_page`, not `scroll_full_page` as in your snippet.
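The `needs_scrolling(url)` check in Option 3 is left to the caller. A minimal sketch, assuming you maintain a set of domains known to lazy-load on scroll (the domain set below is illustrative):

```python
from urllib.parse import urlparse

# Domains known to lazy-load content on scroll (illustrative examples).
SCROLL_DOMAINS = {"recess.studio"}

def needs_scrolling(url: str) -> bool:
    """Return True if the URL's host is on the scroll-heavy list."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host in SCROLL_DOMAINS
```

Anything fancier, such as probing each page once and caching the verdict, can slot in behind the same function signature.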
Any tips for scraping the first-page content of addresses like this?
By default it returns an empty result in `result.markdown` when running asynchronously.

My goal with this scraper project is to scrape multiple addresses (not too worried about anti-bot measures) that aren't really big websites. I foresee I might run into a few problems, however, with pages that require a bit of scrolling, unless all the HTML is already available on page load before scrolling?
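One way to keep a plain config as the default and only pay the scrolling cost when needed is a retry-on-empty wrapper. A hedged sketch, with the actual crawler call abstracted behind a `crawl` coroutine you supply (the helper name and shape are illustrative, not crawl4ai API):

```python
import asyncio

async def crawl_with_fallback(url, crawl, plain_config, scroll_config):
    """Crawl with the plain config first; if the markdown comes back
    empty (content rendered only after scrolling), retry with the
    scrolling config.  `crawl(url, config)` is whatever coroutine
    wraps your crawler call and returns the markdown string."""
    markdown = await crawl(url, plain_config)
    if markdown and markdown.strip():
        return markdown
    return await crawl(url, scroll_config)
```

With crawl4ai, `crawl` would wrap something like `await crawler.arun(url=url, config=config)` and return `result.markdown`.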
By the way, I've tried the `scroll_full_page=True` setting on the crawler configuration, but that causes problems with other pages, breaking the flexibility and modularity of this application. I was also getting the crawler to raise errors and exceptions when trying to mimic a user with settings like `magic=True` or `simulate_user=True`, and even `adjust_viewport_to_content=True`. This is a snippet of my current script.
In the next step I'd do things like summarize the content with an LLM, but I'm failing to get any content at all during crawling. Does that make sense?
I'd love tips to be able to scrape as many URLs as I can. I'm already having decent success, but this would be a nice improvement.
Thank you!