Post processing of URLs during Deep Crawling. #1673
Replies: 1 comment
There's no built-in URL transform hook. The idea: inject JavaScript that adds a trailing slash to every internal link on each page before crawl4ai extracts links; the deep crawler then follows the rewritten URLs. Tested against a local server:
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2),
    # Rewrite every same-origin link to its trailing-slash form
    # before crawl4ai extracts links from the page.
    js_code="""
    document.querySelectorAll('a[href]').forEach(a => {
        try {
            const url = new URL(a.href, window.location.origin);
            if (url.origin === window.location.origin && !url.pathname.endsWith('/')) {
                url.pathname += '/';
                a.href = url.toString();
            }
        } catch (e) {}
    });
    """,
)

async def main():
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://your-site.com/", config=config)

asyncio.run(main())
```

This runs on every page before link extraction, so the deep crawler sees and follows the trailing-slash versions.
Hello, guys! I have a question I haven't found an answer to in the docs. I need to deep crawl a website, which of course contains internal links to its other pages. The problem is that the site's URL handling is strange (at least to me): https://example.com/pages and https://example.com/pages/ are two different URLs, and the slashless form does not redirect to the one with the trailing slash, yet the pages only contain slashless links. So my question is: is there a native way to append a '/' to the end of each URL? It would really be helpful.
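(Context for other readers: per RFC 3986 the two forms really are distinct resources, so the server's behavior, while inconvenient, is spec-compliant; a crawler can't assume they are equivalent. A quick check with the standard library, using the example.com URLs from the question:)

```python
from urllib.parse import urlsplit

a = urlsplit("https://example.com/pages")
b = urlsplit("https://example.com/pages/")

# The paths differ, so generic URL parsing treats these as two resources.
print(a.path)  # /pages
print(b.path)  # /pages/
assert a.path != b.path
```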