Post processing of URLs during Deep Crawling. #1673
Replies: 1 comment
There's no built-in URL transform hook. The idea: inject JavaScript that adds a trailing slash to every internal link on each page before crawl4ai extracts links; the deep crawler then follows the rewritten URLs. Tested against a local server:
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2),
    # Rewrite every same-origin link to its trailing-slash form
    # before crawl4ai extracts links from the page.
    js_code="""
    document.querySelectorAll('a[href]').forEach(a => {
        try {
            const url = new URL(a.href, window.location.origin);
            if (url.origin === window.location.origin && !url.pathname.endsWith('/')) {
                url.pathname += '/';
                a.href = url.toString();
            }
        } catch (e) {}
    });
    """,
)

async def main():
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://your-site.com/", config=config)

asyncio.run(main())
```

This runs on every page before link extraction, so the deep crawler sees and follows the trailing-slash versions.
Hello, guys! I have a question I haven't found an answer to in the docs. I need to deep crawl a website, which of course contains internal links to its other pages. The problem is that the site's URL handling is strange (at least to me): https://example.com/pages and https://example.com/pages/ are two different URLs, and the slashless form does not redirect to the one with the trailing slash, yet the pages only contain slashless links. So my question is: is there a native way to append a '/' to the end of each URL? It would really be helpful.
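(Context for other readers: per RFC 3986 the two forms really are distinct resources, so the server's behavior, while inconvenient, is spec-compliant; a crawler can't assume they are equivalent. A quick check with the standard library, using the example.com URLs from the question:)

```python
from urllib.parse import urlsplit

a = urlsplit("https://example.com/pages")
b = urlsplit("https://example.com/pages/")

# The paths differ, so generic URL parsing treats these as two resources.
print(a.path)  # /pages
print(b.path)  # /pages/
assert a.path != b.path
```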