[Bug]: downloaded_files race condition causes cross-contamination between CrawlResults #1889

@bkennedy-improving

Description

Summary

When accept_downloads=True and crawling multiple URLs (via arun with a deep crawl strategy, or arun_many), the downloaded_files list on a CrawlResult can contain files triggered by a previous URL's navigation. A slow download started by URL A completes after URL B begins processing, causing the file to appear on URL B's result instead of (or in addition to) URL A's.

AsyncPlaywrightCrawlerStrategy._downloaded_files is a mutable list on self. The lifecycle is:

  1. Reset at the top of _crawl_web (line ~525):

    self._downloaded_files = []
  2. Populated by a fire-and-forget download listener registered per-page (line ~658):

    page.on(
        "download",
        lambda download: asyncio.create_task(
            self._handle_download(download)
        ),
    )
  3. Read at the end of _crawl_web when building the response (line ~1059):

    downloaded_files=(
        self._downloaded_files if self._downloaded_files else None
    ),

The problem is step 2: _handle_download is an asyncio.create_task — it runs concurrently. When a download is slow (large file, slow server), the sequence becomes:

URL A navigation starts → browser triggers download for URL A
URL A _crawl_web returns → downloaded_files = [] (download still in progress)
URL B _crawl_web starts → self._downloaded_files = [] (reset)
URL A's download completes → _handle_download appends to self._downloaded_files
URL B _crawl_web returns → downloaded_files = [URL_A's_file]  ← WRONG

The file that belongs to URL A ends up on URL B's CrawlResult.downloaded_files.

In batch/deep-crawl mode the page is reused across URLs, and the download listener's closure captures self (not a snapshot of the list), so late-completing downloads always append to whichever self._downloaded_files is current at completion time.
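The race can be reproduced without a browser at all. A self-contained toy model of the three-step lifecycle above (ToyStrategy and the timings are illustrative, not crawl4ai code):

```python
import asyncio

class ToyStrategy:
    """Toy model of AsyncPlaywrightCrawlerStrategy's download lifecycle."""
    def __init__(self):
        self._downloaded_files = []

    async def _handle_download(self, name, delay):
        await asyncio.sleep(delay)                # slow file transfer
        self._downloaded_files.append(name)       # appends to whichever list is current *now*

    async def crawl(self, url, download=None, work=0.02):
        self._downloaded_files = []               # step 1: reset
        if download:                              # step 2: fire-and-forget
            self._task = asyncio.create_task(self._handle_download(download, 0.05))
        await asyncio.sleep(work)                 # page navigation/processing
        return list(self._downloaded_files) or None   # step 3: read

async def main():
    s = ToyStrategy()
    a = await s.crawl("https://a.example/file.xlsx", download="file.xlsx")
    b = await s.crawl("https://b.example/page.html", work=0.08)
    return a, b

a, b = asyncio.run(main())
print(a)  # None — URL A's download hadn't finished when step 3 ran
print(b)  # ['file.xlsx'] — URL A's file lands on URL B's result
```

URL A's download (0.05 s) outlives URL A's crawl (0.02 s) and completes while URL B's crawl (0.08 s) is still in flight, so the append hits URL B's freshly reset list.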

Compounding factor: duplicate file references

Because the list is reset but the listener persists, the same downloaded file can also appear on multiple subsequent CrawlResult objects if another download event fires (or the listener is re-registered without removing the old one).

Impact

Any consumer that trusts CrawlResult.downloaded_files to identify what document a given URL produced will misattribute content. In our use case (grant document scraping), this caused:

  • PDF content attributed to an .aspx HTML page
  • XLSX scoring template content attributed to a PDF URL
  • HTML pages misclassified as binary documents
  • Extracted text from the wrong document entirely, silently corrupting downstream data

Suggested fix

The root cause is that _downloaded_files is shared mutable state on self with no synchronization between the fire-and-forget download task and the per-URL crawl lifecycle.

Option A: Scope downloads per-URL with a unique token

Give each _crawl_web invocation a unique ID. Pass it into _handle_download via the closure. Store downloads in a dict[str, list[str]] keyed by invocation ID. Read only the current invocation's list when building the response.

# In _crawl_web:
crawl_id = uuid4().hex
self._downloads_by_crawl[crawl_id] = []

page.on(
    "download",
    lambda download, _id=crawl_id: asyncio.create_task(
        self._handle_download(download, _id)
    ),
)

# ... later:
downloaded_files = self._downloads_by_crawl.pop(crawl_id, []) or None
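A toy version of this fix (same illustrative timings as the bug description above, not crawl4ai code) shows that URL B no longer inherits URL A's file. Note that a download outliving URL A's crawl is still dropped from URL A's result, as it is today; combining with Option B would also capture it:

```python
import asyncio
from uuid import uuid4

class ScopedStrategy:
    """Downloads keyed by a per-invocation crawl_id (Option A sketch)."""
    def __init__(self):
        self._downloads_by_crawl = {}

    async def _handle_download(self, name, delay, crawl_id):
        await asyncio.sleep(delay)
        # A late completion lands in an orphan bucket, never on another URL's list
        self._downloads_by_crawl.setdefault(crawl_id, []).append(name)

    async def crawl(self, url, download=None, work=0.02):
        crawl_id = uuid4().hex
        self._downloads_by_crawl[crawl_id] = []
        if download:
            self._task = asyncio.create_task(
                self._handle_download(download, 0.05, crawl_id))
        await asyncio.sleep(work)
        return self._downloads_by_crawl.pop(crawl_id, []) or None

async def main():
    s = ScopedStrategy()
    a = await s.crawl("https://a.example/file.xlsx", download="file.xlsx")
    b = await s.crawl("https://b.example/page.html", work=0.08)
    return a, b

a, b = asyncio.run(main())
print(a, b)  # None None — no cross-contamination
```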

Option B: Await pending downloads before returning

After page processing completes but before building the AsyncCrawlResponse, await any in-flight download tasks. This ensures that downloads triggered by URL A are captured on URL A's result, not URL B's.

# Track download tasks instead of fire-and-forget
task = asyncio.create_task(self._handle_download(download))
self._pending_downloads.append(task)

# Before returning AsyncCrawlResponse:
if self._pending_downloads:
    await asyncio.gather(*self._pending_downloads, return_exceptions=True)
    self._pending_downloads.clear()
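In the same toy model (illustrative, not crawl4ai code), draining pending download tasks before reading the list yields the correct attribution, at the cost of URL A's crawl blocking until its download finishes:

```python
import asyncio

class AwaitingStrategy:
    """Track and drain download tasks before building the response (Option B sketch)."""
    def __init__(self):
        self._downloaded_files = []
        self._pending_downloads = []

    async def _handle_download(self, name, delay):
        await asyncio.sleep(delay)
        self._downloaded_files.append(name)

    async def crawl(self, url, download=None):
        self._downloaded_files = []
        if download:
            self._pending_downloads.append(
                asyncio.create_task(self._handle_download(download, 0.05)))
        await asyncio.sleep(0.02)                 # page processing
        if self._pending_downloads:               # drain before reading the list
            await asyncio.gather(*self._pending_downloads, return_exceptions=True)
            self._pending_downloads.clear()
        return list(self._downloaded_files) or None

async def main():
    s = AwaitingStrategy()
    a = await s.crawl("https://a.example/file.xlsx", download="file.xlsx")
    b = await s.crawl("https://b.example/page.html")
    return a, b

a, b = asyncio.run(main())
print(a, b)  # ['file.xlsx'] None — the download lands on URL A's result
```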

Option C: Attach downloads to the navigation request

Associate each download with the navigation that triggered it via Playwright's Download.page property and cross-reference with the current page URL. This is the most precise but requires careful handling of redirects.
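A sketch of the keying logic. The FakeDownload stub stands in for Playwright's Download object — `download.page.url` and `download.path()` are real Playwright API, the surrounding bookkeeping is hypothetical:

```python
import asyncio

class FakeDownload:
    """Stub for playwright's Download; only the fields this sketch needs."""
    def __init__(self, page_url, file_path, delay):
        self.page_url = page_url        # real API equivalent: download.page.url
        self._path, self._delay = file_path, delay

    async def path(self):               # real API: resolves once saved to disk
        await asyncio.sleep(self._delay)
        return self._path

downloads_by_url = {}

def on_download(download):
    # Capture the originating URL at event time, not at completion time
    origin = download.page_url
    async def save():
        p = await download.path()
        downloads_by_url.setdefault(origin, []).append(p)
    return asyncio.create_task(save())

async def main():
    # Download fired while on URL A, completing long after navigation moved on
    t = on_download(FakeDownload("https://a.example/file.xlsx",
                                 "/tmp/file.xlsx", 0.05))
    await t
    return downloads_by_url

print(asyncio.run(main()))  # {'https://a.example/file.xlsx': ['/tmp/file.xlsx']}
```

Because the originating URL is read when the event fires, a completion that arrives during a later URL's crawl still files itself under the correct key.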

Our workaround

We implemented three guards in our post-crawl processing layer:

  1. seen_downloads set — accumulate processed file paths across the stream; skip files already seen.
  2. HTML page guard — if the page loaded successfully with non-empty markdown, ignore any downloaded_files (they're stale noise from a prior URL).
  3. Extension match guard — reject downloaded files whose extension doesn't match the URL's expected document type (e.g., an .xlsx file on a .pdf URL).

These heuristics work but are imperfect — they can't handle the case where two consecutive URLs are both documents. A proper fix in crawl4ai would eliminate the need for downstream workarounds entirely.
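For reference, the three guards can be sketched roughly like this (`trusted_downloads`, `DOC_EXTS`, and the result fields used are our naming for illustration, not crawl4ai API):

```python
import os

seen_downloads: set = set()
DOC_EXTS = {".pdf", ".xlsx", ".docx"}   # assumed document types for illustration

def trusted_downloads(result):
    """Apply the three post-crawl guards to result.downloaded_files."""
    files = result.downloaded_files or []
    # Guard 2: a page that loaded with non-empty markdown shouldn't claim downloads
    if result.success and result.markdown:
        return []
    url_ext = os.path.splitext(result.url.split("?")[0])[1].lower()
    kept = []
    for path in files:
        # Guard 1: skip files already attributed earlier in the stream
        if path in seen_downloads:
            continue
        # Guard 3: extension must match the URL's expected document type
        if url_ext in DOC_EXTS and os.path.splitext(path)[1].lower() != url_ext:
            continue
        seen_downloads.add(path)
        kept.append(path)
    return kept
```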

Is this reproducible?

Yes

Inputs Causing the Bug

Steps to Reproduce

1. Configure a crawl with `accept_downloads=True` and a `BestFirstCrawlingStrategy` (or any multi-URL crawl).
2. Include at least one URL that triggers a browser download (e.g., a direct link to a large XLSX or PDF — large enough that the download takes >100ms).
3. Include at least one subsequent URL that is a normal HTML page.
4. Observe that the HTML page's `CrawlResult.downloaded_files` contains the file from the previous URL.

Minimal repro sketch:


import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig(
        accept_downloads=True,
        downloads_path="/tmp/crawl4ai-downloads",
    )
    run_config = CrawlerRunConfig()

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # URL A: triggers a slow download
        result_a = await crawler.arun("https://example.com/large-file.xlsx", config=run_config)
        # URL B: normal HTML page, crawled immediately after
        result_b = await crawler.arun("https://example.com/normal-page.html", config=run_config)

        print(f"A downloaded_files: {result_a.downloaded_files}")  # Often None (download hadn't finished)
        print(f"B downloaded_files: {result_b.downloaded_files}")  # Contains large-file.xlsx!

asyncio.run(main())


The race is timing-dependent, so it may not reproduce on every run — it depends on download speed relative to the next URL's navigation time. It is **very** reproducible in deep-crawl/streaming mode where URLs are processed back-to-back.

OS

Ubuntu 24.04

Python version

3.12

Browser

Chromium

Browser version

143.0.7499.4

Error logs & Screenshots (if applicable)

No response
