[Bug]: downloaded_files race condition causes cross-contamination between CrawlResults #1889
Description
Summary
When accept_downloads=True and crawling multiple URLs (via arun with a deep crawl strategy, or arun_many), the downloaded_files list on a CrawlResult can contain files triggered by a previous URL's navigation. A slow download started by URL A completes after URL B begins processing, causing the file to appear on URL B's result instead of (or in addition to) URL A's.
`AsyncPlaywrightCrawlerStrategy._downloaded_files` is a mutable list on `self`. The lifecycle is:

1. Reset at the top of `_crawl_web` (line ~525): `self._downloaded_files = []`
2. Populated by a fire-and-forget download listener registered per page (line ~658):

```python
page.on(
    "download",
    lambda download: asyncio.create_task(self._handle_download(download)),
)
```

3. Read at the end of `_crawl_web` when building the response (line ~1059):

```python
downloaded_files=(
    self._downloaded_files if self._downloaded_files else None
),
```
The problem is step 2: `_handle_download` runs in a fire-and-forget `asyncio.create_task`, concurrently with the rest of the crawl. When a download is slow (large file, slow server), the sequence becomes:
1. URL A navigation starts → browser triggers a download for URL A
2. URL A's `_crawl_web` returns → `downloaded_files = []` (download still in progress)
3. URL B's `_crawl_web` starts → `self._downloaded_files = []` (reset)
4. URL A's download completes → `_handle_download` appends to `self._downloaded_files`
5. URL B's `_crawl_web` returns → `downloaded_files = [URL_A's_file]` ← WRONG
The file that belongs to URL A ends up on URL B's CrawlResult.downloaded_files.
In batch/deep-crawl mode the page is reused across URLs, and the download listener's closure captures self (not a snapshot of the list), so late-completing downloads always append to whichever self._downloaded_files is current at completion time.
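The race is easy to reproduce in isolation. The following is a minimal simulation of the mechanism described above, with no Playwright involved; `FakeCrawler`, `crawl`, and the timings are illustrative stand-ins, not crawl4ai's actual internals:

```python
import asyncio

# Minimal simulation of the race: a shared mutable list on the crawler
# object, a fire-and-forget download task, and two back-to-back "crawls".
class FakeCrawler:
    def __init__(self):
        self._downloaded_files = []
        self._task = None  # keep a reference so the task isn't garbage-collected

    async def _handle_download(self, path, delay):
        await asyncio.sleep(delay)            # slow download in flight
        self._downloaded_files.append(path)   # appends to whichever list is current NOW

    async def crawl(self, url, download=None, delay=0.0):
        self._downloaded_files = []           # reset, as at the top of _crawl_web
        if download:
            # fire-and-forget, as in the real listener
            self._task = asyncio.create_task(self._handle_download(download, delay))
        await asyncio.sleep(0.05)             # simulated page processing
        files = list(self._downloaded_files)
        return {"url": url, "downloaded_files": files or None}

async def demo():
    c = FakeCrawler()
    a = await c.crawl("https://a.example/file.xlsx", download="file.xlsx", delay=0.08)
    b = await c.crawl("https://b.example/page.html")
    return a, b

a, b = asyncio.run(demo())
print(a["downloaded_files"])  # None: the download hadn't finished
print(b["downloaded_files"])  # ['file.xlsx']: URL A's file lands on URL B's result
```

The download (80 ms) outlives URL A's processing window (50 ms) and completes during URL B's window, so the file is recorded against B.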
Compounding factor: duplicate file references
Because the list is reset but the listener persists, the same downloaded file can also appear on multiple subsequent CrawlResult objects if another download event fires (or the listener is re-registered without removing the old one).
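The duplicate-listener effect can be seen with a toy event emitter; this is a sketch of the dispatch semantics only, not Playwright's actual event API:

```python
# Toy emitter: each registered callback receives every event, so registering
# a second listener without removing the first doubles the appends.
class Emitter:
    def __init__(self):
        self._listeners = []

    def on(self, event, callback):
        self._listeners.append(callback)

    def emit(self, payload):
        for cb in list(self._listeners):
            cb(payload)

files = []
page = Emitter()
page.on("download", lambda f: files.append(f))  # registered for URL A
page.on("download", lambda f: files.append(f))  # re-registered for URL B, old one never removed
page.emit("report.pdf")
print(files)  # ['report.pdf', 'report.pdf']: one download event, two appends
```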
Impact
Any consumer that trusts CrawlResult.downloaded_files to identify what document a given URL produced will misattribute content. In our use case (grant document scraping), this caused:
- PDF content attributed to an `.aspx` HTML page
- XLSX scoring template content attributed to a PDF URL
- HTML pages misclassified as binary documents
- Extracted text from the wrong document entirely, silently corrupting downstream data
Suggested fix
The root cause is that _downloaded_files is shared mutable state on self with no synchronization between the fire-and-forget download task and the per-URL crawl lifecycle.
Option A: Scope downloads per-URL with a unique token
Give each _crawl_web invocation a unique ID. Pass it into _handle_download via the closure. Store downloads in a dict[str, list[str]] keyed by invocation ID. Read only the current invocation's list when building the response.
```python
# In _crawl_web:
crawl_id = uuid4().hex
self._downloads_by_crawl[crawl_id] = []
page.on(
    "download",
    lambda download, _id=crawl_id: asyncio.create_task(
        self._handle_download(download, _id)
    ),
)

# ... later, when building the response:
downloaded_files = self._downloads_by_crawl.pop(crawl_id, []) or None
```

Option B: Await pending downloads before returning
After page processing completes but before building the AsyncCrawlResponse, await any in-flight download tasks. This ensures that downloads triggered by URL A are captured on URL A's result, not URL B's.
```python
# Track download tasks instead of firing and forgetting them:
task = asyncio.create_task(self._handle_download(download))
self._pending_downloads.append(task)

# Before returning AsyncCrawlResponse:
if self._pending_downloads:
    await asyncio.gather(*self._pending_downloads, return_exceptions=True)
    self._pending_downloads.clear()
```

Option C: Attach downloads to the navigation request
Associate each download with the navigation that triggered it via Playwright's Download.page property and cross-reference with the current page URL. This is the most precise but requires careful handling of redirects.
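A rough sketch of that attribution logic, using stand-in types (`FakePage`, `FakeDownload`, `attribute` are ours; only the page backreference mirrors Playwright's `Download.page`), with redirect handling deliberately omitted:

```python
from dataclasses import dataclass

@dataclass
class FakePage:
    url: str

@dataclass
class FakeDownload:
    page: FakePage   # stands in for Playwright's Download.page backreference
    path: str

def attribute(downloads, results_by_url):
    """Credit each download to the result for the page that triggered it."""
    for d in downloads:
        result = results_by_url.get(d.page.url)
        if result is not None:
            result.setdefault("downloaded_files", []).append(d.path)
        # else: unmatched, e.g. a redirect changed page.url; needs extra care

results = {"https://a.example/doc.pdf": {}, "https://b.example/page.html": {}}
attribute([FakeDownload(FakePage("https://a.example/doc.pdf"), "/tmp/doc.pdf")], results)
print(results["https://a.example/doc.pdf"])    # {'downloaded_files': ['/tmp/doc.pdf']}
print(results["https://b.example/page.html"])  # {}: the HTML page stays clean
```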
Our workaround
We implemented three guards in our post-crawl processing layer:
- `seen_downloads` set — accumulate processed file paths across the stream; skip files already seen.
- HTML page guard — if the page loaded successfully with non-empty markdown, ignore any `downloaded_files` (they're stale noise from a prior URL).
- Extension match guard — reject downloaded files whose extension doesn't match the URL's expected document type (e.g., an `.xlsx` file on a `.pdf` URL).
These heuristics work but are imperfect — they can't handle the case where two consecutive URLs are both documents. A proper fix in crawl4ai would eliminate the need for downstream workarounds entirely.
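For reference, the three guards can be sketched roughly as follows (the function name and signature are ours, not crawl4ai's; real code would need a richer URL-to-type mapping):

```python
from pathlib import Path

def accept_download(url, file_path, page_markdown, seen_downloads):
    path = Path(file_path)
    # Guard 1: skip files already processed earlier in the stream.
    if str(path) in seen_downloads:
        return False
    # Guard 2: a page that rendered real markdown shouldn't own downloads;
    # anything attached to it is stale noise from a prior URL.
    if page_markdown and page_markdown.strip():
        return False
    # Guard 3: the file's extension must match the URL's document type.
    url_ext = Path(url.split("?", 1)[0]).suffix.lower()
    if url_ext and path.suffix.lower() != url_ext:
        return False
    seen_downloads.add(str(path))
    return True

seen = set()
print(accept_download("https://x.example/doc.pdf", "/tmp/doc.pdf", "", seen))          # True
print(accept_download("https://x.example/page.aspx", "/tmp/doc.pdf", "# Page", seen))  # False
```

The second call is rejected twice over: the file was already seen, and the page produced markdown.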
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
1. Configure a crawl with `accept_downloads=True` and a `BestFirstCrawlStrategy` (or any multi-URL crawl).
2. Include at least one URL that triggers a browser download (e.g., a direct link to a large XLSX or PDF — large enough that the download takes >100ms).
3. Include at least one subsequent URL that is a normal HTML page.
4. Observe that the HTML page's `CrawlResult.downloaded_files` contains the file from the previous URL.
Minimal repro sketch:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig(
        accept_downloads=True,
        downloads_path="/tmp/crawl4ai-downloads",
    )
    run_config = CrawlerRunConfig()
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # URL A: triggers a slow download
        result_a = await crawler.arun("https://example.com/large-file.xlsx", config=run_config)
        # URL B: normal HTML page, crawled immediately after
        result_b = await crawler.arun("https://example.com/normal-page.html", config=run_config)
        print(f"A downloaded_files: {result_a.downloaded_files}")  # Often None (download hadn't finished)
        print(f"B downloaded_files: {result_b.downloaded_files}")  # Contains large-file.xlsx!

asyncio.run(main())
```
The race is timing-dependent, so it may not reproduce on every run — it depends on download speed relative to the next URL's navigation time. It is **very** reproducible in deep-crawl/streaming mode where URLs are processed back-to-back.

Code snippets
OS
Ubuntu 24.04
Python version
3.12
Browser
Chromium
Browser version
143.0.7499.4
Error logs & Screenshots (if applicable)
No response