[Bug]: downloaded_files race condition causes cross-contamination between CrawlResults #1889
Description
Summary
When accept_downloads=True and crawling multiple URLs (via arun with a deep crawl strategy, or arun_many), the downloaded_files list on a CrawlResult can contain files triggered by a previous URL's navigation. A slow download started by URL A completes after URL B begins processing, causing the file to appear on URL B's result instead of (or in addition to) URL A's.
`AsyncPlaywrightCrawlerStrategy._downloaded_files` is a mutable list on `self`. The lifecycle is:

1. Reset at the top of `_crawl_web` (line ~525): `self._downloaded_files = []`
2. Populated by a fire-and-forget download listener registered per page (line ~658):

```python
page.on(
    "download",
    lambda download: asyncio.create_task(self._handle_download(download)),
)
```

3. Read at the end of `_crawl_web` when building the response (line ~1059):

```python
downloaded_files=(
    self._downloaded_files if self._downloaded_files else None
),
```
The problem is step 2: `_handle_download` runs in a fire-and-forget `asyncio.create_task`, concurrently with the rest of the crawl. When a download is slow (large file, slow server), the sequence becomes:
1. URL A navigation starts → browser triggers a download for URL A
2. URL A's `_crawl_web` returns → `downloaded_files = []` (download still in progress)
3. URL B's `_crawl_web` starts → `self._downloaded_files = []` (reset)
4. URL A's download completes → `_handle_download` appends to `self._downloaded_files`
5. URL B's `_crawl_web` returns → `downloaded_files = [URL_A's_file]` ← WRONG
The file that belongs to URL A ends up on URL B's CrawlResult.downloaded_files.
In batch/deep-crawl mode the page is reused across URLs, and the download listener's closure captures self (not a snapshot of the list), so late-completing downloads always append to whichever self._downloaded_files is current at completion time.
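The race is easy to reproduce in isolation. The following is a minimal simulation of the mechanism described above, with no Playwright involved; `FakeCrawler`, `crawl`, and the timings are illustrative stand-ins, not crawl4ai's actual internals:

```python
import asyncio

# Minimal simulation of the race: a shared mutable list on the crawler
# object, a fire-and-forget download task, and two back-to-back "crawls".
class FakeCrawler:
    def __init__(self):
        self._downloaded_files = []
        self._task = None  # keep a reference so the task isn't garbage-collected

    async def _handle_download(self, path, delay):
        await asyncio.sleep(delay)            # slow download in flight
        self._downloaded_files.append(path)   # appends to whichever list is current NOW

    async def crawl(self, url, download=None, delay=0.0):
        self._downloaded_files = []           # reset, as at the top of _crawl_web
        if download:
            # fire-and-forget, as in the real listener
            self._task = asyncio.create_task(self._handle_download(download, delay))
        await asyncio.sleep(0.05)             # simulated page processing
        files = list(self._downloaded_files)
        return {"url": url, "downloaded_files": files or None}

async def demo():
    c = FakeCrawler()
    a = await c.crawl("https://a.example/file.xlsx", download="file.xlsx", delay=0.08)
    b = await c.crawl("https://b.example/page.html")
    return a, b

a, b = asyncio.run(demo())
print(a["downloaded_files"])  # None: the download hadn't finished
print(b["downloaded_files"])  # ['file.xlsx']: URL A's file lands on URL B's result
```

The download (80 ms) outlives URL A's processing window (50 ms) and completes during URL B's window, so the file is recorded against B.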
Compounding factor: duplicate file references
Because the list is reset but the listener persists, the same downloaded file can also appear on multiple subsequent CrawlResult objects if another download event fires (or the listener is re-registered without removing the old one).
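The duplicate-listener effect can be seen with a toy event emitter; this is a sketch of the dispatch semantics only, not Playwright's actual event API:

```python
# Toy emitter: each registered callback receives every event, so registering
# a second listener without removing the first doubles the appends.
class Emitter:
    def __init__(self):
        self._listeners = []

    def on(self, event, callback):
        self._listeners.append(callback)

    def emit(self, payload):
        for cb in list(self._listeners):
            cb(payload)

files = []
page = Emitter()
page.on("download", lambda f: files.append(f))  # registered for URL A
page.on("download", lambda f: files.append(f))  # re-registered for URL B, old one never removed
page.emit("report.pdf")
print(files)  # ['report.pdf', 'report.pdf']: one download event, two appends
```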
Impact
Any consumer that trusts CrawlResult.downloaded_files to identify what document a given URL produced will misattribute content. In our use case (grant document scraping), this caused:
- PDF content attributed to an `.aspx` HTML page
- XLSX scoring template content attributed to a PDF URL
- HTML pages misclassified as binary documents
- Extracted text from the wrong document entirely, silently corrupting downstream data
Suggested fix
The root cause is that _downloaded_files is shared mutable state on self with no synchronization between the fire-and-forget download task and the per-URL crawl lifecycle.
Option A: Scope downloads per-URL with a unique token
Give each _crawl_web invocation a unique ID. Pass it into _handle_download via the closure. Store downloads in a dict[str, list[str]] keyed by invocation ID. Read only the current invocation's list when building the response.
```python
# In _crawl_web:
crawl_id = uuid4().hex
self._downloads_by_crawl[crawl_id] = []
page.on(
    "download",
    lambda download, _id=crawl_id: asyncio.create_task(
        self._handle_download(download, _id)
    ),
)

# ... later, when building the response:
downloaded_files = self._downloads_by_crawl.pop(crawl_id, []) or None
```

Option B: Await pending downloads before returning
After page processing completes but before building the AsyncCrawlResponse, await any in-flight download tasks. This ensures that downloads triggered by URL A are captured on URL A's result, not URL B's.
```python
# Track download tasks instead of firing and forgetting them:
task = asyncio.create_task(self._handle_download(download))
self._pending_downloads.append(task)

# Before returning AsyncCrawlResponse:
if self._pending_downloads:
    await asyncio.gather(*self._pending_downloads, return_exceptions=True)
    self._pending_downloads.clear()
```

Option C: Attach downloads to the navigation request
Associate each download with the navigation that triggered it via Playwright's Download.page property and cross-reference with the current page URL. This is the most precise but requires careful handling of redirects.
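A rough sketch of that attribution logic, using stand-in types (`FakePage`, `FakeDownload`, `attribute` are ours; only the page backreference mirrors Playwright's `Download.page`), with redirect handling deliberately omitted:

```python
from dataclasses import dataclass

@dataclass
class FakePage:
    url: str

@dataclass
class FakeDownload:
    page: FakePage   # stands in for Playwright's Download.page backreference
    path: str

def attribute(downloads, results_by_url):
    """Credit each download to the result for the page that triggered it."""
    for d in downloads:
        result = results_by_url.get(d.page.url)
        if result is not None:
            result.setdefault("downloaded_files", []).append(d.path)
        # else: unmatched, e.g. a redirect changed page.url; needs extra care

results = {"https://a.example/doc.pdf": {}, "https://b.example/page.html": {}}
attribute([FakeDownload(FakePage("https://a.example/doc.pdf"), "/tmp/doc.pdf")], results)
print(results["https://a.example/doc.pdf"])    # {'downloaded_files': ['/tmp/doc.pdf']}
print(results["https://b.example/page.html"])  # {}: the HTML page stays clean
```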
Our workaround
We implemented three guards in our post-crawl processing layer:
- `seen_downloads` set — accumulate processed file paths across the stream; skip files already seen.
- HTML page guard — if the page loaded successfully with non-empty markdown, ignore any `downloaded_files` (they're stale noise from a prior URL).
- Extension match guard — reject downloaded files whose extension doesn't match the URL's expected document type (e.g., an `.xlsx` file on a `.pdf` URL).
These heuristics work but are imperfect — they can't handle the case where two consecutive URLs are both documents. A proper fix in crawl4ai would eliminate the need for downstream workarounds entirely.
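For reference, the three guards can be sketched roughly as follows (the function name and signature are ours, not crawl4ai's; real code would need a richer URL-to-type mapping):

```python
from pathlib import Path

def accept_download(url, file_path, page_markdown, seen_downloads):
    path = Path(file_path)
    # Guard 1: skip files already processed earlier in the stream.
    if str(path) in seen_downloads:
        return False
    # Guard 2: a page that rendered real markdown shouldn't own downloads;
    # anything attached to it is stale noise from a prior URL.
    if page_markdown and page_markdown.strip():
        return False
    # Guard 3: the file's extension must match the URL's document type.
    url_ext = Path(url.split("?", 1)[0]).suffix.lower()
    if url_ext and path.suffix.lower() != url_ext:
        return False
    seen_downloads.add(str(path))
    return True

seen = set()
print(accept_download("https://x.example/doc.pdf", "/tmp/doc.pdf", "", seen))          # True
print(accept_download("https://x.example/page.aspx", "/tmp/doc.pdf", "# Page", seen))  # False
```

The second call is rejected twice over: the file was already seen, and the page produced markdown.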
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
1. Configure a crawl with `accept_downloads=True` and a `BestFirstCrawlStrategy` (or any multi-URL crawl).
2. Include at least one URL that triggers a browser download (e.g., a direct link to a large XLSX or PDF — large enough that the download takes >100ms).
3. Include at least one subsequent URL that is a normal HTML page.
4. Observe that the HTML page's `CrawlResult.downloaded_files` contains the file from the previous URL.
Minimal repro sketch:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig(
        accept_downloads=True,
        downloads_path="/tmp/crawl4ai-downloads",
    )
    run_config = CrawlerRunConfig()
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # URL A: triggers a slow download
        result_a = await crawler.arun("https://example.com/large-file.xlsx", config=run_config)
        # URL B: normal HTML page, crawled immediately after
        result_b = await crawler.arun("https://example.com/normal-page.html", config=run_config)
        print(f"A downloaded_files: {result_a.downloaded_files}")  # Often None (download hadn't finished)
        print(f"B downloaded_files: {result_b.downloaded_files}")  # Contains large-file.xlsx!

asyncio.run(main())
```
The race is timing-dependent, so it may not reproduce on every run — it depends on download speed relative to the next URL's navigation time. It is **very** reproducible in deep-crawl/streaming mode where URLs are processed back-to-back.

Code snippets
OS
Ubuntu 24.04
Python version
3.12
Browser
Chromium
Browser version
143.0.7499.4
Error logs & Screenshots (if applicable)
No response