[Feature Request]: Document-aware content processing in the crawl pipeline #1890

bkennedy-improving · 2026-03-31T22:32:34Z

bkennedy-improving
Mar 31, 2026

What needs to be done?

crawl4ai's pipeline assumes every URL yields HTML that can be parsed by ContentScrapingStrategy and converted to markdown by MarkdownGenerationStrategy. When a URL points to a binary document (PDF, DOCX, XLSX, etc.), the pipeline either fails silently (empty markdown) or produces garbage output from the browser's built-in viewer chrome. There is no hook for users to intercept document URLs and route them through an appropriate extraction backend (e.g., Kreuzberg, PyMuPDF, Docling) before the content scraping phase.

Current behavior

What happens when crawl4ai navigates to a document URL

There are three outcomes depending on the document type and browser behavior:

1. Browser triggers a download (ERR_ABORTED)

For file types the browser can't render (XLSX, DOCX, DOC, XLS, PPTX, etc.), Chromium aborts navigation and triggers a download event. crawl4ai catches the net::ERR_ABORTED error and sets response = None:

# async_crawler_strategy.py, line ~710
except Error as e:
    if 'net::ERR_ABORTED' in str(e) and self.browser_config.accept_downloads:
        # Navigation aborted, likely due to file download
        response = None

The CrawlResult has success=False (or success=True with empty HTML), and the file path appears in downloaded_files. The ContentScrapingStrategy receives an empty HTML string — it produces no useful output.

2. Browser renders inline (Chrome PDF viewer)

For PDFs, Chromium renders them in its built-in viewer. The page.content() call returns the PDF viewer's HTML wrapper — not the PDF's text content. The ContentScrapingStrategy and MarkdownGenerationStrategy then process this viewer chrome, producing markdown like:

# Chrome PDF Viewer
[zoom in] [zoom out] [rotate] ...

The actual PDF text is inaccessible through the DOM.

3. Navigation fails entirely

For some server configurations, navigating to a document URL returns an error (e.g., the server sends Content-Disposition: attachment without a renderable response). The CrawlResult has success=False with no HTML and no downloaded_files.

The pipeline gap

The current pipeline is:

URL → Browser Navigation → page.content() → HTML
                                                ↓
                            ContentScrapingStrategy.scrap(url, html)
                                                ↓
                            MarkdownGenerationStrategy.generate_markdown(...)
                                                ↓
                                          CrawlResult

There is no step between navigation and content scraping where the pipeline can:

Detect that the response is a binary document (via Content-Type headers, downloaded_files, URL extension, or ERR_ABORTED).
Extract text from the document using an appropriate backend.
Produce a CrawlResult with meaningful markdown content.

Users who need to crawl sites that mix HTML pages and document links must implement all of this detection and extraction logic after receiving the CrawlResult — effectively reimplementing a parallel content pipeline outside of crawl4ai.

Real-world impact

Government grant portals (our use case) are a representative example: a single grant listing page links to PDFs (funding announcements), XLSX files (application forms, scoring templates), DOCX files (guidelines), and HTML pages (FAQs, contact info). All of these need to be crawled and their text extracted.

With the current architecture, we had to build a document_aware_stream wrapper that:

Intercepts each CrawlResult from the crawl4ai stream.
Checks three signals: downloaded_files, response Content-Type headers, and URL extension + failed navigation.
Downloads the document via Playwright's APIRequestContext (for signals 2–3).
Extracts text via Kreuzberg (a Python library that handles PDF, DOCX, XLSX, PPTX, DOC, XLS, etc.).
Wraps the result in a DocumentAwareResult dataclass that normalizes the interface for both HTML and document results.

This works but has significant downsides:

The detection heuristics are fragile and duplicate knowledge about crawl4ai internals (see related issue: downloaded_files race condition).
Document URLs don't benefit from crawl4ai's built-in caching, markdown generation, or content filtering.
Link extraction from documents is impossible (documents are opaque to the ContentScrapingStrategy).
The workaround runs after the full HTML pipeline has already executed (wasted computation for document URLs).

What problem does this solve?

This is a feature request to add first-class document detection and extraction as a configurable part of the crawl pipeline.

Target users/beneficiaries

Any user that wants to crawl binary files, or other non-browser rendered files that are linked to. The internet is made of up documents and links, and regardless of whether the documents are rendered in browser, they fact they are linked makes them a useful target for scraping information.

Current alternatives/workarounds

What we built (external wrapper)

Our current solution wraps the crawl4ai result stream in a document_aware_stream async generator that post-processes each CrawlResult. This works but requires duplicating detection logic, working around the downloaded_files race condition (see separate issue), and processing documents after the HTML pipeline has already run.

Proposed approach

Suggested solution

Approach: Document detection + pluggable extraction before `ContentScrapingStrategy`

Add a new optional stage to the pipeline that runs between browser navigation and content scraping:

URL → Browser Navigation → AsyncCrawlResponse
                                    ↓
                        [NEW] DocumentExtractionStrategy.detect(response)
                                    ↓
                    ┌───────────────┴───────────────┐
                    │ is_document=True               │ is_document=False
                    ↓                                ↓
        DocumentExtractionStrategy          ContentScrapingStrategy
              .extract(response)                  .scrap(url, html)
                    ↓                                ↓
              raw text/markdown              ScrapingResult (as today)
                    ↓                                ↓
        MarkdownGenerationStrategy          MarkdownGenerationStrategy
              (optional — may skip             .generate_markdown(...)
               if already markdown)
                    ↓                                ↓
              CrawlResult                      CrawlResult
          (is_document=True)               (is_document=False)

Proposed interface

from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class DocumentExtractionResult:
    """Result of extracting text from a binary document."""
    content: str                    # Extracted text (markdown or plain text)
    content_type: str               # MIME type or extension (e.g., 'application/pdf', 'pdf')
    source_path: Optional[Path]     # Local file path if downloaded (for cleanup)


class DocumentExtractionStrategy(ABC):
    """Strategy for detecting and extracting text from binary documents."""

    @abstractmethod
    def detect(self, response: "AsyncCrawlResponse") -> bool:
        """Return True if the response represents a binary document.

        Implementations can check:
        - response.downloaded_files
        - response.response_headers (Content-Type)
        - URL extension
        - response.status_code / empty HTML
        """
        ...

    @abstractmethod
    async def extract(self, response: "AsyncCrawlResponse", url: str) -> DocumentExtractionResult:
        """Extract text content from the document.

        The response may contain a downloaded file path, or the
        implementation may need to download the document itself.
        """
        ...

Configuration

Add document_extraction_strategy to CrawlerRunConfig:

class CrawlerRunConfig:
    document_extraction_strategy: Optional[DocumentExtractionStrategy] = None

When set, aprocess_html (or the caller in arun) checks detect() before calling scraping_strategy.scrap(). If the response is a document, it calls extract() instead and builds the CrawlResult from the extraction output.

When None (default), behavior is identical to today — no breaking change.

Integration point in `arun`

In async_webcrawler.py, after receiving async_response from crawler_strategy.crawl() but before calling aprocess_html:

# After line ~408 in async_webcrawler.py
doc_strategy = config.document_extraction_strategy
if doc_strategy and doc_strategy.detect(async_response):
    doc_result = await doc_strategy.extract(async_response, url)
    # Build CrawlResult with document content
    crawl_result = CrawlResult(
        url=url,
        html="",  # No HTML for documents
        success=True,
        markdown=MarkdownGenerationResult(
            raw_markdown=doc_result.content,
            # fit_markdown can be generated via the configured
            # MarkdownGenerationStrategy if desired
        ),
        downloaded_files=async_response.downloaded_files,
        response_headers=async_response.response_headers,
        status_code=async_response.status_code,
        metadata={"is_document": True, "content_type": doc_result.content_type},
    )
else:
    # Existing HTML processing path
    crawl_result = await self.aprocess_html(...)

Benefits

No breaking changes — the strategy is optional and defaults to None.
Pluggable backends — users can implement DocumentExtractionStrategy with Kreuzberg, PyMuPDF, Docling, Unstructured, or any other extraction library.
Early detection — documents are identified before wasting cycles on HTML scraping.
Unified output — both HTML pages and documents produce CrawlResult objects with populated markdown, enabling downstream code (caching, content filtering, extraction strategies) to work uniformly.
Metadata — CrawlResult.metadata["is_document"] lets consumers distinguish documents from HTML pages when needed.
Fixes the downloaded_files pipeline gap — detection logic moves into crawl4ai where it has full access to the AsyncCrawlResponse, eliminating the need for users to reverse-engineer detection heuristics.

Example implementation (using Kreuzberg)

from kreuzberg import extract_file
from crawl4ai import DocumentExtractionStrategy, DocumentExtractionResult

class KreuzbergDocumentStrategy(DocumentExtractionStrategy):
    """Extract text from binary documents using Kreuzberg."""

    # Known document Content-Types
    DOCUMENT_TYPES = {
        "application/pdf", "application/msword",
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "application/vnd.ms-excel",
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        # ... etc
    }

    # Known document extensions
    DOCUMENT_EXTENSIONS = {".pdf", ".doc", ".docx", ".xls", ".xlsx", ".pptx", ".ppt"}

    def detect(self, response) -> bool:
        # Signal 1: Browser downloaded a file
        if response.downloaded_files:
            return True

        # Signal 2: Document Content-Type header
        ct = (response.response_headers or {}).get("content-type", "").split(";")[0].strip()
        if ct in self.DOCUMENT_TYPES:
            return True

        # Signal 3: URL has document extension + navigation failed
        from pathlib import PurePosixPath
        from urllib.parse import urlparse
        ext = PurePosixPath(urlparse(response.redirected_url or "").path).suffix.lower()
        if ext in self.DOCUMENT_EXTENSIONS:
            return True

        return False

    async def extract(self, response, url) -> DocumentExtractionResult:
        if response.downloaded_files:
            path = Path(response.downloaded_files[0])
        else:
            # Download via Playwright or requests
            path = await self._download(url)

        result = await extract_file(str(path))
        return DocumentExtractionResult(
            content=result.content,
            content_type=path.suffix.lstrip("."),
            source_path=path,
        )

Usage

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(
        "https://example.gov/grants",
        config=CrawlerRunConfig(
            deep_crawl_strategy=BestFirstCrawlStrategy(max_depth=2),
            document_extraction_strategy=KreuzbergDocumentStrategy(),
        ),
    )

Alternatives considered

Subclassing `ContentScrapingStrategy`

We initially explored making a DocumentAwareScrapingStrategy that overrides scrap() to detect documents before parsing HTML. This doesn't work well because:

ContentScrapingStrategy.scrap() receives (url, html) — by the time it's called, the HTML has already been fetched and the response headers / downloaded_files are not available.
Document detection requires access to the AsyncCrawlResponse, which is not passed to the scraping strategy.
The scraping strategy is synchronous (scrap), but document extraction typically requires async I/O.

Post-processing hook

We also considered using crawl4ai's hook system (after_goto, etc.) to inject document detection. Hooks can modify page state but don't have a clean way to short-circuit the HTML processing pipeline or replace the CrawlResult content.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Feature Request]: Document-aware content processing in the crawl pipeline #1890

Uh oh!

{{title}}

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

Uh oh!

[Feature Request]: Document-aware content processing in the crawl pipeline #1890

Uh oh!

bkennedy-improving Mar 31, 2026

What needs to be done?

Current behavior

What happens when crawl4ai navigates to a document URL

The pipeline gap

Real-world impact

What problem does this solve?

Target users/beneficiaries

Current alternatives/workarounds

What we built (external wrapper)

Proposed approach

Suggested solution

Approach: Document detection + pluggable extraction before ContentScrapingStrategy

Proposed interface

Configuration

Integration point in arun

Benefits

Example implementation (using Kreuzberg)

Usage

Alternatives considered

Subclassing ContentScrapingStrategy

Post-processing hook

Replies: 0 comments

bkennedy-improving
Mar 31, 2026

Approach: Document detection + pluggable extraction before `ContentScrapingStrategy`

Integration point in `arun`

Subclassing `ContentScrapingStrategy`