[Feature Request]: Document-aware content processing in the crawl pipeline #1890
bkennedy-improving
started this conversation in Feature requests
What needs to be done?
crawl4ai's pipeline assumes every URL yields HTML that can be parsed by `ContentScrapingStrategy` and converted to markdown by `MarkdownGenerationStrategy`. When a URL points to a binary document (PDF, DOCX, XLSX, etc.), the pipeline either fails silently (empty markdown) or produces garbage output from the browser's built-in viewer chrome. There is no hook for users to intercept document URLs and route them through an appropriate extraction backend (e.g., Kreuzberg, PyMuPDF, Docling) before the content scraping phase.
Current behavior
What happens when crawl4ai navigates to a document URL
There are three outcomes depending on the document type and browser behavior:
1. Browser triggers a download (`ERR_ABORTED`)
For file types the browser can't render (XLSX, DOCX, DOC, XLS, PPTX, etc.), Chromium aborts navigation and triggers a download event. crawl4ai catches the `net::ERR_ABORTED` error and sets `response = None`. The `CrawlResult` has `success=False` (or `success=True` with empty HTML), and the file path appears in `downloaded_files`. The `ContentScrapingStrategy` receives an empty HTML string — it produces no useful output.
2. Browser renders inline (Chrome PDF viewer)
For PDFs, Chromium renders them in its built-in viewer. The `page.content()` call returns the PDF viewer's HTML wrapper — not the PDF's text content. The `ContentScrapingStrategy` and `MarkdownGenerationStrategy` then process this viewer chrome, producing markdown derived from the viewer UI rather than the document. The actual PDF text is inaccessible through the DOM.
3. Navigation fails entirely
For some server configurations, navigating to a document URL returns an error (e.g., the server sends `Content-Disposition: attachment` without a renderable response). The `CrawlResult` has `success=False` with no HTML and no `downloaded_files`.
The pipeline gap
The current pipeline is:
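In outline (simplified; stage names as used elsewhere in this request):

```
navigate (crawler_strategy.crawl)  ->  AsyncCrawlResponse
        |
aprocess_html
        |
ContentScrapingStrategy.scrap(url, html)
        |
MarkdownGenerationStrategy
        |
CrawlResult
```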
There is no step between navigation and content scraping where the pipeline can:
- Detect that the response is a document (via `downloaded_files`, URL extension, or `ERR_ABORTED`).
- Extract its text and build a `CrawlResult` with meaningful markdown content.

Users who need to crawl sites that mix HTML pages and document links must implement all of this detection and extraction logic after receiving the `CrawlResult` — effectively reimplementing a parallel content pipeline outside of crawl4ai.
Real-world impact
Government grant portals (our use case) are a representative example: a single grant listing page links to PDFs (funding announcements), XLSX files (application forms, scoring templates), DOCX files (guidelines), and HTML pages (FAQs, contact info). All of these need to be crawled and their text extracted.
With the current architecture, we had to build a `document_aware_stream` wrapper that:
- consumes each `CrawlResult` from the crawl4ai stream;
- detects documents using `downloaded_files`, response `Content-Type` headers, and URL extension + failed navigation;
- re-fetches document bytes via `APIRequestContext` (for signals 2–3);
- wraps output in a `DocumentAwareResult` dataclass that normalizes the interface for both HTML and document results.

This works but has significant downsides:
- detection heuristics must be duplicated outside crawl4ai;
- it has to work around a timing issue (the `downloaded_files` race condition);
- documents are processed only after the HTML pipeline has already run (bypassing `ContentScrapingStrategy`).
What problem does this solve?
This is a feature request to add first-class document detection and extraction as a configurable part of the crawl pipeline.
Target users/beneficiaries
Any user who wants to crawl binary files, or other non-browser-rendered files that are linked to. The internet is made up of documents and links, and regardless of whether the documents are rendered in the browser, the fact that they are linked makes them a useful target for scraping information.
Current alternatives/workarounds
What we built (external wrapper)
Our current solution wraps the crawl4ai result stream in a `document_aware_stream` async generator that post-processes each `CrawlResult`. This works but requires duplicating detection logic, working around the `downloaded_files` race condition (see separate issue), and processing documents after the HTML pipeline has already run.
Proposed approach
Suggested solution
Approach: Document detection + pluggable extraction before `ContentScrapingStrategy`
Add a new optional stage to the pipeline that runs between browser navigation and content scraping:
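With the new stage inserted, the flow would look roughly like this (a sketch, not a fixed design):

```
navigate (crawler_strategy.crawl)  ->  AsyncCrawlResponse
        |
DocumentExtractionStrategy.detect()?   # new optional stage
        |
   yes  -->  DocumentExtractionStrategy.extract()  -->  CrawlResult
        |
   no   -->  aprocess_html  ->  ContentScrapingStrategy
                            ->  MarkdownGenerationStrategy  ->  CrawlResult
```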
Proposed interface
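A sketch of what this interface could look like. All names and signatures here are suggestions, not existing crawl4ai API:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DocumentExtractionResult:
    """Normalized output of a document extractor (proposed type)."""
    markdown: str
    metadata: dict = field(default_factory=dict)

class DocumentExtractionStrategy(ABC):
    """Pluggable document detection + extraction (proposed interface)."""

    @abstractmethod
    def detect(self, url: str, response_headers: dict,
               downloaded_files: Optional[list]) -> bool:
        """Return True if this response should bypass the HTML pipeline."""

    @abstractmethod
    async def extract(self, url: str, body: bytes) -> DocumentExtractionResult:
        """Turn raw document bytes into markdown."""
```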
Configuration
Add `document_extraction_strategy` to `CrawlerRunConfig`. When set, `aprocess_html` (or the caller in `arun`) checks `detect()` before calling `scraping_strategy.scrap()`. If the response is a document, it calls `extract()` instead and builds the `CrawlResult` from the extraction output.
When `None` (default), behavior is identical to today — no breaking change.
Integration point in `arun`
In `async_webcrawler.py`, after receiving `async_response` from `crawler_strategy.crawl()` but before calling `aprocess_html`:
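A rough sketch of the branch. The attribute names on `async_response` (`response_headers`, `downloaded_files`, `body`) are assumptions about the `AsyncCrawlResponse` shape, and `process_response` is a hypothetical helper standing in for the relevant part of `arun`:

```python
# Sketch of the proposed branch inside AsyncWebCrawler.arun() (hypothetical).
async def process_response(url, async_response, config, aprocess_html):
    strategy = getattr(config, "document_extraction_strategy", None)
    if strategy is not None and strategy.detect(
        url,
        getattr(async_response, "response_headers", None) or {},
        getattr(async_response, "downloaded_files", None),
    ):
        extracted = await strategy.extract(url, async_response.body)
        # Build a document CrawlResult-like payload instead of scraping HTML.
        return {
            "url": url,
            "markdown": extracted.markdown,
            "metadata": {**extracted.metadata, "is_document": True},
            "success": True,
        }
    # No document detected: fall through to the existing HTML pipeline.
    return await aprocess_html(url, async_response)
```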
Benefits
- Defaults to `None` — no breaking change.
- Users can implement `DocumentExtractionStrategy` with Kreuzberg, PyMuPDF, Docling, Unstructured, or any other extraction library.
- Documents produce `CrawlResult` objects with populated markdown, enabling downstream code (caching, content filtering, extraction strategies) to work uniformly.
- `CrawlResult.metadata["is_document"]` lets consumers distinguish documents from HTML pages when needed.
- Closes the `downloaded_files` pipeline gap — detection logic moves into crawl4ai where it has full access to the `AsyncCrawlResponse`, eliminating the need for users to reverse-engineer detection heuristics.
Example implementation (using Kreuzberg)
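A sketch of such a strategy. The detection heuristics mirror the three signals described above; the `extract_bytes` call reflects our understanding of Kreuzberg's async API and may need adjusting against its actual signature:

```python
from typing import Optional

DOC_EXTENSIONS = {".pdf", ".docx", ".doc", ".xlsx", ".xls", ".pptx"}
DOC_MIME_PREFIXES = (
    "application/pdf",
    "application/msword",
    "application/vnd.openxmlformats",
    "application/vnd.ms-excel",
)

class KreuzbergExtractionStrategy:
    """Document strategy backed by Kreuzberg (sketch, not a tested integration)."""

    def detect(self, url: str, response_headers: dict,
               downloaded_files: Optional[list]) -> bool:
        # Signal 1: the browser triggered a download (the ERR_ABORTED path).
        if downloaded_files:
            return True
        # Signal 2: the server declared a document Content-Type.
        ctype = (response_headers or {}).get("content-type", "").lower()
        if ctype.startswith(DOC_MIME_PREFIXES):
            return True
        # Signal 3: the URL extension looks like a document.
        path = url.split("?", 1)[0].lower()
        return any(path.endswith(ext) for ext in DOC_EXTENSIONS)

    async def extract(self, url: str, body: bytes,
                      mime_type: str = "application/pdf"):
        # Kreuzberg's async byte-extraction entry point, as we understand it;
        # verify the signature when integrating.
        from kreuzberg import extract_bytes
        result = await extract_bytes(body, mime_type=mime_type)
        return result.content  # extracted text of the document
```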
Usage
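Hypothetical usage once such a parameter exists (`document_extraction_strategy` is the proposed addition, not part of today's `CrawlerRunConfig`; `KreuzbergExtractionStrategy` stands in for any implementation of the interface):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        # proposed parameter -- not part of CrawlerRunConfig today
        document_extraction_strategy=KreuzbergExtractionStrategy(),
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://example.gov/funding-announcement.pdf", config=config
        )
        if result.success and result.metadata.get("is_document"):
            print(result.markdown[:200])

asyncio.run(main())
```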
Alternatives considered
Subclassing `ContentScrapingStrategy`
We initially explored making a `DocumentAwareScrapingStrategy` that overrides `scrap()` to detect documents before parsing HTML. This doesn't work well because:
- `ContentScrapingStrategy.scrap()` receives `(url, html)` — by the time it's called, the HTML has already been fetched and the response headers / `downloaded_files` are not available.
- Detection needs the `AsyncCrawlResponse`, which is not passed to the scraping strategy.
- The interface is synchronous (`scrap`), but document extraction typically requires async I/O.
Post-processing hook
We also considered using crawl4ai's hook system (`after_goto`, etc.) to inject document detection. Hooks can modify page state but don't have a clean way to short-circuit the HTML processing pipeline or replace the `CrawlResult` content.