@janbuchar janbuchar commented Aug 7, 2025

Plan

In my opinion, it makes a lot of sense to do the remaining changes in a separate PR.

  • Introduce a ContextPipeline abstraction
  • Update crawlers to use it
  • Make sure that existing tests pass
  • Refine the ContextPipeline.compose signature and the semantics of BasicCrawlerOptions.contextPipelineEnhancer to maximize DX
  • Write tests for the contextPipelineEnhancer
  • Resolve added TODO comments (fix immediately or make issues)
  • Update documentation

Intent

The context-pipeline branch introduces a fundamental architectural change to how Crawlee crawlers build and enhance the crawling context passed to request handlers. The core motivation is to fix the composition and extensibility nightmare in the current crawler hierarchy.

The Problem

  1. Rigid inheritance hierarchy: Crawlers were stuck in a brittle inheritance chain where each layer manipulated the context object while assuming that it already satisfied its final type. Multiple overrides of BasicCrawler lifecycle methods made the execution flow even harder to follow.

  2. Context enhancement via monkey-patching: Manual property assignments (crawlingContext.page = page, crawlingContext.$ = $) were scattered everywhere, making the code hard to follow and impossible to reason about.

  3. Cleanup coordination: Resource cleanup was handled by separate _cleanupContext methods that were not co-located with the initialization.

  4. Extension mechanism was broken: The CrawlerExtension.use() API tried to let you extend crawlers (the ones based on HttpCrawler) by overwriting properties - completely type-unsafe and fragile as hell.

The Solution

Introduces ContextPipeline - a middleware-based composition pattern where:

  • Each crawler layer defines how it enhances the context through explicit action functions
  • Cleanup logic is co-located with initialization via optional cleanup functions
  • Type safety is maintained through TypeScript generics that track context transformations
  • The pipeline executes middleware sequentially with proper error handling and guaranteed cleanup
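
The bullet points above can be sketched in isolation. The following is an illustrative stand-in, not the actual Crawlee implementation; only the ContextPipeline name and the compose/action/cleanup shape come from this PR, the internals are assumed:

```typescript
// Illustrative sketch: a middleware pipeline that enhances a context object
// step by step and runs cleanups in reverse order, even when the consumer fails.
type Middleware<TIn, TExt> = {
  action: (context: TIn) => Promise<TExt>;
  cleanup?: (context: TIn & TExt, error?: unknown) => Promise<void>;
};

class ContextPipeline<TBase extends object, TFinal extends object> {
  private middlewares: Middleware<any, any>[] = [];

  compose<TExt extends object>(
    middleware: Middleware<TFinal, TExt>,
  ): ContextPipeline<TBase, TFinal & TExt> {
    const next = new ContextPipeline<TBase, TFinal & TExt>();
    next.middlewares = [...this.middlewares, middleware];
    return next;
  }

  async run(base: TBase, consumer: (context: TFinal) => Promise<void>): Promise<void> {
    let context: any = base;
    const cleanups: Array<(error?: unknown) => Promise<void>> = [];
    let error: unknown;
    try {
      for (const { action, cleanup } of this.middlewares) {
        const extension = await action(context);
        context = { ...context, ...extension };
        if (cleanup) {
          const snapshot = context;
          cleanups.push((e?: unknown) => cleanup(snapshot, e));
        }
      }
      await consumer(context);
    } catch (e) {
      error = e;
      throw e;
    } finally {
      // Guaranteed cleanup, innermost middleware first.
      for (const cb of cleanups.reverse()) {
        await cb(error);
      }
    }
  }
}
```

Note how each middleware's cleanup receives the context as it existed when that middleware ran, which is what makes co-locating initialization and teardown safe.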

Key Design Decisions

1. Middleware Pattern

Declarative middleware composition with co-located cleanup:

contextPipeline.compose({
  action: async (context) => ({ page, $ }),
  cleanup: async (context) => { await context.page.close(); },
})

2. Type-Safe Context Building

The ContextPipeline<TBase, TFinal> tracks type transformations through the chain:

ContextPipeline<CrawlingContext, CrawlingContext>
  .compose<{ page: Page }>(...) // ContextPipeline<CrawlingContext, CrawlingContext & { page: Page }>
  .compose<{ $: CheerioAPI }>(...) // ContextPipeline<CrawlingContext, CrawlingContext & { page: Page, $: CheerioAPI }>

3. New Extension Mechanism

The CrawlerExtension.use() is gone. New approach via contextPipelineEnhancer:

new BasicCrawler({
  contextPipelineEnhancer: (pipeline) => 
    pipeline.compose({
      action: async (context) => ({ myCustomProp: ... })
    })
})
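
To illustrate how such an enhancer can stay type-safe end to end, here is a simplified stand-in; only the compose name and the enhancer concept come from the PR, while makePipeline, build, and myCustomProp are made up for the example:

```typescript
// Hypothetical stand-in for the real pipeline type.
interface Pipeline<TIn extends object, TOut extends object> {
  build(context: TIn): Promise<TOut>;
  compose<TExt extends object>(middleware: {
    action: (context: TOut) => Promise<TExt>;
  }): Pipeline<TIn, TOut & TExt>;
}

function makePipeline<TIn extends object, TOut extends object>(
  build: (context: TIn) => Promise<TOut>,
): Pipeline<TIn, TOut> {
  return {
    build,
    compose<TExt extends object>(middleware: {
      action: (context: TOut) => Promise<TExt>;
    }): Pipeline<TIn, TOut & TExt> {
      return makePipeline(async (context: TIn) => {
        const out = await build(context);
        const extension = await middleware.action(out);
        return { ...out, ...extension };
      });
    },
  };
}

interface BaseContext { url: string; }

// An enhancer in the style of contextPipelineEnhancer: it receives the
// default pipeline and returns an extended one. The added property flows
// into the result type automatically.
const enhancer = (pipeline: Pipeline<BaseContext, BaseContext>) =>
  pipeline.compose({ action: async () => ({ myCustomProp: "hello" }) });

const enhanced = enhancer(makePipeline(async (context: BaseContext) => context));
```

The request handler then sees `myCustomProp` on its context without any casts, which is the DX goal mentioned in the plan above.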

Discussion Topics

1. The API

The current way to express a context pipeline middleware has some shortcomings (ContextPipeline.compose, BasicCrawlerOptions.contextPipelineEnhancer). I suggest resolving this in another PR.

2. Migration Path

For most legitimate use cases, this should be non-breaking. Those who extend the Crawler classes in non-trivial ways may need to adjust their code though - the non-public interface of BasicCrawler and HttpCrawler changed quite a bit.

3. Performance

The pipeline uses Object.defineProperties for each middleware. Is this a serious performance consideration?
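
Whether this matters for throughput would need measuring, but one semantic subtlety is worth keeping in mind: the defaults of Object.defineProperties differ from plain assignment. A self-contained illustration (unrelated to Crawlee's actual code):

```typescript
const target: Record<string, unknown> = {};

// Plain assignment: the property is enumerable, writable, and configurable.
target.page = "page";

// defineProperties: descriptor flags must be set explicitly, otherwise the
// property defaults to non-enumerable, non-writable, non-configurable.
Object.defineProperties(target, {
  $: { value: "cheerio", enumerable: true, writable: true, configurable: true },
  hidden: { value: 42 }, // defaults: not enumerable, not writable
});
```

So besides raw speed, the choice also affects whether enhanced properties show up in `Object.keys`, spreads, and serialization.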

@janbuchar janbuchar added the t-tooling label (Issues with this label are in the ownership of the tooling team.) Aug 7, 2025
/** The main middleware function that enhances the context */
action: (context: TCrawlingContext) => Promise<TCrawlingContextExtension>;
/** Optional cleanup function called after the consumer finishes or fails */
cleanup?: (context: TCrawlingContext & TCrawlingContextExtension, error?: unknown) => Promise<void>;
Contributor Author (@janbuchar):
Returning a cleanup callback from action may be a better approach. A benefit of that would be having access to the outer scope.
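
A sketch of that alternative shape (a hypothetical API, not what was merged): the action returns the extension together with a cleanup closure, so cleanup can capture locals such as the page directly:

```typescript
// Hypothetical alternative: the action's result bundles the context
// extension with a cleanup closure.
type ActionResult<TExt extends object> = {
  extension: TExt;
  cleanup?: (error?: unknown) => Promise<void>;
};

async function openPageMiddleware(_context: { url: string }) {
  // Stand-in for a real browser page; `closed` exists only for demonstration.
  const page = {
    closed: false,
    async close() {
      this.closed = true;
    },
  };
  const result: ActionResult<{ page: typeof page }> = {
    extension: { page },
    // Cleanup closes over `page` from this scope instead of reading it
    // back off the enhanced context.
    cleanup: async () => { await page.close(); },
  };
  return result;
}
```

The trade-off is that the pipeline can no longer see cleanup functions up front; it only learns about them after each action has run.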

@janbuchar janbuchar marked this pull request as ready for review October 16, 2025 13:26
@barjin barjin (Member) left a comment:
Let's merge this quick so we unblock all the other PRs 😄

A few ideas to get the discussion started:

@B4nan B4nan (Member) left a comment:
first round of comments, it looks good overall, well done

@janbuchar janbuchar requested review from B4nan and barjin November 20, 2025 14:48
@barjin barjin (Member) left a comment:
lgtm, thank you!

For keeping track: in person we briefly discussed typing the context in a separate createXYZRouter() when using extendContext. The current solution is having a separate custom Context type / interface and passing it to both the extendContext and the router-creating functions.
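
A minimal sketch of that pattern, with all names hypothetical: one shared context interface drives both the extendContext typing and the router typing, so handlers stay fully typed without a dedicated createXYZRouter() helper:

```typescript
// Shared custom context type, passed to both extendContext and the router.
interface MyContext {
  url: string;
  myCustomProp: string;
}

// Toy label-based router, typed against the custom context.
function createRouter<TContext>() {
  const handlers = new Map<string, (context: TContext) => Promise<void>>();
  return {
    addHandler(label: string, handler: (context: TContext) => Promise<void>) {
      handlers.set(label, handler);
    },
    async route(label: string, context: TContext) {
      const handler = handlers.get(label);
      if (!handler) throw new Error(`No handler for label "${label}"`);
      await handler(context);
    },
  };
}
```

The duplication here (spelling out MyContext once and reusing it in two places) is exactly the ergonomic cost the in-person discussion was about.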

Comment on lines +72 to +79
const crawler = new CheerioCrawler({
extendContext: () => ({ crawler }),
requestHandler: async (context) => {
if (Math.random() < 0.01) {
context.crawler.stop()
}
}
})
Member:
this is amazing, so much better than the initial proposal 👍

Comment on lines +542 to +544
session: !useIncognitoPages
? (browserControllerInstance.launchContext.session as Session)
: crawlingContext.session,
Member:
i'd flip this one

Suggested change
session: !useIncognitoPages
? (browserControllerInstance.launchContext.session as Session)
: crawlingContext.session,
session: useIncognitoPages
? crawlingContext.session
: (browserControllerInstance.launchContext.session as Session),

"@apify/timeout": "^0.3.2",
"@crawlee/browser": "3.13.3",
"@crawlee/browser-pool": "3.13.3",
"@crawlee/cheerio": "3.13.3",
Member:
do we really import from this package?

Contributor Author (@janbuchar):
yeah, AdaptivePlaywrightCrawler now uses CheerioCrawler directly (kind of)

4 participants