@janbuchar janbuchar commented Aug 7, 2025

Plan

In my opinion, it makes a lot of sense to do the remaining changes in a separate PR.

  • Introduce a ContextPipeline abstraction
  • Update crawlers to use it
  • Make sure that existing tests pass
  • Refine the ContextPipeline.compose signature and the semantics of BasicCrawlerOptions.contextPipelineEnhancer to maximize DX
  • Write tests for the contextPipelineEnhancer
  • Resolve added TODO comments (fix immediately or make issues)
  • Update documentation

Intent

The context-pipeline branch introduces a fundamental architectural change to how Crawlee crawlers build and enhance the crawling context passed to request handlers. The core motivation is to fix the composition and extensibility nightmare in the current crawler hierarchy.

The Problem

  1. Rigid inheritance hierarchy: Crawlers were stuck in a brittle inheritance chain where each layer manipulated the context object while assuming that it already satisfied its final type. Multiple overrides of BasicCrawler lifecycle methods made the execution flow even harder to follow.

  2. Context enhancement via monkey-patching: Manual property assignments (crawlingContext.page = page, crawlingContext.$ = $) were scattered everywhere, making the code hard to follow and impossible to reason about.

  3. Cleanup coordination: Resource cleanup was handled by separate _cleanupContext methods that were not co-located with the initialization.

  4. Extension mechanism was broken: The CrawlerExtension.use() API tried to let you extend crawlers (the ones based on HttpCrawler) by overwriting properties - completely type-unsafe and fragile as hell.

The Solution

Introduces ContextPipeline - a middleware-based composition pattern where:

  • Each crawler layer defines how it enhances the context through explicit action functions
  • Cleanup logic is co-located with initialization via optional cleanup functions
  • Type safety is maintained through TypeScript generics that track context transformations
  • The pipeline executes middleware sequentially with proper error handling and guaranteed cleanup
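
The bullet points above can be sketched in isolation. The following is an illustrative stand-in, not the actual Crawlee implementation; only the ContextPipeline name and the compose/action/cleanup shape come from this PR, the internals are assumed:

```typescript
// Illustrative sketch: a middleware pipeline that enhances a context object
// step by step and runs cleanups in reverse order, even when the consumer fails.
type Middleware<TIn, TExt> = {
  action: (context: TIn) => Promise<TExt>;
  cleanup?: (context: TIn & TExt, error?: unknown) => Promise<void>;
};

class ContextPipeline<TBase extends object, TFinal extends object> {
  private middlewares: Middleware<any, any>[] = [];

  compose<TExt extends object>(
    middleware: Middleware<TFinal, TExt>,
  ): ContextPipeline<TBase, TFinal & TExt> {
    const next = new ContextPipeline<TBase, TFinal & TExt>();
    next.middlewares = [...this.middlewares, middleware];
    return next;
  }

  async run(base: TBase, consumer: (context: TFinal) => Promise<void>): Promise<void> {
    let context: any = base;
    const cleanups: Array<(error?: unknown) => Promise<void>> = [];
    let error: unknown;
    try {
      for (const { action, cleanup } of this.middlewares) {
        const extension = await action(context);
        context = { ...context, ...extension };
        if (cleanup) {
          const snapshot = context;
          cleanups.push((e?: unknown) => cleanup(snapshot, e));
        }
      }
      await consumer(context);
    } catch (e) {
      error = e;
      throw e;
    } finally {
      // Guaranteed cleanup, innermost middleware first.
      for (const cb of cleanups.reverse()) {
        await cb(error);
      }
    }
  }
}
```

Note how each middleware's cleanup receives the context as it existed when that middleware ran, which is what makes co-locating initialization and teardown safe.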

Key Design Decisions

1. Middleware Pattern

Declarative middleware composition with co-located cleanup:

contextPipeline.compose({
  action: async (context) => ({ page, $ }),
  cleanup: async (context) => { await context.page.close(); },
})

2. Type-Safe Context Building

The ContextPipeline<TBase, TFinal> tracks type transformations through the chain:

ContextPipeline<CrawlingContext, CrawlingContext>
  .compose<{ page: Page }>(...) // ContextPipeline<CrawlingContext, CrawlingContext & { page: Page }>
  .compose<{ $: CheerioAPI }>(...) // ContextPipeline<CrawlingContext, CrawlingContext & { page: Page, $: CheerioAPI }>

3. New Extension Mechanism

The CrawlerExtension.use() is gone. New approach via contextPipelineEnhancer:

new BasicCrawler({
  contextPipelineEnhancer: (pipeline) => 
    pipeline.compose({
      action: async (context) => ({ myCustomProp: ... })
    })
})
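
To illustrate how such an enhancer can stay type-safe end to end, here is a simplified stand-in; only the compose name and the enhancer concept come from the PR, while makePipeline, build, and myCustomProp are made up for the example:

```typescript
// Hypothetical stand-in for the real pipeline type.
interface Pipeline<TIn extends object, TOut extends object> {
  build(context: TIn): Promise<TOut>;
  compose<TExt extends object>(middleware: {
    action: (context: TOut) => Promise<TExt>;
  }): Pipeline<TIn, TOut & TExt>;
}

function makePipeline<TIn extends object, TOut extends object>(
  build: (context: TIn) => Promise<TOut>,
): Pipeline<TIn, TOut> {
  return {
    build,
    compose<TExt extends object>(middleware: {
      action: (context: TOut) => Promise<TExt>;
    }): Pipeline<TIn, TOut & TExt> {
      return makePipeline(async (context: TIn) => {
        const out = await build(context);
        const extension = await middleware.action(out);
        return { ...out, ...extension };
      });
    },
  };
}

interface BaseContext { url: string; }

// An enhancer in the style of contextPipelineEnhancer: it receives the
// default pipeline and returns an extended one. The added property flows
// into the result type automatically.
const enhancer = (pipeline: Pipeline<BaseContext, BaseContext>) =>
  pipeline.compose({ action: async () => ({ myCustomProp: "hello" }) });

const enhanced = enhancer(makePipeline(async (context: BaseContext) => context));
```

The request handler then sees `myCustomProp` on its context without any casts, which is the DX goal mentioned in the plan above.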

Discussion Topics

1. The API

The current way to express a context pipeline middleware has some shortcomings (ContextPipeline.compose, BasicCrawlerOptions.contextPipelineEnhancer). I suggest resolving this in another PR.

2. Migration Path

For most legitimate use cases, this should be non-breaking. Those who extend the Crawler classes in non-trivial ways may need to adjust their code though - the non-public interface of BasicCrawler and HttpCrawler changed quite a bit.

3. Performance

The pipeline uses Object.defineProperties for each middleware. Is this a serious performance consideration?
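
Whether this matters for throughput would need measuring, but one semantic subtlety is worth keeping in mind: the defaults of Object.defineProperties differ from plain assignment. A self-contained illustration (unrelated to Crawlee's actual code):

```typescript
const target: Record<string, unknown> = {};

// Plain assignment: the property is enumerable, writable, and configurable.
target.page = "page";

// defineProperties: descriptor flags must be set explicitly, otherwise the
// property defaults to non-enumerable, non-writable, non-configurable.
Object.defineProperties(target, {
  $: { value: "cheerio", enumerable: true, writable: true, configurable: true },
  hidden: { value: 42 }, // defaults: not enumerable, not writable
});
```

So besides raw speed, the choice also affects whether enhanced properties show up in `Object.keys`, spreads, and serialization.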

@janbuchar janbuchar added the t-tooling label (Issues with this label are in the ownership of the tooling team.) Aug 7, 2025
/** The main middleware function that enhances the context */
action: (context: TCrawlingContext) => Promise<TCrawlingContextExtension>;
/** Optional cleanup function called after the consumer finishes or fails */
cleanup?: (context: TCrawlingContext & TCrawlingContextExtension, error?: unknown) => Promise<void>;
Contributor Author (@janbuchar):
Returning a cleanup callback from action may be a better approach. A benefit of that would be having access to the outer scope.
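
A sketch of that alternative shape (a hypothetical API, not what was merged): the action returns the extension together with a cleanup closure, so cleanup can capture locals such as the page directly:

```typescript
// Hypothetical alternative: the action's result bundles the context
// extension with a cleanup closure.
type ActionResult<TExt extends object> = {
  extension: TExt;
  cleanup?: (error?: unknown) => Promise<void>;
};

async function openPageMiddleware(_context: { url: string }) {
  // Stand-in for a real browser page; `closed` exists only for demonstration.
  const page = {
    closed: false,
    async close() {
      this.closed = true;
    },
  };
  const result: ActionResult<{ page: typeof page }> = {
    extension: { page },
    // Cleanup closes over `page` from this scope instead of reading it
    // back off the enhanced context.
    cleanup: async () => { await page.close(); },
  };
  return result;
}
```

The trade-off is that the pipeline can no longer see cleanup functions up front; it only learns about them after each action has run.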

@janbuchar janbuchar marked this pull request as ready for review October 16, 2025 13:26
@barjin barjin (Member) left a comment:
Let's merge this quick so we unblock all the other PRs 😄

A few ideas to get the discussion started:

@B4nan B4nan (Member) left a comment:
first round of comments, it looks good overall, well done

@janbuchar janbuchar requested review from B4nan and barjin November 20, 2025 14:48
@barjin barjin (Member) left a comment:
lgtm, thank you!

For keeping track: in person we briefly discussed typing the context in a separate createXYZRouter() when using extendContext. The current solution is having a separate custom Context type / interface and passing it to both the extendContext and the router-creating functions.
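
A minimal sketch of that pattern, with all names hypothetical: one shared context interface drives both the extendContext typing and the router typing, so handlers stay fully typed without a dedicated createXYZRouter() helper:

```typescript
// Shared custom context type, passed to both extendContext and the router.
interface MyContext {
  url: string;
  myCustomProp: string;
}

// Toy label-based router, typed against the custom context.
function createRouter<TContext>() {
  const handlers = new Map<string, (context: TContext) => Promise<void>>();
  return {
    addHandler(label: string, handler: (context: TContext) => Promise<void>) {
      handlers.set(label, handler);
    },
    async route(label: string, context: TContext) {
      const handler = handlers.get(label);
      if (!handler) throw new Error(`No handler for label "${label}"`);
      await handler(context);
    },
  };
}
```

The duplication here (spelling out MyContext once and reusing it in two places) is exactly the ergonomic cost the in-person discussion was about.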

Comment on lines +72 to +79
const crawler = new CheerioCrawler({
extendContext: () => ({ crawler }),
requestHandler: async (context) => {
if (Math.random() < 0.01) {
context.crawler.stop()
}
}
})
Member:
this is amazing, so much better than the initial proposal 👍

Comment on lines +542 to +544
session: !useIncognitoPages
? (browserControllerInstance.launchContext.session as Session)
: crawlingContext.session,
Member:
i'd flip this one

Suggested change
session: !useIncognitoPages
? (browserControllerInstance.launchContext.session as Session)
: crawlingContext.session,
session: useIncognitoPages
? crawlingContext.session
: (browserControllerInstance.launchContext.session as Session),

"@apify/timeout": "^0.3.2",
"@crawlee/browser": "3.13.3",
"@crawlee/browser-pool": "3.13.3",
"@crawlee/cheerio": "3.13.3",
Member:
do we really import from this package?

Contributor Author (@janbuchar):
yeah, AdaptivePlaywrightCrawler now uses CheerioCrawler directly (kind of)

4 participants