-
Notifications
You must be signed in to change notification settings - Fork 1.1k
refactor!: Introduce the ContextPipeline abstraction
#3119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: v4
Are you sure you want to change the base?
Conversation
| /** The main middleware function that enhances the context */ | ||
| action: (context: TCrawlingContext) => Promise<TCrawlingContextExtension>; | ||
| /** Optional cleanup function called after the consumer finishes or fails */ | ||
| cleanup?: (context: TCrawlingContext & TCrawlingContextExtension, error?: unknown) => Promise<void>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning a cleanup callback from action may be a better approach. A benefit of that would be having access to the outer scope
e8e52aa to
3a530b8
Compare
barjin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's merge this quick so we unblock all the other PRs 😄
A few ideas to get the discussion started:
B4nan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first round of comments, it looks good overall, well done
barjin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, thank you!
For keeping track: in person we briefly discussed typing the context in a separate createXYZRouter() when using extendContext. The current solution is having a separate custom Context type / interface and passing it to both the extendContext and the router-creating functions.
| const crawler = new CheerioCrawler({ | ||
| extendContext: () => ({ crawler }), | ||
| requestHandler: async (context) => { | ||
| if (Math.random() < 0.01) { | ||
| context.crawler.stop() | ||
| } | ||
| } | ||
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is amazing, so much better than the initial proposal 👍
| session: !useIncognitoPages | ||
| ? (browserControllerInstance.launchContext.session as Session) | ||
| : crawlingContext.session, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd flip this one
| session: !useIncognitoPages | |
| ? (browserControllerInstance.launchContext.session as Session) | |
| : crawlingContext.session, | |
| session: useIncognitoPages | |
| ? crawlingContext.session | |
| : (browserControllerInstance.launchContext.session as Session), |
| "@apify/timeout": "^0.3.2", | ||
| "@crawlee/browser": "3.13.3", | ||
| "@crawlee/browser-pool": "3.13.3", | ||
| "@crawlee/cheerio": "3.13.3", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we really import from this package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, AdaptivePlaywrightCrawler now uses CheerioCrawler directly (kind of)
ContextPipelinefrom the Python counterpart #2479HttpCrawler.use#3106skipNavigationworks #3107Plan
In my opinion, it makes a lot of sense to do the remaining changes in a separate PR.
ContextPipelineabstractionContextPipeline.composesignature and the semantics ofBasicCrawlerOptions.contextPipelineEnhancerto maximize DXcontextPipelineEnhancerIntent
The
context-pipelinebranch introduces a fundamental architectural change to how Crawlee crawlers build and enhance the crawling context passed to request handlers. The core motivation is to fix the composition and extensibility nightmare in the current crawler hierarchy.The Problem
Rigid inheritance hierarchy: Crawlers were stuck in a brittle inheritance chain where each layer manipulated the context object while assuming that it already satisfied its final type. Multiple overrides of
BasicCrawlerlifecycle methods made the execution flow even harder to follow.Context enhancement via monkey-patching: Manual property assignment (
crawlingContext.page = page,crawlingContext.$ = $) scattered everywhere. It was a mess to follow and impossible to reason about.Cleanup coordination: Resource cleanup was handled by separate
_cleanupContextmethods that were not co-located with the initialization.Extension mechanism was broken: The
CrawlerExtension.use()API tried to let you extend crawlers (the ones based onHttpCrawler) by overwriting properties - completely type-unsafe and fragile as hell.The Solution
Introduces
ContextPipeline- a middleware-based composition pattern where:actionfunctionscleanupfunctionsKey Design Decisions
1. Middleware Pattern
Declarative middleware composition with co-located cleanup:
2. Type-Safe Context Building
The
ContextPipeline<TBase, TFinal>tracks type transformations through the chain:3. New Extension Mechanism
The
CrawlerExtension.use()is gone. New approach viacontextPipelineEnhancer:Discussion Topics
1. The API
The current way to express a context pipeline middleware has some shortcomings (
ContextPipeline.compose,BasicCrawlerOptions.contextPipelineEnhancer). I suggest resolving this in another PR.2. Migration Path
For most legitimate use cases, this should be non-breaking. Those who extend the Crawler classes in non-trivial ways may need to adjust their code though - the non-public interface of
BasicCrawlerandHttpCrawlerchanged quite a bit.3. Performance
The pipeline uses
Object.definePropertiesfor each middleware. Is this a serious performance consideration?