feat(utils): add Schema.org microdata extraction utility #3246

Yash-Hirgude · 2025-11-15T17:34:15Z

Introduces #276

This PR adds a reusable function extractSchemaOrgMicrodata to the @crawlee/utils package.

It enables extracting Schema.org microdata from:

a browser DOM (e.g., Puppeteer / Playwright crawlers), or
raw HTML (e.g., HTTP crawler using JSDOM or Cheerio)

The extractor uses only native DOM APIs and does not depend on jQuery.
It is fully serializable, so it can run in a browser context (via page.evaluate in Puppeteer/Playwright) as well as in Node.js environments (JSDOM/Cheerio), with no external dependencies.
Test cases are included to verify extraction from both input types (browser DOM and raw HTML).
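
For readers unfamiliar with microdata, the following is a minimal sketch of how extraction with plain DOM calls might look. It is not the code from this PR: the names ElementLike, MicrodataItem, propertyValue, readItem and extractTopLevelItems are hypothetical, the output shape is an assumption, and itemref and URL resolution are deliberately left out.

```typescript
// Hypothetical structural subset of the DOM Element interface.
// Browser and JSDOM elements have all of these members.
interface ElementLike {
    tagName: string;
    textContent: string | null;
    children: ArrayLike<ElementLike>;
    getAttribute(name: string): string | null;
    hasAttribute(name: string): boolean;
}

// Assumed output shape: every property maps to a list of values.
interface MicrodataItem {
    type: string | null;
    properties: Record<string, Array<string | MicrodataItem>>;
}

// Elements whose property value comes from an attribute rather than text,
// following the HTML microdata rules (URLs are returned as written, unresolved).
const VALUE_ATTRIBUTE: Record<string, string> = {
    meta: 'content',
    a: 'href', area: 'href', link: 'href',
    audio: 'src', embed: 'src', iframe: 'src', img: 'src',
    source: 'src', track: 'src', video: 'src',
    object: 'data',
    data: 'value', meter: 'value',
};

function propertyValue(el: ElementLike): string {
    const tag = el.tagName.toLowerCase();
    const text = (el.textContent ?? '').trim(); // trimmed for convenience
    if (tag === 'time') return el.getAttribute('datetime') ?? text;
    const attr = VALUE_ATTRIBUTE[tag];
    return attr ? el.getAttribute(attr) ?? '' : text;
}

// Collect the properties of one itemscope element, recursing into nested items.
function readItem(scope: ElementLike): MicrodataItem {
    const item: MicrodataItem = { type: scope.getAttribute('itemtype'), properties: {} };
    const visit = (el: ElementLike): void => {
        for (let i = 0; i < el.children.length; i++) {
            const child = el.children[i];
            if (!child) continue;
            const isScope = child.hasAttribute('itemscope');
            const names = (child.getAttribute('itemprop') ?? '').split(/\s+/).filter(Boolean);
            if (names.length > 0) {
                const value = isScope ? readItem(child) : propertyValue(child);
                for (const name of names) {
                    const bucket = item.properties[name] ?? [];
                    bucket.push(value);
                    item.properties[name] = bucket;
                }
            }
            // Properties inside a nested itemscope belong to that item, not this one.
            if (!isScope) visit(child);
        }
    };
    visit(scope);
    return item;
}

// Top-level items are itemscope elements that are not a property of another item.
function extractTopLevelItems(root: ElementLike): MicrodataItem[] {
    const items: MicrodataItem[] = [];
    const walk = (el: ElementLike): void => {
        if (el.hasAttribute('itemscope') && !el.hasAttribute('itemprop')) {
            items.push(readItem(el));
        }
        for (let i = 0; i < el.children.length; i++) {
            const child = el.children[i];
            if (child) walk(child);
        }
    };
    walk(root);
    return items;
}
```

In a browser or JSDOM this sketch would be called as extractTopLevelItems(document.documentElement). To pass it to page.evaluate, the helpers would have to be nested inside a single function, since only the function passed in is serialized.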

Yash-Hirgude added 2 commits November 15, 2025 22:25

extract microdata function and tests added

b140bcc

removed unecessary space

ec34f7a
