@Yash-Hirgude
Introduces #276

This PR adds a reusable function extractSchemaOrgMicrodata to the @crawlee/utils package.

It enables extracting Schema.org microdata from:

  • a browser DOM (e.g., Puppeteer / Playwright crawlers), or
  • raw HTML (e.g., HTTP crawler using JSDOM or Cheerio)

The extractor uses only native DOM APIs and has no jQuery or other external dependencies.
It is fully serializable, so it can run both in a browser context (via page.evaluate in Puppeteer/Playwright) and in Node.js environments (JSDOM/Cheerio).
Comprehensive test cases are included to ensure correct extraction across different input types.
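
To illustrate the approach described above, here is a minimal sketch of a microdata walk that relies only on standard DOM members (children, getAttribute, hasAttribute, textContent, tagName). It is not the code from this PR: the names extractMicrodataSketch, ElementLike and MicrodataItem, the _type key, and the output shape are all assumptions made for illustration. It also simplifies the microdata algorithm (no itemref, no itemid, and items nested inside another item without an itemprop are skipped).

```typescript
// Hypothetical sketch; names and output shape are assumptions, not the PR's API.

// Minimal structural view of an element. Browser and JSDOM elements are
// expected to satisfy it, and plain objects can stand in for them in tests.
interface ElementLike {
    tagName: string;
    textContent: string | null;
    children: ArrayLike<ElementLike>;
    getAttribute(name: string): string | null;
    hasAttribute(name: string): boolean;
}

type MicrodataValue = string | MicrodataItem;
interface MicrodataItem {
    _type?: string;
    [prop: string]: MicrodataValue | MicrodataValue[] | undefined;
}

const HREF_TAGS = ['a', 'area', 'link'];
const SRC_TAGS = ['audio', 'embed', 'iframe', 'img', 'source', 'track', 'video'];

// Pick the property value based on the element type, falling back to text.
function readValue(el: ElementLike): string {
    const tag = el.tagName.toLowerCase();
    const text = (el.textContent ?? '').trim();
    if (tag === 'meta') return el.getAttribute('content') ?? '';
    if (HREF_TAGS.indexOf(tag) !== -1) return el.getAttribute('href') ?? '';
    if (SRC_TAGS.indexOf(tag) !== -1) return el.getAttribute('src') ?? '';
    if (tag === 'time') return el.getAttribute('datetime') ?? text;
    if (tag === 'data' || tag === 'meter') return el.getAttribute('value') ?? '';
    return text;
}

// Collect the properties of one itemscope element, recursing into nested items.
function extractItem(scope: ElementLike): MicrodataItem {
    const item: MicrodataItem = {};
    const type = scope.getAttribute('itemtype');
    if (type) item._type = type;

    const walk = (node: ElementLike): void => {
        for (const child of Array.from(node.children)) {
            const nested = child.hasAttribute('itemscope');
            const names = (child.getAttribute('itemprop') ?? '').split(/\s+/).filter(Boolean);
            if (names.length > 0) {
                const value: MicrodataValue = nested ? extractItem(child) : readValue(child);
                for (const name of names) {
                    const existing = item[name];
                    if (existing === undefined) item[name] = value;
                    else if (Array.isArray(existing)) existing.push(value);
                    else item[name] = [existing, value]; // repeated property becomes a list
                }
            }
            // Properties inside a nested item belong to that item, not this one.
            if (!nested) walk(child);
        }
    };
    walk(scope);
    return item;
}

// Find top-level items (itemscope without itemprop) below the given root.
function extractMicrodataSketch(root: ElementLike): MicrodataItem[] {
    const items: MicrodataItem[] = [];
    const visit = (node: ElementLike): void => {
        for (const child of Array.from(node.children)) {
            if (child.hasAttribute('itemscope') && !child.hasAttribute('itemprop')) {
                items.push(extractItem(child));
            } else {
                visit(child);
            }
        }
    };
    visit(root);
    return items;
}
```

In a browser or JSDOM context, the root would typically be document.documentElement. When the logic is run through page.evaluate, the helpers have to be defined inside the function that gets serialized, since closures from the Node.js side are not available in the page.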
