Clarification on JsonXPathExtractionStrategy Schema and type Parameter #1664

ca-arch · 2025-12-09T11:51:13Z

Hello,

I'm currently using JsonXPathExtractionStrategy and have some questions regarding its schema definition:

The XPath expressions defined in the schema seem slightly different from standard XPath syntax. Could you clarify if there are any specific rules or limitations?

The type field in each schema entry is not fully documented. Could you provide a detailed explanation of the available types and how they affect data extraction?

A more comprehensive documentation or examples for schema usage would be very helpful.

Thank you for your time and support!

hafezparast · 2026-03-27T12:23:53Z

The docs page at docs.crawl4ai.com/extraction/no-llm-strategies covers all the types but they're spread across examples rather than in one reference table. Here's a consolidated summary:

Simple types:

| Type | Description | Extra field needed |
|------|-------------|--------------------|
| `"text"` | Text content of the element | none |
| `"attribute"` | HTML attribute value | `"attribute": "href"` |
| `"html"` | Raw inner HTML | none |
| `"regex"` | Regex on text content | `"pattern"`, optionally `"group"` |

Compound types:

| Type | Description | Extra fields needed |
|------|-------------|---------------------|
| `"nested"` | Single nested object | `"selector"` + `"fields": [...]` |
| `"list"` | List of items | `"selector"` + `"fields": [...]` |
| `"nested_list"` | List of nested objects | `"selector"` + `"fields": [...]` |
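
To show how the three compound types sit side by side, here is a sketch of a schema written as a Python dict. The selectors, class names, and field names are all hypothetical and only illustrate the shape described in the table; check the result against your own page before relying on it.

```python
# Hypothetical schema fragment: every selector and name below is made up.
schema = {
    "name": "Products",
    "baseSelector": "//div[@class='product']",
    "fields": [
        # "nested": a single sub-object per base element
        {
            "name": "details",
            "selector": ".//div[@class='details']",
            "type": "nested",
            "fields": [
                {"name": "brand", "selector": ".//span[@class='brand']", "type": "text"},
            ],
        },
        # "list": a list of items, one per matched element
        {
            "name": "features",
            "selector": ".//li[@class='feature']",
            "type": "list",
            "fields": [{"name": "feature", "type": "text"}],
        },
        # "nested_list": a list of sub-objects, each with its own fields
        {
            "name": "reviews",
            "selector": ".//div[@class='review']",
            "type": "nested_list",
            "fields": [
                {"name": "reviewer", "selector": ".//span[@class='reviewer']", "type": "text"},
                {"name": "rating", "selector": ".//span[@class='rating']", "type": "text"},
            ],
        },
    ],
}
```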

Pipeline: type can also be a list like ["text", "regex"] — values flow through each step sequentially. For example:

{
    "name": "price_number",
    "selector": ".//span[@class='price']",
    "type": ["text", "regex"],
    "pattern": "\\d+\\.?\\d*",
    "group": 0
}
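
To make the pipeline concrete, here is a minimal standalone sketch of what a text-then-regex step does to a single value. It uses only Python's `re` module and is not crawl4ai's internal code; the helper name `text_then_regex` and the sample strings are hypothetical.

```python
import re


def text_then_regex(text, pattern, group=0):
    """Mimic a ["text", "regex"] pipeline on already-extracted text.

    Hypothetical helper for illustration only, not part of crawl4ai.
    """
    match = re.search(pattern, text)
    # Return None when nothing matches so a caller could fall back to a default.
    return match.group(group) if match else None


# Same pattern as in the schema entry above, written as a raw string.
PRICE_PATTERN = r"\d+\.?\d*"

print(text_then_regex("Price: $19.99 (was $24.50)", PRICE_PATTERN))  # 19.99
print(text_then_regex("Out of stock", PRICE_PATTERN))                # None
```

With `group` set to 0 the whole match is returned, and `re.search` stops at the first match, so only the first number in the text is kept.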

Each field can also have:

"default": "N/A" — fallback value if the selector doesn't match
"transform": "strip" — post-processing transformation
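
As a rough illustration of how these two options behave, the sketch below applies them to one extracted value. The helper `finalize` is hypothetical, and the order in which the library applies the default and the transform is an assumption here, not something taken from its source.

```python
def finalize(value, field):
    """Apply "default" and "transform" to one extracted value.

    Hypothetical helper; only the "strip" transform listed above is modeled.
    """
    if value is None:
        # Selector matched nothing: fall back to the declared default, if any.
        return field.get("default")
    if field.get("transform") == "strip":
        value = value.strip()
    return value


field = {"name": "title", "selector": ".//h2", "type": "text",
         "default": "N/A", "transform": "strip"}

print(repr(finalize("  Widget  ", field)))  # 'Widget'
print(repr(finalize(None, field)))          # 'N/A'
```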

For XPath selectors: JsonXPathExtractionStrategy uses standard XPath 1.0 via lxml. The key thing to remember is to use the .// prefix on field selectors so they are evaluated relative to the base element. A field selector that starts with // is evaluated from the document root instead, so it can match elements outside the current base element.

{
    "baseSelector": "//div[contains(@class, 'product-card')]",
    "fields": [
        {"name": "title", "selector": ".//h2", "type": "text"},
        {"name": "link", "selector": ".//a", "type": "attribute", "attribute": "href"},
        {"name": "tags", "type": "list", "selector": ".//span[@class='tag']",
         "fields": [{"name": "tag", "type": "text"}]}
    ]
}
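
The difference between `.//` and `//` can be checked directly with lxml, independently of crawl4ai. This is a small sketch on a made-up page: the HTML and the variable names are hypothetical, and it exercises only lxml's own `xpath()` method.

```python
from lxml import html

# Made-up page with two product cards.
PAGE = """
<html><body>
  <div class="product-card"><h2>Alpha</h2></div>
  <div class="product-card"><h2>Beta</h2></div>
</body></html>
"""

tree = html.fromstring(PAGE)
cards = tree.xpath("//div[contains(@class, 'product-card')]")

# Relative selector: each card sees only its own <h2>.
relative = [[h.text_content() for h in card.xpath(".//h2")] for card in cards]

# Absolute selector: every card sees all <h2> elements in the document.
absolute = [[h.text_content() for h in card.xpath("//h2")] for card in cards]

print(relative)  # [['Alpha'], ['Beta']]
print(absolute)  # [['Alpha', 'Beta'], ['Alpha', 'Beta']]
```

If every extracted item ends up with the same value, an absolute field selector like the second one is a likely cause.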

You're right that a dedicated reference page for this would help — the info exists in the docs but it's not easy to find in one place.
