feat: add "markdown" field type to extraction strategy schema (#1708) by hafezparast · Pull Request #1881 · unclecode/crawl4ai

hafezparast · 2026-03-28T16:41:35Z

Summary

Adds "markdown" as a new field type in JsonElementExtractionStrategy's type pipeline
Converts the selected element's HTML to clean markdown (preserving bold, links, lists, and headers) instead of returning raw HTML tags
Works across all strategy subclasses: JsonCssExtractionStrategy, JsonLxmlExtractionStrategy, JsonXPathExtractionStrategy
Composable in pipelines: ["markdown", "regex"] works naturally

Changes

crawl4ai/extraction_strategy.py: 8 lines — add elif step == "markdown" branch + CustomHTML2Text import
tests/test_markdown_field_type_1708.py: 16 tests covering CSS/lxml/XPath strategies, pipeline chaining, edge cases, JSON serialization
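To make the shape of the change concrete, here is a minimal, self-contained sketch of a type step that dispatches on "text", "html", and "markdown". It is not the code from crawl4ai/extraction_strategy.py: `apply_step`, `render`, and `_Renderer` are hypothetical names, and the tiny `HTMLParser`-based converter is only a stand-in for CustomHTML2Text that understands bold and italic tags.

```python
from html.parser import HTMLParser

# Markers for the few inline tags this toy converter understands.
MARKDOWN_MARKS = {"strong": "**", "b": "**", "em": "_", "i": "_"}


class _Renderer(HTMLParser):
    """Collects text, wrapping known tags in the given markers."""

    def __init__(self, marks):
        super().__init__()
        self.marks = marks
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(self.marks.get(tag, ""))

    def handle_endtag(self, tag):
        self.parts.append(self.marks.get(tag, ""))

    def handle_data(self, data):
        self.parts.append(data)


def render(element_html, marks):
    parser = _Renderer(marks)
    parser.feed(element_html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()


def apply_step(element_html, step):
    """Hypothetical, simplified version of one step in the type pipeline."""
    if step == "text":
        return render(element_html, {})  # drop all tags
    elif step == "html":
        return element_html  # raw markup, unchanged
    elif step == "markdown":  # the kind of branch this PR adds
        return render(element_html, MARKDOWN_MARKS)
    raise ValueError(f"unknown step: {step}")


SAMPLE = ("<p><strong>Best seller</strong> - Our most popular widget "
          "with <em>advanced features</em>.</p>")
for step in ("text", "html", "markdown"):
    print(step, "->", apply_step(SAMPLE, step))
```

The real branch delegates to CustomHTML2Text, which also handles links, lists, and headers; the sketch only shows where a "markdown" step sits relative to the existing "text" and "html" steps.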

Usage

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "articles",
    "baseSelector": "div.article",
    "fields": [
        {"name": "title", "selector": "h1", "type": "text"},
        {"name": "body", "selector": "div.content", "type": "markdown"},  # NEW
        {"name": "author", "selector": "span.author", "type": "text"},
    ],
}

strategy = JsonCssExtractionStrategy(schema)
```

Before (only options):

| Type | Output |
| --- | --- |
| `"text"` | `Best seller - Our most popular widget with advanced features.` |
| `"html"` | `<p><strong>Best seller</strong> - Our most popular widget with <em>advanced features</em>.</p>` |

After (new option):

| Type | Output |
| --- | --- |
| `"markdown"` | `**Best seller** - Our most popular widget with _advanced features_.` |

Also works in pipelines:

```python
# Extract markdown, then regex a pattern from it
{"name": "highlight", "selector": "p.desc", "type": ["markdown", "regex"], "pattern": r"\*\*(.+?)\*\*"}
```
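The pattern in that field only matches once bold text is rendered as `**...**`, which is why "markdown" has to come before "regex". The sketch below applies the same pattern to a markdown string and to a plain-text string using only the standard `re` module. The helper `first_group` is hypothetical, and it assumes the regex step returns the first capture group of the first match, which this PR does not spell out.

```python
import re

PATTERN = r"\*\*(.+?)\*\*"  # same pattern as the "highlight" field


def first_group(pattern, text):
    """Hypothetical helper: first capture group of the first match, else None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# What a "markdown" step is expected to hand to the next step.
markdown_value = "**Best seller** - Our most popular widget with _advanced features_."
# What a "text" step would hand over instead: no bold markers left to match.
text_value = "Best seller - Our most popular widget with advanced features."

highlight = first_group(PATTERN, markdown_value)
no_match = first_group(PATTERN, text_value)
print(highlight)  # Best seller
print(no_match)   # None
```

Swapping the order to ["regex", "markdown"] would run the pattern before any markdown markers exist, so the step order matters.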

Test plan

16 unit tests passing (pytest tests/test_markdown_field_type_1708.py -v)
Existing pipeline tests pass with no regressions (pytest tests/test_pr_1290_1668.py -v)
Verified across JsonCss, JsonLxml, and JsonXPath strategies
Verified JSON serialization of results works cleanly
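As an illustration of the serialization check, the sketch below round-trips a result containing markdown through the standard `json` module. The record is made up for this example and is not taken from the PR's test file. Markdown values are plain strings, so newlines and asterisks should survive encoding and decoding unchanged.

```python
import json

# Illustrative record only; the field values are invented for this sketch.
record = {
    "title": "Widget",
    "body": "**Best seller** - see [docs](https://example.com)\n\n  * item one\n  * item two",
}

encoded = json.dumps([record], ensure_ascii=False)
decoded = json.loads(encoded)

# The decoded structure should be identical to what was serialized.
round_trips = decoded == [record]
print(round_trips)
```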

🤖 Generated with Claude Code

…lecode#1708)

Add "markdown" as a new field type in the extraction schema type pipeline. When used, the selected element's HTML is converted to markdown via CustomHTML2Text, preserving formatting (bold, links, lists) without returning raw HTML tags.

Works across all strategy subclasses (CSS, lxml, XPath) and in pipelines (e.g., ["markdown", "regex"]).

Closes unclecode#1708

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

hafezparast wants to merge 1 commit into unclecode:main from hafezparast:feat/maysam-markdown-field-type-1708
