
feat: add "markdown" field type to extraction strategy schema (#1708) #1881

Open
hafezparast wants to merge 1 commit into unclecode:main from hafezparast:feat/maysam-markdown-field-type-1708


@hafezparast
Contributor

Summary

  • Adds "markdown" as a new field type in JsonElementExtractionStrategy's type pipeline
  • Converts selected element's HTML to clean markdown (preserves bold, links, lists, headers) without returning raw HTML tags
  • Works across all strategy subclasses: JsonCssExtractionStrategy, JsonLxmlExtractionStrategy, JsonXPathExtractionStrategy
  • Composable in pipelines: ["markdown", "regex"] works naturally

Closes #1708

Changes

  • crawl4ai/extraction_strategy.py: 8 lines — add elif step == "markdown" branch + CustomHTML2Text import
  • tests/test_markdown_field_type_1708.py: 16 tests covering CSS/lxml/XPath strategies, pipeline chaining, edge cases, JSON serialization
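The branch described above can be pictured with a simplified, standalone dispatcher. This is only a sketch of the idea, not the code in `crawl4ai/extraction_strategy.py`: the function name `run_type_pipeline`, its parameters, and the `stub_markdown` converter are hypothetical, and the behavior of the `"regex"` step (return the first capture group, or `None`) is an assumption.

```python
import re


def run_type_pipeline(field, html, text, html_to_markdown):
    """Apply each step in field["type"] in order (hypothetical helper)."""
    steps = field["type"]
    if isinstance(steps, str):
        # A single type such as "markdown" is treated as a one-step pipeline.
        steps = [steps]

    value = None
    for step in steps:
        if step == "text":
            value = text
        elif step == "html":
            value = html
        elif step == "markdown":
            # The new branch: convert the element's HTML instead of returning it raw.
            value = html_to_markdown(html).strip()
        elif step == "regex":
            # Assumed behavior: keep the first capture group, or None if no match.
            match = re.search(field["pattern"], value or "")
            value = match.group(1) if match else None
        if value is None:
            break
    return value


def stub_markdown(html):
    """Stand-in converter for the sketch; handles only <p> and <strong>."""
    without_p = re.sub(r"</?p>", "", html)
    return re.sub(r"</?strong>", "**", without_p)


html = "<p><strong>Sale</strong> today</p>"
single = run_type_pipeline({"type": "markdown"}, html, "Sale today", stub_markdown)
chained = run_type_pipeline(
    {"type": ["markdown", "regex"], "pattern": r"\*\*(.+?)\*\*"},
    html,
    "Sale today",
    stub_markdown,
)
```

Passing the converter in as a callable keeps the sketch independent of any particular HTML-to-markdown library.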

Usage

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "articles",
    "baseSelector": "div.article",
    "fields": [
        {"name": "title", "selector": "h1", "type": "text"},
        {"name": "body", "selector": "div.content", "type": "markdown"},  # NEW
        {"name": "author", "selector": "span.author", "type": "text"},
    ],
}

strategy = JsonCssExtractionStrategy(schema)

Before (only options):

| Type | Output |
| --- | --- |
| "text" | `Best seller - Our most popular widget with advanced features.` |
| "html" | `<p><strong>Best seller</strong> - Our most popular widget with <em>advanced features</em>.</p>` |

After (new option):

| Type | Output |
| --- | --- |
| "markdown" | `**Best seller** - Our most popular widget with _advanced features_.` |
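To make the conversion concrete without installing anything, here is a toy converter built on the standard library's `html.parser`. It is not `CustomHTML2Text` and covers only bold, italics, and links; the class name `TinyMarkdown` and the function `to_markdown` are hypothetical names used for illustration.

```python
from html.parser import HTMLParser


class TinyMarkdown(HTMLParser):
    """Toy converter: handles only <strong>/<b>, <em>/<i>, and <a href>."""

    MARKS = {"strong": "**", "b": "**", "em": "_", "i": "_"}

    def __init__(self):
        super().__init__()
        self.parts = []   # output fragments, joined at the end
        self.hrefs = []   # stack of link targets for open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag in self.MARKS:
            self.parts.append(self.MARKS[tag])
        elif tag == "a":
            self.hrefs.append(dict(attrs).get("href") or "")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in self.MARKS:
            self.parts.append(self.MARKS[tag])
        elif tag == "a" and self.hrefs:
            self.parts.append(f"]({self.hrefs.pop()})")

    def handle_data(self, data):
        self.parts.append(data)


def to_markdown(html: str) -> str:
    parser = TinyMarkdown()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts).strip()


html = (
    "<p><strong>Best seller</strong> - Our most popular widget with "
    "<em>advanced features</em>.</p>"
)
markdown = to_markdown(html)
print(markdown)
```

Unknown tags such as `<p>` are dropped while their text is kept, which is why the result carries formatting markers but no raw HTML.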

Also works in pipelines:

# Extract markdown, then regex a pattern from it
{"name": "highlight", "selector": "p.desc", "type": ["markdown", "regex"], "pattern": r"\*\*(.+?)\*\*"}
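The value of chaining can be seen by running the same pattern against the `"markdown"` and `"text"` outputs shown earlier. This sketch uses only the standard `re` module and assumes that the `"regex"` step keeps the first capture group; the variable names are illustrative.

```python
import re

pattern = r"\*\*(.+?)\*\*"

# Outputs for the same element under the "markdown" and "text" types.
markdown_value = "**Best seller** - Our most popular widget with _advanced features_."
text_value = "Best seller - Our most popular widget with advanced features."

# On markdown, the bold markers survive, so the pattern can isolate the bold span.
match = re.search(pattern, markdown_value)
highlight = match.group(1) if match else None

# On plain text, the markers are gone and the same pattern finds nothing.
text_match = re.search(pattern, text_value)
```

Here `highlight` is the bold phrase, while `text_match` is `None`, so a `["text", "regex"]` pipeline could not recover it with this pattern.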

Test plan

  • 16 unit tests passing (pytest tests/test_markdown_field_type_1708.py -v)
  • Existing pipeline tests pass with no regressions (pytest tests/test_pr_1290_1668.py -v)
  • Verified across JsonCss, JsonLxml, and JsonXPath strategies
  • Verified JSON serialization of results works cleanly

🤖 Generated with Claude Code

…lecode#1708)

Add "markdown" as a new field type in the extraction schema type pipeline.
When used, the selected element's HTML is converted to markdown via
CustomHTML2Text, preserving formatting (bold, links, lists) without
returning raw HTML tags.

Works across all strategy subclasses (CSS, lxml, XPath) and in
pipelines (e.g., ["markdown", "regex"]).

Closes unclecode#1708

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>