How to extract html page metadata from HTMLToDocument converter #8781


Hello, I understand your issue, but it seems to come from how Trafilatura handles the with_metadata argument rather than from the HTMLToDocument converter itself.

! pip install trafilatura

import trafilatura

! wget https://haystack.deepset.ai/

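# read the page saved by wget; explicit encoding avoids platform-dependent defaults
with open("index.html", "r", encoding="utf-8") as f: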
    html = f.read()

print(trafilatura.extract(html, with_metadata=True))
---
title: Haystack | Haystack
description: Haystack, the composable open-source AI framework
sitename: Haystack
date: 2025-01-01
---
Highly
customizable
Don’t just use Haystack, build on top of it...

As you can see, calling Trafilatura directly gives the same result: the metadata is not returned separately but is prepended to the extracted text as a header block between two --- lines.
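
If you want the metadata as a dictionary rather than inside the text, one possible workaround is to split that header off yourself. The snippet below is only a minimal sketch: split_metadata is a hypothetical helper name (not part of Trafilatura or Haystack), and it assumes the output always has the shape shown above, i.e. "key: value" lines between two --- delimiters followed by the body.

```python
from typing import Dict, Tuple


def split_metadata(extracted: str) -> Tuple[Dict[str, str], str]:
    """Split a '---' delimited header from the body of the extracted text.

    Hypothetical helper: assumes the header format shown in the output above.
    Returns (metadata, body). If no complete header is found, the metadata
    dict is empty and the text is returned unchanged.
    """
    lines = extracted.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, extracted

    meta: Dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # closing delimiter: everything after it is the body
            return meta, "\n".join(lines[i + 1:])
        # split on the first colon only, so values may contain colons
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()

    # no closing delimiter found: treat the whole input as plain text
    return {}, extracted


sample = (
    "---\n"
    "title: Haystack | Haystack\n"
    "sitename: Haystack\n"
    "date: 2025-01-01\n"
    "---\n"
    "Highly\n"
    "customizable\n"
)
meta, body = split_metadata(sample)
print(meta["title"])
print(body)
```

The resulting dictionary could then be passed as the meta of a Haystack Document inside a custom component, with body as its content.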

I would recommend further exploring Trafilatura and creating a Haystack custom component (easy) …

Answer selected by matthewcoole