How to extract html page metadata from HTMLToDocument converter #8781

matthewcoole · 2025-01-28T16:11:21Z

I'm looking at using the HTMLToDocument converter to extract data from a few simple webpages. HTMLToDocument seems to make use of trafilatura and I've used extraction_kwargs to make trafilatura extract additional pieces of metadata from the page (title, author, date etc.):

HTMLToDocument(extraction_kwargs={"with_metadata": True})

My issue is that the metadata extracted from the web page ends up in the content of the resulting Document, when I'd really like it to end up either as keys on the Document or under a meta key.

I thought I could do this by setting the output_format of HTMLToDocument to "json" and piping the output into JSONConverter, but JSONConverter expects ByteStreams, not Documents, as input. I've also looked at OutputAdapter, but I can't tell from the documentation whether this is possible.

Has anyone done this kind of thing before or has an idea of how it could/should be done?

Any help would be greatly appreciated :)

Answered by anakin87

anakin87 · 2025-01-28T16:54:45Z

Maintainer

Hello, I understand your issue but it seems more related to how Trafilatura handles the with_metadata argument.

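# Notebook cell: install Trafilatura
! pip install trafilatura

import trafilatura

# Download the page; wget saves it as index.html
! wget https://haystack.deepset.ai/

with open("index.html", "r", encoding="utf-8") as f:
    html = f.read()

# with_metadata=True adds the page metadata as a header above the extracted text
print(trafilatura.extract(html, with_metadata=True))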

---
title: Haystack | Haystack
description: Haystack, the composable open-source AI framework
sitename: Haystack
date: 2025-01-01
---
Highly
customizable
Don’t just use Haystack, build on top of it...

As you can see, directly using Trafilatura gives the same result (metadata included in text).
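Since the metadata arrives as a header inside the text, one possible workaround is to split it back out after extraction. The sketch below is only an illustration: it assumes the header always has the `---` delimited `key: value` layout seen in the output above, which is inferred from that one example rather than from any documented guarantee. The helper name `split_front_matter` is hypothetical.

```python
from typing import Dict, Tuple


def split_front_matter(text: str) -> Tuple[Dict[str, str], str]:
    """Split a '---' delimited 'key: value' header from the body text."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no header: treat everything as content
    try:
        # Index of the closing '---' line
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        return {}, text  # unterminated header: leave the text untouched
    meta: Dict[str, str] = {}
    for line in lines[1:end]:
        # Split on the first colon only, so values may contain colons
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:])
    return meta, body


sample = """---
title: Haystack | Haystack
description: Haystack, the composable open-source AI framework
sitename: Haystack
date: 2025-01-01
---
Highly
customizable"""

meta, body = split_front_matter(sample)
print(meta)
print(body)
```

The resulting dictionary could then be passed as `meta` when building a Document, with `body` as its content.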

I would recommend further exploring Trafilatura and creating a Haystack custom component (easy) to implement your logic.

1 reply

matthewcoole Jan 29, 2025
Author

Thanks @anakin87. Creating a custom component was my first thought - I just imagined I might be able to wrangle what was already there to do what I want. Thanks again!
