How to extract html page metadata from HTMLToDocument converter #8781
I'm looking at using the
My issue is that this metadata extracted from the web page ends up being put into the

I thought I might be able to do this by setting the

Has anyone done this kind of thing before, or does anyone have an idea of how it could/should be done? Any help would be greatly appreciated :)
Replies: 1 comment 1 reply
Hello, I understand your issue, but it seems more related to how Trafilatura handles the `with_metadata` argument.

! pip install trafilatura
import trafilatura
! wget https://haystack.deepset.ai/
with open("index.html", "r") as f:
    html = f.read()
print(trafilatura.extract(html, with_metadata=True))
As you can see, directly using Trafilatura gives the same result (metadata included in text).

I would recommend further exploring Trafilatura and creating a Haystack custom component (easy) to implement your logic.
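As a starting point for that logic, here is a minimal, stdlib-only sketch of one way to pull the metadata back out of the extracted text. It assumes the text begins with a `---` delimited header of `key: value` lines followed by the body, which may not match what your Trafilatura version emits, so check the real output first. The function name `split_metadata_header` and the sample string are hypothetical and are not part of Trafilatura or Haystack.

```python
def split_metadata_header(extracted: str) -> tuple[dict[str, str], str]:
    """Split a leading '---' delimited 'key: value' header from the body.

    Returns (metadata, body). If no complete header is found, returns
    ({}, extracted) so the text is passed through unchanged.
    ASSUMPTION: the header layout is a guess about the extractor output.
    """
    lines = extracted.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, extracted

    meta: dict[str, str] = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter: everything after it is the body text.
            body = "\n".join(lines[index + 1:])
            return meta, body.lstrip("\n")
        # Split on the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()

    # No closing delimiter was found: treat the input as plain text.
    return {}, extracted


# Hypothetical sample input, written by hand for illustration.
sample = (
    "---\n"
    "title: Haystack\n"
    "url: https://haystack.deepset.ai/\n"
    "---\n"
    "Build LLM applications."
)
meta, body = split_metadata_header(sample)
print(meta)
print(body)
```

Inside a custom component, the returned dictionary could then go into the Haystack `Document`'s `meta` field and the remaining text into its `content` field, rather than keeping both in the text.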