-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Adjust LinkContentFetcher run method, use ByteStream #5972
Conversation
949c428
to
de23db6
Compare
@julian-risch now after we sorted out #5933 this one should be good to review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great to see this adapted to the new ByteStream data class. 👍 I have a couple of smaller change requests. We should also create release notes for this PR. Not only the return type changed but also the input is a list now and the fetching uses multithreading.
Thanks for the review @julian-risch , please have another look |
@julian-risch and @ZanSara a bit more context for you with these last few changes. During experiments with the actual web retriever pipeline, I realized that we needed to change the component to return the list of streams, where each stream has mime/content type in metadata rather than returning a dict of stream lists. Otherwise, this component would be similar to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks quite good, just two change requests. The first one is a small change in many of the tests. Let's add checks for the correct content type value in metadata to all the tests.
Second, let's rename v2
in test_link_content_fetcher_multiple_different_content_types_v2
so that it better describes how the test differs from test_link_content_fetcher_multiple_different_content_types
.
My understanding after also having a quick look at #5998 is that the reason for setting the content type in the metadata and having only a single output edge is that the splitting into multiple edges can already be handled by FileTypeRouter. Having a short usage example in the docstring where users can see that both components work well together would be great. If possible, an end-to-end test should test them in combination in a full pipeline when the PRs are merged.
assert "Introduction to Haystack" in document.text | ||
assert document.metadata["url"] == HTML_URL | ||
streams = fetcher.run([HTML_URL])["streams"] | ||
assert "Haystack" in streams[0].data.decode("utf-8") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also add assert stream.metadata["content_type"] == "text/html"
here and for all other tests above and below with the corresponding content type.
Updated @julian-risch with 0cc9236 Note that imho it doesn't make sense to check content type for mocks so I didn't add additional content type checks everywhere |
Still makes sense to me to do the checks because we're just mocking requests. The extraction of the content type from the response is could still be checked. |
@vblagoje One more question that came up: If I have multiple input URLs, how do I know which ByteStream in the output corresponds to which URL? At the moment, I don't because there is no URL in the metadata, right? |
Great idea, see dd56a64 |
The logic in 9a69cc8 made the most sense to me. Perhaps we should return a list of successful and failed retrievals, wdyt? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! 👍
Why:
During web retriever pipeline setup and testing we realized a limitation in our
LinkContentFetcher
- it was only set up to fetch content from a single URL while our pipelines work with lists.What:
We've revamped the
LinkContentFetcher
to now handle a list of URLs.How can it be used:
The usage remains pretty much the same, but with the added flexibility. Now, instead of passing a single URL, you can pass a list of URLs to
LinkContentFetcher
Testing:
We've beefed up our unit tests to cover this new functionality. Additionally, some hands-on manual testing was done to ensure that the fetcher behaves correctly with multiple URLs.
Notes For Reviewer:
Be gentle :-)