Added the ability to crawl content links #15
base: main
Conversation
nuxt.json (Outdated)
You should probably remove this file XD
Translation: Yes, I have deleted it, please check again.
thanks for contributing @994AK! my main question: how is this different from the current enqueueLinks code we have? Aka, the crawler already finds links and crawls them, so I'm interested in understanding what you find to be missing and what required you to create (from what I can tell) an additional implementation of what already exists. Perhaps sharing your use case would help contextualize this?
I'm glad to see your reply, @steve8708.

**Feature Introduction**

My feature is primarily designed for the case where the scraped content itself contains external links: it lets the crawler follow those links further, store them in a generated JSON file, and expand the content.
**Use Case Example**

```ts
export const config: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  isContentLink: true,
  withoutSelector: `main`,
  attributeWhitelist: ["href", "title"],
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};
```

When `isContentLink` is enabled, the crawler first accesses https://www.builder.io/c/docs/developers to retrieve the initial page and fetch the links within it. The content within the `withoutSelector` selector is scraped and saved, and this material can then be kept for users as extended information. I am currently out of town and unable to use a computer, so I apologize for any inconvenience. What do you think of this feature idea? Looking forward to your response.
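For context, a minimal sketch of how such a content-link pass could work, assuming Playwright (hypothetical code, not the actual PR implementation; `crawlContentLinks` and `ContentLinkResult` are names invented for illustration):

```ts
import { Page } from "playwright";

// Hypothetical shape for the extra data collected per content link.
interface ContentLinkResult {
  url: string;
  title: string;
  html: string;
}

// Illustrative only: collect the links inside the scraped content,
// visit each one, and gather its content for the JSON output.
async function crawlContentLinks(
  page: Page,
  selector: string
): Promise<ContentLinkResult[]> {
  // Gather hrefs from anchors inside the content selector first,
  // since navigating away will replace the current page.
  const links = await page.$$eval(`${selector} a[href]`, (anchors) =>
    anchors.map((a) => (a as HTMLAnchorElement).href)
  );

  const results: ContentLinkResult[] = [];
  for (const url of links) {
    await page.goto(url);
    results.push({
      url,
      title: await page.title(),
      html: await page.content(),
    });
  }
  return results;
}
```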
I see, very interesting. So it sounds like there are a few things going on here:

```ts
await enqueueLinks({
  strategy: "all",
});
```

This definitely would be preferable as opposed to reimplementing the crawler's enqueuing logic ourselves.
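For reference, a minimal sketch of that approach using Crawlee's built-in link discovery (a sketch assuming a `PlaywrightCrawler`; `strategy: "all"` also enqueues links pointing to external domains):

```ts
import { PlaywrightCrawler } from "crawlee";

const crawler = new PlaywrightCrawler({
  maxRequestsPerCrawl: 50,
  async requestHandler({ page, enqueueLinks, pushData }) {
    // Save some data for this page, then let Crawlee discover and
    // enqueue every link on it, including links to other domains.
    await pushData({ url: page.url(), title: await page.title() });
    await enqueueLinks({ strategy: "all" });
  },
});

await crawler.run(["https://www.builder.io/c/docs/developers"]);
```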
Related: do we strip empty non-semantic tags? E.g., it would be good to remove all kinds of empty `<div>` and `<span>` wrappers. That said, preserving HTML could be a useful thing for people to be able to experiment with. I would propose perhaps changing the configuration to something like this:

```ts
export const config: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  crawlExternalPages: false,
  includeHtml: false,
  htmlAttributeWhitelist: ["title", "href"], // only needed if `includeHtml` is `true`
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};
```
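To make the stripping idea concrete, here is a sketch using cheerio (an assumption; the project could implement this differently), removing empty `div`/`span` wrappers and dropping attributes outside a whitelist:

```ts
import * as cheerio from "cheerio";

// Sketch: strip empty non-semantic wrappers and non-whitelisted
// attributes to save unnecessary bytes in the stored HTML.
function cleanHtml(html: string, attributeWhitelist: string[]): string {
  const $ = cheerio.load(html);

  // Remove empty <div>/<span> elements repeatedly, since removing a
  // child can leave its parent empty as well.
  let removed: number;
  do {
    removed = 0;
    $("div, span").each((_, el) => {
      const node = $(el);
      if (node.children().length === 0 && node.text().trim() === "") {
        node.remove();
        removed++;
      }
    });
  } while (removed > 0);

  // Drop every attribute that is not in the whitelist.
  $("*").each((_, el) => {
    const node = $(el);
    for (const name of Object.keys(node.attr() ?? {})) {
      if (!attributeWhitelist.includes(name)) {
        node.removeAttr(name);
      }
    }
  });

  return $.html();
}
```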
I think I can refactor the code to make it easier to understand and make the functions more complete, @steve8708. Regarding preserving HTML, I understand your point, and it could be useful for users to experiment with. Additionally, your idea of removing empty non-semantic tags is a good one, as it can help save unnecessary bytes. Considering your suggestions, I have thought about making some adjustments to the configuration, as follows:

**withoutSelector**

This option is used to specify an HTML selector; if a page contains a tag that matches this selector, it won't be crawled. For example, if you set `withoutSelector` to `.no-crawl`, any page containing a tag with the class `no-crawl` will be excluded from crawling based on the conditions you provide (a sketch of this check follows below).

I plan to refactor the code during my vacation and resubmit it. Thanks again for your suggestions and feedback, and I hope we can improve this project together!
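A hypothetical sketch of that exclusion check, assuming a Playwright page (`shouldSkipPage` is an invented helper name, not the PR's code):

```ts
import { Page } from "playwright";

// Sketch: skip a page entirely if it contains any element matching
// the configured `withoutSelector` (e.g. ".no-crawl").
async function shouldSkipPage(
  page: Page,
  withoutSelector: string
): Promise<boolean> {
  return (await page.$(withoutSelector)) !== null;
}
```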
Hi, I've added a deep web crawler that extracts links as an extension to the article content.

- `withoutSelector`: crawls internal and external links without a selector.
- `attributeWhitelist`: allows you to optimize performance by whitelisting which HTML attributes are preserved.
- `isContentLink`: determines whether crawling of internal and external links is enabled.

I hope you find this PR useful. If you have any questions, please feel free to reach out to me.