Added the ability to crawl content links #15

Open · wants to merge 3 commits into main
Conversation

@994AK commented Nov 17, 2023

Hi, I've added a deep web crawler that extracts links as an extension to this article.

  • withoutSelector crawls internal and external links without a selector.
  • attributeWhitelist allows you to optimize performance by whitelisting the HTML attributes to be preserved.
  • isContentLink determines whether crawling internal and external links is enabled.
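A minimal sketch of how these options might extend the project's Config type (the base fields are assumptions drawn from the sample config later in this thread; the PR's actual types may differ):

type Config = {
  url: string;                   // starting URL (assumed base field)
  match: string;                 // glob pattern for pages to crawl (assumed)
  selector: string;              // selector for the content to scrape (assumed)
  maxPagesToCrawl: number;       // assumed base field
  outputFileName: string;        // assumed base field
  // Options added in this PR:
  isContentLink?: boolean;       // enable crawling links found inside the content
  withoutSelector?: string;      // area to scrape on linked pages without a fixed selector
  attributeWhitelist?: string[]; // HTML attributes to preserve when keeping markup
};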

I hope you find this PR useful. If you have any questions, please feel free to reach out to me.

nuxt.json Outdated

You should probably remove this file XD

Author

Yes, I have deleted it; please check again.

@994AK 994AK requested a review from Joehoel November 18, 2023 06:10
@vaibhavkumar-sf

PR Analysis

  • 🎯 Main theme: Added the ability to crawl content links
  • 📝 PR summary: This PR adds functionality to crawl content links as an extension to the existing article. It introduces new options to crawl internal and external links, optimize performance by whitelisting HTML attributes, and determine whether to enable crawling internal and external links.
  • 📌 Type of PR: Enhancement
  • 🧪 Relevant tests added: False
  • ⏱️ Estimated effort to review [1-5]: 3, because the PR introduces new functionality and requires reviewing the added code and logic.
  • 🔒 Security concerns: No security concerns found

PR Feedback

  • 💡 General suggestions:

    • It would be helpful to provide more context in the PR description about the purpose and use cases of the new options and how they can be configured.

    • Consider adding some inline comments to explain the purpose and functionality of the new code.

    • Ensure that the code follows consistent formatting and indentation throughout the file.

  • 🤖 Code feedback:

    • relevant file: src/main.ts
      suggestion: Consider extracting the logic for handling external links into a separate function for better modularity and readability. [important]
      relevant line: async function handleExternalLinks(

    • relevant file: src/main.ts
      suggestion: Add error handling for any potential exceptions that may occur during the processing of external links. [medium]
      relevant line: } catch (error) {

    • relevant file: src/main.ts
      suggestion: Consider using a more descriptive variable name instead of 'jsonData' to improve code readability. [medium]
      relevant line: const jsonData = {

    • relevant file: src/main.ts
      suggestion: Add input validation and error handling for the new options to ensure that they are properly configured. [medium]
      relevant line: withoutSelector: string;
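To illustrate the last suggestion, a hypothetical sketch of option validation (the validateConfig helper is not part of the PR; it only shows the kind of checks being suggested):

function validateConfig(config: Config): void {
  // isContentLink without an area to scrape on linked pages is likely a misconfiguration.
  if (config.isContentLink && !config.withoutSelector) {
    throw new Error("withoutSelector must be set when isContentLink is enabled");
  }
  // An explicitly empty whitelist would strip every attribute, which is probably unintended.
  if (config.attributeWhitelist && config.attributeWhitelist.length === 0) {
    throw new Error("attributeWhitelist must contain at least one attribute name");
  }
}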


@steve8708
Contributor

thanks for contributing @994AK!

my main question: how is this different from the current enqueueLinks code we have?

That is, the crawler already finds links and crawls them, so I'm interested in understanding what you found to be missing and what required you to create (from what I can tell) an additional implementation of what already exists

Perhaps sharing your use case would help contextualize this?

@994AK
Author

994AK commented Nov 21, 2023

I'm glad to see your reply, @steve8708.

Feature Introduction

My feature is primarily designed for the case where the scraped content contains external links: it lets the crawler follow those external links, store what it finds in a generated JSON file, and expand the content.

  • isContentLink - Whether to enable the feature: follow external links found within the content and add their information.
  • withoutSelector - Used for external links, whose layout is not fixed: selects the area to scrape; typically the main content is the most important.
  • attributeWhitelist - Preserves selected HTML attributes. I didn't strip the HTML structure because, when scraping, I want GPT to understand the structure for expanding queries and reading comprehension (see the sketch after this list).
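A minimal sketch of that attribute filtering using cheerio (an assumption: the PR's actual implementation may take a different approach):

import * as cheerio from "cheerio";

// Remove every attribute that is not in the whitelist, keeping the tag
// structure intact so GPT can still read the document's shape.
function whitelistAttributes(html: string, whitelist: string[]): string {
  const $ = cheerio.load(html);
  $("*").each((_, el) => {
    for (const name of Object.keys(el.attribs)) {
      if (!whitelist.includes(name)) {
        $(el).removeAttr(name);
      }
    }
  });
  return $.html();
}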

Use Case Example

export const config: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  isContentLink: true,
  withoutSelector: `main`,
  attributeWhitelist: ["href", "title"],
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};

When isContentLink is enabled, the crawler first accesses https://www.builder.io/c/docs/developers to retrieve the initial page and collect the links within it.

On each linked page, the content inside the withoutSelector selector is scraped and saved.

This material can then be stored for users as extended information.

I am currently out of town and unable to use a computer, so I apologize for any inconvenience.

What do you think of this feature idea? Looking forward to your response.

@steve8708
Contributor

I see, very interesting. So it sounds like there are a few things going on here:

  1. Allowing crawling of external links that are discovered. This can actually be done more simply by passing a configuration to crawlee:
await enqueueLinks({
  strategy: "all",
});

This would definitely be preferable to reimplementing the crawler's enqueuing logic ourselves.
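For context, a minimal sketch of where that call would sit in a crawlee request handler (the handler body is illustrative, not code from this repo):

import { PlaywrightCrawler } from "crawlee";

const crawler = new PlaywrightCrawler({
  async requestHandler({ enqueueLinks }) {
    // strategy "all" follows discovered links to any hostname, not just the
    // current one, covering the external-link crawling this PR adds by hand.
    await enqueueLinks({ strategy: "all" });
  },
});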

  2. Preserving HTML. In general, I've seen the OpenAI team say that just copying and pasting the contents of webpages without the HTML structure works fine. Additionally, the HTML can be quite bloated, and OpenAI does have limits on how large knowledge files can be.

Related: do we strip empty non-semantic tags? E.g. it would be good to remove all kinds of <div><div><span> from the HTML so we don't waste bytes on meaningless empty tags. A sketch of what that could look like follows.
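A minimal sketch of that stripping with cheerio (an assumption; the project may prefer a different HTML library). It loops because removing children can leave previously non-empty wrappers empty:

import * as cheerio from "cheerio";

function stripEmptyTags(html: string): string {
  const $ = cheerio.load(html);
  let removed: number;
  do {
    removed = 0;
    $("div, span").each((_, el) => {
      const node = $(el);
      // Remove elements with no child elements and no visible text.
      if (node.children().length === 0 && node.text().trim() === "") {
        node.remove();
        removed++;
      }
    });
  } while (removed > 0);
  return $.html();
}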

That said, preserving HTML could be a useful thing for people to be able to experiment with

I would propose perhaps changing the configuration to something like this

export const config: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  crawlExternalPages: false,
  includeHtml: false,
  htmlAttributeWhitelist: ['title', 'href'], // only needed if `includeHtml` is `true`
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};
  3. I'm not sure I fully understand what withoutSelector does; could you elaborate on that? Is the idea not to crawl any page that does not have a main tag (if that is the selector provided)?

@994AK
Author

994AK commented Nov 22, 2023

@steve8708 I think I can refactor the code to make it easier to understand and the functionality more complete.
Thank you for the suggestions; enqueueLinks does indeed seem to work much better than our custom implementation.

Regarding preserving HTML, I understand your point, and it could be useful for users to experiment with. Additionally, your idea of removing empty non-semantic tags is a good one, as it can help save unnecessary bytes.

Considering your suggestions, I have thought about making some adjustments to the configuration, as follows:

  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  crawlExternalPages: false,
  includeHtml: true, // Set includeHtml to true to preserve HTML
  htmlAttributeWhitelist: ['title', 'href'], // Only needed if includeHtml is true
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};

withoutSelector

As for withoutSelector, this option specifies an HTML selector: if a page contains a tag matching that selector, the page won't be crawled. For example, if you set withoutSelector to '.no-crawl', any page containing a tag with the class 'no-crawl' will be excluded from crawling. A sketch of that check follows.
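A minimal sketch of that exclusion check inside a crawlee Playwright request handler (the handler shape follows crawlee's API; everything else here is illustrative):

import { PlaywrightCrawler } from "crawlee";

const crawler = new PlaywrightCrawler({
  async requestHandler({ page, request, log }) {
    const withoutSelector = ".no-crawl"; // would come from the config
    // Skip pages that contain an element matching the exclusion selector.
    if ((await page.locator(withoutSelector).count()) > 0) {
      log.info(`Skipping ${request.url}: matches ${withoutSelector}`);
      return;
    }
    // ...scrape the page as usual...
  },
});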

I plan to refactor the code during my vacation and resubmit it. Thanks again for your suggestions and feedback, and I hope we can improve this project together!
