Added the ability to crawl content links #15

Open · wants to merge 3 commits into main
Conversation

@994AK commented Nov 17, 2023

Hi, I've added a deep web crawler that extracts links as an extension to this article.

  • withoutSelector crawls internal and external links without a selector.
  • attributeWhitelist allows you to optimize performance by whitelisting the HTML attributes to be preserved.
  • isContentLink determines whether crawling internal and external links is enabled.
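A minimal sketch of how these options might extend the project's Config type (the base fields are assumptions drawn from the sample config later in this thread; the PR's actual types may differ):

type Config = {
  url: string;                   // starting URL (assumed base field)
  match: string;                 // glob pattern for pages to crawl (assumed)
  selector: string;              // selector for the content to scrape (assumed)
  maxPagesToCrawl: number;       // assumed base field
  outputFileName: string;        // assumed base field
  // Options added in this PR:
  isContentLink?: boolean;       // enable crawling links found inside the content
  withoutSelector?: string;      // area to scrape on linked pages without a fixed selector
  attributeWhitelist?: string[]; // HTML attributes to preserve when keeping markup
};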

I hope you find this PR useful. If you have any questions, please feel free to reach out to me.

nuxt.json Outdated

You should probably remove this file XD

Author

Yes, I have deleted it; please check again.

@994AK 994AK requested a review from Joehoel November 18, 2023 06:10
@vaibhavkumar-sf

PR Analysis

  • 🎯 Main theme: Added the ability to crawl content links
  • 📝 PR summary: This PR adds functionality to crawl content links as an extension to the existing article. It introduces new options to crawl internal and external links, optimize performance by whitelisting HTML attributes, and determine whether to enable crawling internal and external links.
  • 📌 Type of PR: Enhancement
  • 🧪 Relevant tests added: False
  • ⏱️ Estimated effort to review [1-5]: 3, because the PR introduces new functionality and requires reviewing the added code and logic.
  • 🔒 Security concerns: No security concerns found

PR Feedback

  • 💡 General suggestions:

    • It would be helpful to provide more context in the PR description about the purpose and use cases of the new options and how they can be configured.

    • Consider adding some inline comments to explain the purpose and functionality of the new code.

    • Ensure that the code follows consistent formatting and indentation throughout the file.

  • 🤖 Code feedback:

    • relevant file: src/main.ts
      suggestion: Consider extracting the logic for handling external links into a separate function for better modularity and readability. [important]
      relevant line: async function handleExternalLinks(

    • relevant file: src/main.ts
      suggestion: Add error handling for any potential exceptions that may occur during the processing of external links. [medium]
      relevant line: } catch (error) {

    • relevant file: src/main.ts
      suggestion: Consider using a more descriptive variable name instead of 'jsonData' to improve code readability. [medium]
      relevant line: const jsonData = {

    • relevant file: src/main.ts
      suggestion: Add input validation and error handling for the new options to ensure that they are properly configured. [medium]
      relevant line: withoutSelector: string;
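To illustrate the last suggestion, a hypothetical sketch of option validation (the validateConfig helper is not part of the PR; it only shows the kind of checks being suggested):

function validateConfig(config: Config): void {
  // isContentLink without an area to scrape on linked pages is likely a misconfiguration.
  if (config.isContentLink && !config.withoutSelector) {
    throw new Error("withoutSelector must be set when isContentLink is enabled");
  }
  // An explicitly empty whitelist would strip every attribute, which is probably unintended.
  if (config.attributeWhitelist && config.attributeWhitelist.length === 0) {
    throw new Error("attributeWhitelist must contain at least one attribute name");
  }
}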


@steve8708
Contributor

thanks for contributing @994AK!

my main question: how is this different from the current enqueueLinks code we have?

That is, the crawler already finds links and crawls them, so I'm interested in understanding what you found to be missing and what required you to create (from what I can tell) an additional implementation of what already exists

Perhaps sharing your use case would help contextualize this?

@994AK
Author

994AK commented Nov 21, 2023

I'm glad to see your reply, @steve8708.

Feature Introduction

My feature is primarily designed for the case where the scraped content contains external links: it lets the crawler follow those external links, store what it finds in a generated JSON file, and expand the content.

  • isContentLink - Whether to enable the feature: follow external links found within the content and add their information.
  • withoutSelector - Used for external links, whose layout is not fixed: selects the area to scrape; typically the main content is the most important.
  • attributeWhitelist - Preserves selected HTML attributes. I didn't strip the HTML structure because, when scraping, I want GPT to understand the structure for expanding queries and reading comprehension (see the sketch after this list).
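A minimal sketch of that attribute filtering using cheerio (an assumption: the PR's actual implementation may take a different approach):

import * as cheerio from "cheerio";

// Remove every attribute that is not in the whitelist, keeping the tag
// structure intact so GPT can still read the document's shape.
function whitelistAttributes(html: string, whitelist: string[]): string {
  const $ = cheerio.load(html);
  $("*").each((_, el) => {
    for (const name of Object.keys(el.attribs)) {
      if (!whitelist.includes(name)) {
        $(el).removeAttr(name);
      }
    }
  });
  return $.html();
}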

Use Case Example

export const config: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  isContentLink: true,
  withoutSelector: `main`,
  attributeWhitelist: ["href", "title"],
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};

When isContentLink is enabled, the crawler first accesses https://www.builder.io/c/docs/developers to retrieve the initial page and collect the links within it.

On each linked page, the content inside the withoutSelector selector is scraped and saved.

This material can then be stored for users as extended information.

I am currently out of town and unable to use a computer, so I apologize for any inconvenience.

What do you think of this feature idea? Looking forward to your response.

@steve8708
Contributor

I see, very interesting. So it sounds like there are a few things going on here:

  1. Allowing crawling of external links that are discovered. This can actually be done more simply by passing a configuration to crawlee:
await enqueueLinks({
  strategy: "all",
});

This would definitely be preferable to reimplementing the crawler's enqueuing logic ourselves.
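For context, a minimal sketch of where that call would sit in a crawlee request handler (the handler body is illustrative, not code from this repo):

import { PlaywrightCrawler } from "crawlee";

const crawler = new PlaywrightCrawler({
  async requestHandler({ enqueueLinks }) {
    // strategy "all" follows discovered links to any hostname, not just the
    // current one, covering the external-link crawling this PR adds by hand.
    await enqueueLinks({ strategy: "all" });
  },
});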

  2. Preserving HTML. In general, I've seen the OpenAI team say that just copying and pasting the contents of webpages without the HTML structure works fine. Additionally, the HTML can be quite bloated, and OpenAI does have limits on how large knowledge files can be.

Related: do we strip empty non-semantic tags? E.g. it would be good to remove all kinds of <div><div><span> from the HTML so we don't waste bytes on meaningless empty tags. A sketch of what that could look like follows.
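A minimal sketch of that stripping with cheerio (an assumption; the project may prefer a different HTML library). It loops because removing children can leave previously non-empty wrappers empty:

import * as cheerio from "cheerio";

function stripEmptyTags(html: string): string {
  const $ = cheerio.load(html);
  let removed: number;
  do {
    removed = 0;
    $("div, span").each((_, el) => {
      const node = $(el);
      // Remove elements with no child elements and no visible text.
      if (node.children().length === 0 && node.text().trim() === "") {
        node.remove();
        removed++;
      }
    });
  } while (removed > 0);
  return $.html();
}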

That said, preserving HTML could be a useful thing for people to be able to experiment with

I would propose perhaps changing the configuration to something like this

export const config: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  crawlExternalPages: false,
  includeHtml: false,
  htmlAttributeWhitelist: ['title', 'href'], // only needed if `includeHtml` is `true`
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};
  3. I'm not sure I fully understand what withoutSelector does; could you elaborate on that? Is the idea not to crawl any page that does not have a main tag (if that is the selector provided)?

@994AK
Author

994AK commented Nov 22, 2023

@steve8708 I think I can refactor the code to make it easier to understand and the functionality more complete.
Thank you for the suggestions; enqueueLinks does indeed seem to work much better than our custom implementation.

Regarding preserving HTML, I understand your point, and it could be useful for users to experiment with. Additionally, your idea of removing empty non-semantic tags is a good one, as it can help save unnecessary bytes.

Considering your suggestions, I have thought about making some adjustments to the configuration, as follows:

  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  crawlExternalPages: false,
  includeHtml: true, // Set includeHtml to true to preserve HTML
  htmlAttributeWhitelist: ['title', 'href'], // Only needed if includeHtml is true
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};

withoutSelector

As for withoutSelector, this option specifies an HTML selector: if a page contains a tag matching that selector, the page won't be crawled. For example, if you set withoutSelector to '.no-crawl', any page containing a tag with the class 'no-crawl' will be excluded from crawling. A sketch of that check follows.
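A minimal sketch of that exclusion check inside a crawlee Playwright request handler (the handler shape follows crawlee's API; everything else here is illustrative):

import { PlaywrightCrawler } from "crawlee";

const crawler = new PlaywrightCrawler({
  async requestHandler({ page, request, log }) {
    const withoutSelector = ".no-crawl"; // would come from the config
    // Skip pages that contain an element matching the exclusion selector.
    if ((await page.locator(withoutSelector).count()) > 0) {
      log.info(`Skipping ${request.url}: matches ${withoutSelector}`);
      return;
    }
    // ...scrape the page as usual...
  },
});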

I plan to refactor the code during my vacation and resubmit it. Thanks again for your suggestions and feedback, and I hope we can improve this project together!
