fix(crawler): relative URL handling on non-start pages #893

Merged Nov 12, 2024 (2 commits)


@mogery (Member) commented Nov 11, 2024

Fixes #821

Went for an easier fix. Just fixes the logic when adding relative URLs to the crawl from the site content. Was basing the new URL off of the wrong base URL.
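To illustrate the kind of bug described here, the following is a minimal Python sketch, not the actual patch. It assumes the crawler was resolving relative links against the crawl's start URL instead of the URL of the page where the link was found. The URLs and the `resolve_link` helper are hypothetical.

```python
from urllib.parse import urljoin


def resolve_link(href: str, page_url: str) -> str:
    """Resolve a link against the URL of the page it was found on."""
    return urljoin(page_url, href)


# Hypothetical URLs, for illustration only.
start_url = "https://example.com/"
page_url = "https://example.com/docs/guide/index.html"
href = "rank.html"  # relative link found in the content of page_url

wrong = urljoin(start_url, href)      # resolved against the crawl's start URL
right = resolve_link(href, page_url)  # resolved against the containing page

print(wrong)  # https://example.com/rank.html
print(right)  # https://example.com/docs/guide/rank.html
```

On the start page the two results coincide, which may be why the problem only shows up on non-start pages.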

@rafaelsideguide (Collaborator)

Hey @mogery, unfortunately, this doesn't fix the bug.

For the following example:

POST http://localhost:3002/v1/crawl HTTP/1.1
Authorization: Bearer fc-redacted
content-type: application/json

{
  "url": "https://docs.cleanlab.ai",
  "allowBackwardLinks": true
}

One of the pages I was expecting to find is https://docs.cleanlab.ai/stable/cleanlab/multilabel_classification/rank.html, but in the results, the crawler only retrieved the non-redirected URL https://docs.cleanlab.ai/cleanlab/multilabel_classification/rank.html (without /stable), which leads to a 404:

[Screenshot, 2024-11-12 09:31:48]

The base URL https://docs.cleanlab.ai redirects to https://docs.cleanlab.ai/stable/index.html through a non-DNS-based redirect (which we only catch after the first page response). This is causing the 404s.
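A rough Python sketch of this redirect case follows. It assumes the page contains a relative link such as `cleanlab/multilabel_classification/rank.html` (the real markup is not shown in this thread). `PageResponse` and `extract_links` are hypothetical names, not the crawler's real API.

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import urljoin


@dataclass
class PageResponse:
    """Hypothetical stand-in for a fetched page."""
    requested_url: str  # URL the crawler asked for
    final_url: str      # URL after any HTTP redirects
    hrefs: List[str]    # raw href values found in the page content


def extract_links(resp: PageResponse) -> List[str]:
    """Resolve each href against the post-redirect URL of the page."""
    return [urljoin(resp.final_url, href) for href in resp.hrefs]


first_page = PageResponse(
    requested_url="https://docs.cleanlab.ai",
    final_url="https://docs.cleanlab.ai/stable/index.html",
    # Assumed relative link, for illustration only.
    hrefs=["cleanlab/multilabel_classification/rank.html"],
)

# Resolving against the requested URL drops /stable and yields the 404 URL.
print(urljoin(first_page.requested_url, first_page.hrefs[0]))
# Resolving against the final URL keeps /stable.
print(extract_links(first_page)[0])
```

Since the redirect target is only known once the first response arrives, the base used for link resolution would have to be taken from the response rather than from the original request.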

@mogery (Member, Author) commented Nov 12, 2024

My bad, forgot about that case. Will be fixing after standup.

@mogery (Member, Author) commented Nov 12, 2024

@rafaelsideguide should work now, retest pls

@rafaelsideguide (Collaborator)

Looking good. 1354 pages crawled for this URL now (no 404s, apparently :D). Let's merge it!

@mogery mogery merged commit fbabc77 into main Nov 12, 2024
1 check failed
Successfully merging this pull request may close these issues.

[Bug/Investigation] Some relative paths are incomplete