
fix: resolve relative URLs against parent dir when base_fork_url is an HTML file #5278

Merged
liuruibin merged 1 commit into 1Panel-dev:v2 from dracpet:fix/web-crawl-html-base-url
May 22, 2026

Conversation

Contributor

@dracpet dracpet commented May 21, 2026

Bug: Web site crawl resolves relative links incorrectly when base_fork_url is an HTML file

Problem

When crawling a documentation site where pages link to index.html (extremely common — breadcrumbs, logos, "Home" links), MaxKB follows that link. The new Fork instance gets base_fork_url ending in .html (e.g. https://docs.dolphindb.com/en/index.html).

The reset_url function then appends /field_value/ to the base URL and calls urljoin(..., "."). Since index.html/ looks like a directory to urljoin, every relative link on the page gets index.html/ injected into its path:

Input:  base_fork_url = https://docs.dolphindb.com/en/index.html
        field_value   = about_dolphindb.html

Old code:
  urljoin('https://docs.dolphindb.com/en/index.html/about_dolphindb.html/', '.')
  → https://docs.dolphindb.com/en/index.html/about_dolphindb.html  ← WRONG, 404

Expected:
  https://docs.dolphindb.com/en/about_dolphindb.html  ← CORRECT

This cascades: every child page that links to index.html produces a poisoned crawl level where all further relative links 404, silently killing the scrape with zero content on most pages.
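The failure can be reproduced with the standard library alone. The sketch below approximates the pre-fix behaviour as the description above states it (append `/field_value/`, then call `urljoin(..., ".")`); `old_resolve` is a hypothetical name, the trailing-slash strip is an assumption made to match the URL shown in the example, and none of this is the actual MaxKB source.

```python
from urllib.parse import urljoin


def old_resolve(base_fork_url: str, field_value: str) -> str:
    """Approximate the pre-fix resolution described in this PR (hypothetical helper)."""
    # Append "/field_value/" to the base, then normalise with urljoin(..., ".").
    joined = urljoin(base_fork_url + "/" + field_value + "/", ".")
    # Assumption: the trailing slash is dropped so the result looks like a page URL.
    return joined.rstrip("/")


html_base = "https://docs.dolphindb.com/en/index.html"
dir_base = "https://docs.dolphindb.com/en/"

# With an HTML-file base, "index.html" is kept as if it were a directory segment.
broken = old_resolve(html_base, "about_dolphindb.html")

# With a directory base, the same approach yields the intended URL.
fine = old_resolve(dir_base, "about_dolphindb.html")

# Standard relative resolution against the page URL gives the expected result.
expected = urljoin(html_base, "about_dolphindb.html")

print(broken)
print(fine)
print(expected)
```

The directory case works because the doubled slash collapses during normalisation, which is probably why the problem only appears once a crawl follows a link to an `.html` page.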

Reproduction

  1. Create a Web Site knowledge base with URL https://docs.dolphindb.com/en/ and selector body
  2. Sync — the index page discovers index.html as a child link
  3. All child links from index.html resolve to broken paths like /en/index.html/getting_started.html → 404
  4. Result: ~2 documents with content, 130+ empty/error

Fix

Two changes in apps/common/utils/fork.py:

1. reset_url (lines 114-124): When base_fork_url ends in .html/.htm, resolve relative links against the parent directory instead of the file path.

2. get_child_link_list (lines 95-99): Use a crawl_prefix (parent directory for HTML files) for the link filter, so correctly-resolved child URLs aren't filtered out by the startswith(base_fork_url) check.
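The two changes can be illustrated with a minimal sketch. This is not the merged diff: `crawl_prefix` is the only name taken from the description, while `HTML_SUFFIXES`, `resolve_link`, and `is_child_link` are hypothetical names introduced here, and the sketch assumes base URLs carry no query string or fragment.

```python
from urllib.parse import urljoin

# Hypothetical constant: suffixes that mark the base URL as a file, not a directory.
HTML_SUFFIXES = (".html", ".htm")


def crawl_prefix(base_fork_url: str) -> str:
    """Return the parent directory for HTML-file bases, otherwise the base unchanged."""
    if base_fork_url.lower().endswith(HTML_SUFFIXES):
        # Drop the final path segment ("index.html") and keep the trailing slash.
        return base_fork_url.rsplit("/", 1)[0] + "/"
    return base_fork_url


def resolve_link(base_fork_url: str, field_value: str) -> str:
    """Resolve a link found on the page against the directory that contains it."""
    prefix = crawl_prefix(base_fork_url)
    if not prefix.endswith("/"):
        # Treat a slash-less base such as ".../en" as a directory.
        prefix += "/"
    return urljoin(prefix, field_value)


def is_child_link(base_fork_url: str, link: str) -> bool:
    """Link filter: keep links under the crawl prefix rather than under the file URL."""
    return link.startswith(crawl_prefix(base_fork_url))


base = "https://docs.dolphindb.com/en/index.html"
child = resolve_link(base, "about_dolphindb.html")
print(child)
print(is_child_link(base, child))   # passes the prefix-based filter
print(child.startswith(base))       # the startswith(base_fork_url) check would reject it
```

The last two lines show why the second change is needed: once links resolve correctly, they no longer start with the `.html` base URL, so a filter keyed on that URL would discard every one of them.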

Verification

  • Applied to MaxKB v2.x, scraped https://docs.dolphindb.com/en/ with selector main → 800+ documents, 0 broken URLs
  • All child links resolve correctly regardless of whether the current page URL ends in .html
  • No regression: directory-based URLs (e.g. /en/) resolve identically to before

Important: If this PR is acceptable, please also review the depth parameter. The hardcoded depth=2 in sync_web_knowledge and sync_replace_web_knowledge limits most knowledge base crawls to 2 hops. Consider making it configurable or increasing the default: a depth of 3 uncovered 6x more documents in our test case.

…n HTML file

When crawling a site where pages link to index.html, the Fork instance
gets base_fork_url ending in .html. The reset_url function then appends
/field_value/ and calls urljoin(..., '.'), which treats index.html/ as
a directory. Every relative link gets index.html/ injected into its path
(e.g. .../en/index.html/about_dolphindb.html), causing 404 cascade.

Fix: in reset_url, resolve against parent dir for .html/.htm base URLs.
In get_child_link_list, use crawl_prefix (parent dir for HTML files)
for the link filter so correctly-resolved URLs aren't filtered out.

Verified: scraped docs.dolphindb.com/en/ with selector 'main' → 800+
documents, 0 broken URLs.

f2c-ci-robot Bot commented May 21, 2026

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


f2c-ci-robot Bot commented May 21, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@liuruibin liuruibin merged commit 2216002 into 1Panel-dev:v2 May 22, 2026
1 of 3 checks passed

2 participants