
feat(document_loaders): add exclude_urls to SitemapLoader #56

Open: wants to merge 1 commit into base branch main


CaillaudPA

Description

This PR adds an exclude_urls parameter to the SitemapLoader class, allowing users to specify patterns for URLs that should be excluded from loading. This feature is particularly valuable for incremental content synchronization scenarios.

Key Use Case: Docusaurus Incremental Sync

When working with Docusaurus documentation sites, a common need is to only sync new or updated pages while skipping already processed content. This feature enables efficient incremental syncing by:

  1. Maintaining a list of already processed URLs
  2. Using exclude_urls to skip those URLs in subsequent runs
  3. Only loading and processing new or modified content
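
The bookkeeping behind these steps can be sketched roughly as follows. This is only an illustration: the state file name and the helpers `load_processed`, `save_processed`, and `to_exclude_patterns` are hypothetical and not part of this PR. Literal URLs are escaped so that characters such as `.` are not read as regex syntax.

```python
import json
import re
from pathlib import Path

# Hypothetical state file; the PR does not prescribe where processed URLs live.
STATE_FILE = Path("processed_urls.json")


def load_processed(path: Path = STATE_FILE) -> list[str]:
    """Return the URLs recorded by earlier runs (empty on the first run)."""
    if not path.exists():
        return []
    return json.loads(path.read_text())


def save_processed(urls: list[str], path: Path = STATE_FILE) -> None:
    """Persist the de-duplicated, sorted list of processed URLs."""
    path.write_text(json.dumps(sorted(set(urls)), indent=2))


def to_exclude_patterns(urls: list[str]) -> list[str]:
    """Turn literal URLs into regexes that match exactly those URLs."""
    # re.escape keeps '.', '?' and similar characters from acting as regex syntax;
    # the anchors prevent a URL from also excluding its sub-pages.
    return ["^" + re.escape(url) + "$" for url in urls]
```

On each run, `to_exclude_patterns(load_processed())` would be passed as `exclude_urls`, and the URLs of the newly loaded documents appended to the stored list afterwards.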

Example usage with DocusaurusLoader:

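from langchain_community.document_loaders import DocusaurusLoader

# Regex patterns covering URLs that were already processed
processed_urls = ["docs/tutorial/.*", "docs/api/v1/.*"]

# Only load content that matches none of the patterns above
loader = DocusaurusLoader(
    "https://my-docs.com",
    exclude_urls=processed_urls,  # parameter added by this PR
    continue_on_failure=True,
)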

Benefits

  • Reduced Processing Time: Skip already processed pages
  • Lower Resource Usage: Avoid redundant content downloads
  • Efficient Updates: Focus only on new or modified content
  • Flexible Pattern Matching: Use regex patterns for precise control

Implementation Details

  • Added exclude_urls parameter that accepts a list of regex patterns
  • Patterns are checked after domain restrictions and filter patterns
  • URLs matching any exclude pattern are skipped
  • Maintains backward compatibility with existing usage
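
The order of checks described above can be illustrated with a standalone predicate. This is a sketch, not the code in this PR: `should_load` and `_origin` are hypothetical names, and the use of `re.search` (match anywhere in the URL) is an assumption, since the matching semantics are not stated here.

```python
import re
from typing import Optional, Sequence, Tuple
from urllib.parse import urlparse


def _origin(url: str) -> Tuple[str, str]:
    """Return (scheme, host) so two URLs can be compared by site."""
    parts = urlparse(url)
    return parts.scheme, parts.netloc


def should_load(
    url: str,
    base_url: str,
    filter_urls: Optional[Sequence[str]] = None,
    exclude_urls: Optional[Sequence[str]] = None,
    restrict_to_same_domain: bool = True,
) -> bool:
    """Decide whether a sitemap URL should be loaded."""
    # 1. Domain restriction: drop URLs that point at another site.
    if restrict_to_same_domain and _origin(url) != _origin(base_url):
        return False
    # 2. Filter patterns: when given, the URL must match at least one.
    if filter_urls and not any(re.search(p, url) for p in filter_urls):
        return False
    # 3. Exclude patterns: a single match is enough to skip the URL.
    if exclude_urls and any(re.search(p, url) for p in exclude_urls):
        return False
    return True
```

If matching were anchored at the start of the URL instead (`re.match`), a relative pattern such as `docs/tutorial/.*` would need a leading `.*` or the full URL prefix to match an absolute URL.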

Testing

Added integration tests to verify:

  • Basic URL exclusion functionality
  • Interaction between filter and exclude patterns
  • Edge cases with domain restrictions
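
The first two checks could also be exercised without network access by running against an in-memory sitemap. The sketch below uses a hypothetical stand-in, `select_urls`, built on the standard library rather than on SitemapLoader itself, and again assumes `re.search` semantics; the sample URLs are made up for illustration.

```python
import re
import unittest
import xml.etree.ElementTree as ET

# Made-up sitemap used only for this sketch.
SITEMAP_XML = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://my-docs.com/docs/intro</loc></url>
  <url><loc>https://my-docs.com/docs/tutorial/basics</loc></url>
  <url><loc>https://my-docs.com/docs/api/v1/users</loc></url>
  <url><loc>https://my-docs.com/blog/release</loc></url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def select_urls(sitemap_xml, filter_urls=(), exclude_urls=()):
    """Hypothetical stand-in for the loader's URL selection step."""
    root = ET.fromstring(sitemap_xml.strip())
    selected = []
    for loc in root.findall("sm:url/sm:loc", NS):
        url = (loc.text or "").strip()
        # Keep only URLs matching a filter pattern, when filters are given.
        if filter_urls and not any(re.search(p, url) for p in filter_urls):
            continue
        # Skip URLs matching any exclude pattern.
        if any(re.search(p, url) for p in exclude_urls):
            continue
        selected.append(url)
    return selected


class ExcludeUrlsTest(unittest.TestCase):
    def test_basic_exclusion(self):
        urls = select_urls(SITEMAP_XML, exclude_urls=["docs/tutorial/.*"])
        self.assertNotIn("https://my-docs.com/docs/tutorial/basics", urls)
        self.assertEqual(len(urls), 3)

    def test_filter_and_exclude_interaction(self):
        urls = select_urls(
            SITEMAP_XML,
            filter_urls=["/docs/"],
            exclude_urls=["docs/api/v1/.*"],
        )
        self.assertEqual(
            urls,
            [
                "https://my-docs.com/docs/intro",
                "https://my-docs.com/docs/tutorial/basics",
            ],
        )
```

Saved as a test module, this runs with `python -m unittest` and needs no live site.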

Documentation

Updated docstrings and comments to explain:

  • Parameter usage and behavior
  • Regex pattern interpretation
  • Example use cases

Add exclude_urls parameter to SitemapLoader to allow excluding specific URLs
from being loaded. This is particularly useful for incremental content syncing
scenarios, such as Docusaurus documentation updates.

- Add exclude_urls parameter with regex pattern support
- Update DocusaurusLoader to expose the new parameter
- Add integration tests for URL exclusion functionality
- Update documentation to reflect new capability
@CaillaudPA force-pushed the feature/sitemap-loader-exclude-urls branch from 7537527 to 2985937 on May 14, 2025 09:57
@@ -84,6 +84,26 @@ def test_filter_sitemap() -> None:
    assert "LangChain Python API" in documents[0].page_content


def test_exclude_sitemap() -> None:

Contributor

Is there a way to create a unit test so that the testing happens on every PR? In langchain-community, integration tests aren't run as part of CI.

@eyurtsev self-assigned this on Jun 2, 2025
2 participants