feat(document_loaders): add exclude_urls to SitemapLoader #56

CaillaudPA · 2025-05-14T09:37:48Z

Description

This PR adds an exclude_urls parameter to the SitemapLoader class, allowing users to specify patterns for URLs that should be excluded from loading. This feature is particularly valuable for incremental content synchronization scenarios.

Key Use Case: Docusaurus Incremental Sync

When working with Docusaurus documentation sites, a common need is to only sync new or updated pages while skipping already processed content. This feature enables efficient incremental syncing by:

Maintaining a list of already processed URLs
Using exclude_urls to skip those URLs in subsequent runs
Only loading and processing new or modified content

Example usage with DocusaurusLoader:

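from langchain_community.document_loaders import DocusaurusLoader

# Regex patterns covering URLs that earlier runs already processed
processed_urls = ["docs/tutorial/.*", "docs/api/v1/.*"]

# Skip anything matching those patterns so only new content is loaded
loader = DocusaurusLoader(
    "https://my-docs.com",
    exclude_urls=processed_urls,
    continue_on_failure=True,
)
documents = loader.load()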
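Because exclude_urls takes regex patterns rather than literal URLs, a stored list of already processed URLs would need escaping before it is passed in. The sketch below shows one way to do that; urls_to_exclude_patterns is a hypothetical helper, not part of this PR, and the patterns are anchored on both ends so they behave the same whether the loader matches with re.match or re.search.

```python
import re


def urls_to_exclude_patterns(processed_urls):
    """Turn literal URLs into anchored regex patterns (hypothetical helper)."""
    # set() drops duplicates, sorted() keeps the output deterministic,
    # re.escape() stops characters such as "." from acting as wildcards.
    return [f"^{re.escape(url)}$" for url in sorted(set(processed_urls))]


# Example: URLs recorded by a previous sync run (illustrative values).
seen = [
    "https://my-docs.com/docs/intro",
    "https://my-docs.com/docs/intro",
    "https://my-docs.com/docs/api/v1/users",
]
exclude_patterns = urls_to_exclude_patterns(seen)
```

The resulting list can then be passed as exclude_urls in place of hand-written patterns.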

Benefits

Reduced Processing Time: Skip already processed pages
Lower Resource Usage: Avoid redundant content downloads
Efficient Updates: Focus only on new or modified content
Flexible Pattern Matching: Use regex patterns for precise control

Implementation Details

Added exclude_urls parameter that accepts a list of regex patterns
Patterns are checked after domain restrictions and filter patterns
URLs matching any exclude pattern are skipped
Maintains backward compatibility with existing usage
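The ordering described above can be sketched as a standalone predicate. This is an illustration only, not the code in this PR: should_load and allowed_domain are hypothetical names, and the use of re.match (rather than re.search) is an assumption.

```python
import re
from urllib.parse import urlparse


def should_load(url, *, allowed_domain=None, filter_urls=None, exclude_urls=None):
    """Hypothetical predicate mirroring the described check order."""
    # 1. Domain restriction: reject URLs hosted on another domain.
    if allowed_domain is not None and urlparse(url).netloc != allowed_domain:
        return False
    # 2. Filter patterns: when given, the URL must match at least one.
    if filter_urls and not any(re.match(p, url) for p in filter_urls):
        return False
    # 3. Exclude patterns: a match on any pattern skips the URL.
    if exclude_urls and any(re.match(p, url) for p in exclude_urls):
        return False
    return True
```

With exclude_urls left at None the third check never fires, which is how existing callers would keep their current behavior.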

Testing

Added integration tests to verify:

Basic URL exclusion functionality
Interaction between filter and exclude patterns
Edge cases with domain restrictions
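The first two cases can also be checked without network access by parsing a sitemap held in memory. The sketch below is a rough illustration under stated assumptions: select_urls is a hypothetical stand-in for the loader's URL selection (it does not call SitemapLoader), the fixture URLs are invented, and re.match is assumed as the matching function.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fixture: a tiny sitemap held in memory instead of fetched.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/docs/api/v1/users</loc></url>
  <url><loc>https://example.com/blog/launch</loc></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def select_urls(sitemap_xml, filter_urls=None, exclude_urls=None):
    """Stand-in for the loader's URL selection (not SitemapLoader itself)."""
    root = ET.fromstring(sitemap_xml)
    selected = []
    for loc in root.findall("sm:url/sm:loc", NS):
        url = (loc.text or "").strip()
        # Filter patterns are applied first, then exclude patterns.
        if filter_urls and not any(re.match(p, url) for p in filter_urls):
            continue
        if exclude_urls and any(re.match(p, url) for p in exclude_urls):
            continue
        selected.append(url)
    return selected


def test_exclude_only():
    urls = select_urls(SITEMAP_XML, exclude_urls=[r".*/docs/api/v1/.*"])
    assert urls == [
        "https://example.com/docs/intro",
        "https://example.com/blog/launch",
    ]


def test_filter_then_exclude():
    urls = select_urls(
        SITEMAP_XML,
        filter_urls=[r".*/docs/.*"],
        exclude_urls=[r".*/api/.*"],
    )
    assert urls == ["https://example.com/docs/intro"]
```

Both test functions follow pytest naming, so they would be collected by a plain pytest run.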

Documentation

Updated docstrings and comments to explain:

Parameter usage and behavior
Regex pattern interpretation
Example use cases

Add exclude_urls parameter to SitemapLoader to allow excluding specific URLs from being loaded. This is particularly useful for incremental content syncing scenarios, such as Docusaurus documentation updates.

- Add exclude_urls parameter with regex pattern support
- Update DocusaurusLoader to expose the new parameter
- Add integration tests for URL exclusion functionality
- Update documentation to reflect new capability

eyurtsev · 2025-06-02T20:58:09Z

libs/community/tests/integration_tests/document_loaders/test_sitemap.py

@@ -84,6 +84,26 @@ def test_filter_sitemap() -> None:
    assert "LangChain Python API" in documents[0].page_content


+def test_exclude_sitemap() -> None:


Is there a way to create a unit test so that the testing happens on every PR? In langchain-community, integration tests aren't run as part of CI.

CaillaudPA force-pushed the feature/sitemap-loader-exclude-urls branch from 7537527 to 2985937 Compare May 14, 2025 09:57

eyurtsev reviewed Jun 2, 2025

eyurtsev self-assigned this Jun 2, 2025
