feat(document_loaders): add exclude_urls to SitemapLoader #56
Description
This PR adds an `exclude_urls` parameter to the `SitemapLoader` class, allowing users to specify patterns for URLs that should be excluded from loading. This feature is particularly valuable for incremental content synchronization scenarios.
Key Use Case: Docusaurus Incremental Sync
When working with Docusaurus documentation sites, a common need is to sync only new or updated pages while skipping already processed content. This feature enables efficient incremental syncing by passing the already processed URLs to `exclude_urls` so that they are skipped in subsequent runs.
Example usage with DocusaurusLoader:
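A minimal sketch of the incremental workflow described above, assuming processed URLs are kept in a JSON state file between runs. The helper names (`load_processed`, `build_exclude_patterns`, `save_processed`) and the state file are hypothetical and not part of this PR. The loader call appears only as a comment because it depends on the `exclude_urls` parameter this PR introduces; its exact matching semantics are assumed here, so each pattern is escaped and anchored to match one exact URL.

```python
import json
import re
from pathlib import Path


def load_processed(state_file: Path) -> list[str]:
    """Return URLs recorded by earlier runs (empty list on the first run)."""
    if not state_file.exists():
        return []
    return json.loads(state_file.read_text())


def build_exclude_patterns(urls: list[str]) -> list[str]:
    """Escape each URL and anchor it so the pattern matches only that page."""
    return [f"^{re.escape(url)}$" for url in urls]


def save_processed(state_file: Path, urls: list[str]) -> None:
    """Persist the de-duplicated, sorted set of processed URLs."""
    state_file.write_text(json.dumps(sorted(set(urls)), indent=2))


# Assumed usage with the parameter from this PR (not executed here):
#
#   state = Path("processed_urls.json")          # hypothetical state file
#   done = load_processed(state)
#   loader = DocusaurusLoader(
#       "https://example.com",                   # placeholder site
#       exclude_urls=build_exclude_patterns(done),
#   )
#   docs = loader.load()
#   save_processed(state, done + [d.metadata["source"] for d in docs])
```

Anchoring with `^` and `$` keeps a processed page such as `/docs/a` from also excluding `/docs/a/sub`, whichever of `re.match` or `re.search` the loader uses internally.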
Benefits
Implementation Details
Adds an `exclude_urls` parameter that accepts a list of regex patterns.
Testing
Added integration tests to verify:
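The PR's actual tests are not reproduced here. As a sketch of the kind of check involved, the following uses a stand-in `apply_exclusions` helper instead of the loader so that it runs without network access. The helper, its `re.search` matching, and the sample URLs are assumptions for illustration only.

```python
import re


def apply_exclusions(urls, exclude_urls=None):
    """Stand-in for the exclusion step: drop URLs matching any regex pattern."""
    compiled = [re.compile(pattern) for pattern in (exclude_urls or [])]
    return [url for url in urls if not any(c.search(url) for c in compiled)]


SAMPLE_URLS = [
    "https://example.com/docs/intro",
    "https://example.com/docs/api",
    "https://example.com/blog/release-notes",
]


def test_matching_urls_are_excluded():
    kept = apply_exclusions(SAMPLE_URLS, exclude_urls=[r"/blog/"])
    assert kept == [
        "https://example.com/docs/intro",
        "https://example.com/docs/api",
    ]


def test_no_patterns_keeps_everything():
    assert apply_exclusions(SAMPLE_URLS) == SAMPLE_URLS
    assert apply_exclusions(SAMPLE_URLS, exclude_urls=[]) == SAMPLE_URLS


def test_multiple_patterns_are_combined():
    kept = apply_exclusions(SAMPLE_URLS, exclude_urls=[r"/blog/", r"/api$"])
    assert kept == ["https://example.com/docs/intro"]
```

The functions follow the pytest naming convention, so they can be collected by `pytest` or called directly.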
Documentation
Updated docstrings and comments to explain: