Recursive Sitemap Index Parsing #165

Open
shutupflanders opened this issue Jul 29, 2024 · 1 comment

Comments

@shutupflanders

This is a nice tool, I'll certainly be using it a lot more moving forward.

However, I noticed when testing a website that has a sitemap index file, it doesn't recursively parse the sitemaps within:

image

No biggie, but it would be good to see the full result set if possible.

@varenc

varenc commented Jul 29, 2024

++ to this.

Sitemaps that exist as a single file are usually the small ones that are easy to look over manually. The ones that use many tiered layers of indirect references are exactly the ones where a tool is most valuable. One example of a complex multi-file sitemap: https://www.apple.com/sitemap.xml

The sitemap implementation I found here appears to make some other overly simple assumptions. Roughly:

  • If it finds /sitemap.xml at the root, it doesn't look in robots.txt, whereas I believe both can be valid.
  • In robots.txt, it assumes that only one sitemap URL is specified, but the spec allows multiple sitemap files to be listed. Example: https://www.apple.com/robots.txt
  • And like @shutupflanders said, it doesn't follow indirectly referenced sitemaps and assumes all sitemap content exists in one file.
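To make the first two points concrete, here is a rough Python sketch of sitemap discovery that reads every `Sitemap:` line in robots.txt and also keeps the conventional /sitemap.xml location as a candidate. This is not the project's code; the function name `discover_sitemap_urls` and its parameters are made up for illustration, and it takes the robots.txt body as a string so that fetching stays out of scope.

```python
from urllib.parse import urljoin


def discover_sitemap_urls(base_url, robots_txt):
    """Collect candidate sitemap URLs (hypothetical helper, not the tool's API).

    Gathers every "Sitemap:" line from robots.txt and also adds the
    conventional /sitemap.xml location instead of treating them as exclusive.
    """
    found = []
    for line in robots_txt.splitlines():
        # Drop trailing comments; assumes sitemap URLs carry no '#' fragment.
        line = line.split("#", 1)[0].strip()
        # Split on the first colon only, so the URL's own "https:" survives.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    # Keep the root location as a candidate as well.
    found.append(urljoin(base_url, "/sitemap.xml"))
    # De-duplicate while preserving order.
    return list(dict.fromkeys(found))
```

Each returned URL is only a candidate; a caller would still need to check that it actually resolves before parsing it.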

Building this out fully is equivalent to just building a real sitemap parser for indexing, so perhaps one of those can just be repurposed.
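For reference, the recursive part by itself is fairly small. Below is a minimal Python sketch, assuming a caller-supplied `fetch(url)` function that returns the sitemap XML as text or bytes. `collect_urls`, `fetch`, and `max_depth` are names invented here, not anything from this project. It follows `<sitemapindex>` entries recursively and returns the page URLs from each `<urlset>`.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the "{namespace}" prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def collect_urls(sitemap_url, fetch, seen=None, max_depth=5):
    """Return page URLs from a sitemap, following sitemap index files.

    fetch is a hypothetical callable: URL in, XML text (or bytes) out.
    seen guards against index files that reference each other in a loop;
    max_depth is an arbitrary safety limit on nesting.
    """
    seen = set() if seen is None else seen
    if sitemap_url in seen or max_depth < 0:
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    is_index = local_name(root.tag) == "sitemapindex"

    urls = []
    # Each direct child is a <sitemap> (in an index) or a <url> (in a urlset).
    for entry in root:
        loc = next(
            (c.text.strip() for c in entry
             if local_name(c.tag) == "loc" and c.text),
            None,
        )
        if loc is None:
            continue
        if is_index:
            # An index entry points at another sitemap: recurse into it.
            urls.extend(collect_urls(loc, fetch, seen, max_depth - 1))
        else:
            urls.append(loc)
    return urls
```

The sketch leaves out gzip-compressed sitemaps, HTTP errors, and malformed XML, which a real parser would have to handle.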
