What problem does this solve?
Currently, when afdocs processes llms.txt to get a list of pages to check, it strips the .md extension and uses the remaining path as the URL, on the assumption that the pages use "clean" URL paths with no extension. However, this isn't always the case: some docs still use the (older but still quite valid) filename.ext pattern. For those sites, the checker just gets whatever response the server is set up to return for the extensionless path (a 404, a redirect, or something else), which can skew the results of the check in ways that wouldn't be valid for a real agent.
Example:
https://www.example.com/filename.md in llms.txt is converted to https://www.example.com/filename by afdocs, but the server uses real filenames and is expecting https://www.example.com/filename.html. The exact result will depend on how that particular server handles those cases, but it'll quite likely be a 404.
What would you like to see?
Ideally, the checker could magically sniff out what URL pattern a site uses, but a simpler solution would probably be to add a flag that lets the user specify how paths should be processed.
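To make the flag idea concrete, here's a rough sketch of the path handling I have in mind. This is not afdocs's actual code, and it's in Python purely for illustration; the function name `resolve_page_url` and the `extension` parameter are made up for this example. The only assumption is that the user can supply the extension that should replace .md.

```python
from urllib.parse import urlsplit, urlunsplit


def resolve_page_url(llms_url: str, extension: str = "") -> str:
    """Map an llms.txt entry to the page URL to check (hypothetical helper).

    extension=""      strips .md, matching the behaviour described above
    extension=".html" targets sites that serve real filenames
    extension=".md"   leaves the entry unchanged
    """
    parts = urlsplit(llms_url)
    path = parts.path
    # Only rewrite paths that actually end in .md; leave everything else alone.
    if path.endswith(".md"):
        path = path[: -len(".md")] + extension
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


print(resolve_page_url("https://www.example.com/filename.md"))
print(resolve_page_url("https://www.example.com/filename.md", ".html"))
```

With the default, the example URL becomes https://www.example.com/filename as it does today; passing ".html" yields https://www.example.com/filename.html, which is what the server in the example expects.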
Alternatives considered
Can't really think of one.