Fix downloading a linked css resource with no extension #636
Summary
While trying to download a textbook written with PreTeXt, I ran into an issue where some Google icons it uses were not rendering properly. I found two bugs. The first was incorrect filtering of CSS resources in link tags. The second was that the ArchiveDownloader, when it hit a linked resource with no extension, placed the generated index.css filename at the root of the download instead of nesting it into sub-directories derived from the resource's domain and URL path. This caused further resources requested by the CSS file to incorrectly reference paths outside of the download directory entirely.
I saw this in the run of the ArchiveDownloader, which meant that this URL was not being downloaded/rewritten:
Skipping node with src https://fonts.googleapis.com/css2?family=Material+Symbols+Outlined:opsz,wght,FILL,GRAD@24,400,0,0

This is the relevant line from the source site:
<link rel="stylesheet" href="https://fonts.googleapis.com/css2?family=Material+Symbols+Outlined:opsz,wght,FILL,GRAD@24,400,0,0">

I found a bug in css_node_filter: it was excluding this tag even though it had a "rel" attribute. While the attributes can be accessed from the root Tag object like node["rel"], the "in" operator doesn't work, because the attributes live in a sub-property that gets re-exposed by the class.
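For context, a minimal sketch of that behavior, assuming css_node_filter operates on BeautifulSoup Tag objects (as the Tag terminology suggests); this snippet is illustrative and not code from the library:

```python
from bs4 import BeautifulSoup

html = (
    '<link rel="stylesheet" '
    'href="https://fonts.googleapis.com/css2?family=Material+Symbols+Outlined">'
)
node = BeautifulSoup(html, "html.parser").link

# Indexing delegates to the attrs dict, so this works:
print(node["rel"])          # ['stylesheet']

# But membership on the Tag itself checks its child contents,
# not its attributes, so this is False even though rel is set:
print("rel" in node)        # False

# Membership has to be checked against the attrs sub-property:
print("rel" in node.attrs)  # True
```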
node["rel"], theinoperator doesn't work as they are in a sub-property that gets re-exposed by the class.Observed results of download with this first change to fix the css resource filtering
The index.css file that was placed at the root had these contents, meaning it was making a relative path reference outside of the download directory:
It seemed like the correct fix was to place this index.css down in the file-system hierarchy based on the resource URL, rather than change how the relative path was being generated. The fix was simple once I found the relevant line of code to change. The diff is currently a little bigger than needed because I assumed I might want to escape the URL when turning it into a file path. Thinking on it longer, anything that can appear in an HTTP URL is probably fine in a unix path (assuming this tool is only going to be run in unix environments), with slashes used to create sub-directories. I can simplify it down to the one-line version of the fix and remove the TODO if you agree.
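For reference, a rough sketch of the kind of URL-to-path mapping I mean; the derive_download_path name and the index.css default are made up for illustration and are not the actual ArchiveDownloader code (query-string handling is also glossed over here):

```python
import os
from urllib.parse import urlparse

def derive_download_path(url, default_name="index.css"):
    """Nest a generated filename under the resource's domain and URL path
    instead of dropping it at the root of the download directory."""
    parts = urlparse(url)
    path = parts.path.lstrip("/")
    # If the last path segment has no extension, treat it as a directory
    # and generate the filename inside it.
    if "." not in os.path.basename(path):
        path = os.path.join(path, default_name)
    return os.path.join(parts.netloc, path)

print(derive_download_path(
    "https://fonts.googleapis.com/css2?family=Material+Symbols+Outlined"
))
# fonts.googleapis.com/css2/index.css
```

With a layout like this, the relative references generated inside the CSS file stay within the download directory instead of escaping above its root.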
References
Page I was downloading:
https://activecalculus.org/single2e/frontmatter.html
Reviewer guidance
I'm interested in adding an automated test. I don't necessarily want to hot-link to the source site in a test, but since this is just for the scraping library, I assume these tests aren't run that often. I'm also considering whether it would be useful to add a small test tool that lets us check in a directory of test web resources and "download" them by exposing them through a little local webserver; a rough sketch of the idea is below. Happy to hear any thoughts or guidance on testing.
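A minimal sketch of that idea, assuming pytest and Python's standard-library http.server; the local_site fixture name and the web_fixtures directory are hypothetical:

```python
import threading
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path

import pytest

# Hypothetical directory of checked-in test web resources.
FIXTURES_DIR = Path(__file__).parent / "web_fixtures"

@pytest.fixture
def local_site():
    """Serve the checked-in web resources on an ephemeral local port."""
    handler = partial(SimpleHTTPRequestHandler, directory=str(FIXTURES_DIR))
    server = ThreadingHTTPServer(("127.0.0.1", 0), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        yield f"http://127.0.0.1:{server.server_port}/"
    finally:
        server.shutdown()
```

A test could then point the ArchiveDownloader at the local_site URL plus a page path and assert on the files it writes, without touching the real source site.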