Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Double encoding of anchor tag href #188

Open
marcelreppi opened this issue Jan 7, 2023 · 4 comments
Open

Double encoding of anchor tag href #188

marcelreppi opened this issue Jan 7, 2023 · 4 comments

Comments

@marcelreppi
Copy link
Contributor

Hey,

I've been using your amazing library to extract links from DOM content in a service worker and noticed that achor tags which, for example, lead to some file which has already encoded space characters in the URL (www.test.com/path%20to%20some%20file.pdf) end up with a broken href because the % character is encoded again (www.test.com/path%2520to%2520some%2520file.pdf).

I have a workaround by just calling decodeURIComponent on the href attribute before processing it but I guess that is not the intended behavior and shouldn't be like that.
I am sure that it's related to #49 and the fix for it here

get href() { return encodeURI(stringAttribute.get(this, 'href')); }

I would guess and hope it's not a complicated fix.

Thanks a lot for your work!

@WebReflection
Copy link
Owner

this is weird to fix because AFAIK the href getter does sanitize its content as attribute ... can you please write a test that works on browsers but fails in here?

@marcelreppi
Copy link
Contributor Author

marcelreppi commented May 12, 2023

Imagine an anchor tag pointing to a file. This file has space characters in its filename which means the " " character will already be encoded as "%20". For example:

<a href="https://www.westonpriory.org/esales/lyrics/Something%20Which%20Is%20Known.pdf"></a>

When I add the following test to test/html/anchor-element.js it fails.

a.setAttribute('href', 'https://www.westonpriory.org/esales/lyrics/Something%20Which%20Is%20Known.pdf');
assert(a.href, 'https://www.westonpriory.org/esales/lyrics/Something%20Which%20Is%20Known.pdf');

I would expect to get the same link because otherwise the link is broken.

 Expected: https://www.westonpriory.org/esales/lyrics/Something%20Which%20Is%20Known.pdf
 Got instead: https://www.westonpriory.org/esales/lyrics/Something%2520Which%2520Is%2520Known.pdf

I opened a PR to fix this: #204.

@WebReflection
Copy link
Owner

Agreed and I hate double encoding/decoding shenanigans myself ... the thing I am not understanding is where that string gets encoded in the first place and, in case I really want to write manually a url with %20 in it, if it's the right thing to decide that regardless as space ... if it's the 3rd party library doing that, or me doing that while processing the HTML/XML/SVG though, I believe the gotcha should be solved there or I should pay more attention to every other sensible accessor that points at attributes already encoded ... thoughts?

@marcelreppi
Copy link
Contributor Author

I am thinking mainly of file servers that serve static files and <a> tags that point to those files. When those file names contain spaces, the paths to those files must encode special characters like spaces (unless the server has some internal logic to make them available under different paths). In other cases like you mentioned with 3rd party libraries I would agree with you.

If you have concerns I can also just use the workaround that I mentioned in the issue to move that logic outside of the library but then, in my case, the library would not have the same behavior as regular DOM queries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants