Improper handling of national characters #433

Open
ArturRuta opened this issue Feb 7, 2025 · 1 comment
Occasionally, when processing articles, wallabag does not handle national characters properly.
The behavior is deterministic in the sense that for a given page the result is always the same: the page is either processed properly or it is not.

This URL corresponds to a page that is always processed improperly: error example

  • The article title is handled properly even though it contains national characters. For example, it contains the word: más
  • The article content, on the other hand, is not handled properly. Very early in the text, for example, the word that should read automóvil is displayed with garbled characters.
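To illustrate the kind of garbling described above, here is a minimal sketch. It assumes the classic case where UTF-8 bytes are decoded with a single-byte encoding such as ISO-8859-1; nothing in this issue confirms that this is what actually happens for the problem page, and `simulate_mojibake` is a hypothetical helper name, not part of wallabag or graby.

```python
def simulate_mojibake(text: str, wrong_encoding: str = "latin-1") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong encoding.

    Hypothetical helper: it only reproduces one common failure mode
    (UTF-8 read as ISO-8859-1) and is an assumption about the cause.
    """
    return text.encode("utf-8").decode(wrong_encoding)


# "ó" is two bytes in UTF-8 (0xC3 0xB3); read as ISO-8859-1 they become "Ã³".
print(simulate_mojibake("automóvil"))  # prints "automÃ³vil"
```

If the garbled word in the stored article matches this pattern, a charset mismatch during fetching or parsing would be a plausible place to look.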

Surprisingly enough, some articles are handled properly. This URL from the same site also contains national characters but is processed correctly: correct sample

I've done some research.

  • Looking into the PostgreSQL tables where content is recorded, I see in the entry table that the content is already garbled there. Therefore it is not a matter of how the article is rendered when it is displayed; the problem arises earlier, when the article is parsed.
  • I've tested the problem URL on the site f43.me and the problem is reproduced there: the text shows improper character combinations wherever national characters are present. When I enable debug on that site, no errors are reported. Curiously enough, the language is properly identified as es (which stands for Spanish).
  • Finally, I've enabled graby debug logs, collected them while the article was being parsed, and will attach them to this issue.
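The check on the stored content can be made less dependent on eyeballing. The following is a rough sketch, assuming the garbling is UTF-8 text that was decoded as ISO-8859-1 (an assumption, not something the findings above establish). `repair_if_mojibake` is a hypothetical helper; it would be applied to a string copied from the entry table.

```python
def repair_if_mojibake(text: str) -> tuple[str, bool]:
    """Try to undo a UTF-8-read-as-ISO-8859-1 mix-up.

    Returns (text, False) when the input does not fit that pattern,
    or (repaired_text, True) when reversing the mix-up changes the text.
    Heuristic only: it assumes one specific kind of garbling.
    """
    try:
        # Reverse the suspected wrong step: back to bytes, then decode as UTF-8.
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either characters outside ISO-8859-1 or bytes that are not valid UTF-8,
        # so the text does not look like this kind of mojibake.
        return text, False
    return repaired, repaired != text


print(repair_if_mojibake("automÃ³vil"))  # garbled input fits the pattern
print(repair_if_mojibake("más"))         # correctly stored text is left alone
```

A stored string that comes back flagged as repaired would support the charset-mismatch theory; one that does not would point to a different cause.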

ArturRuta commented Feb 7, 2025

Please find the graby log below
graby.log

As far as I can see, no error is reported, and the log shows that the content already contains garbage characters. But moving from there to a possible solution is really beyond my capabilities.

Thanks a lot in advance for any help.
