remove Kconv.toutf8 conversion #16

Open: pessi-v wants to merge 1 commit into master

pessi-v commented Jul 4, 2024

In lib/ogpr/fetcher/html_fetcher.rb:20, the fetched response body (from which the meta tag content is read) is forced to UTF-8 using the stdlib Kconv. This conversion seems unnecessary, and it also introduces many wrongly converted characters. In my use case, many accented Latin letters are converted to Chinese characters. The same seems to happen with some punctuation.
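
One plausible mechanism for the symptom described above can be sketched without Kconv at all. This is only an illustration, assuming the encoding guesser picks a Japanese multibyte encoding for input that is really UTF-8; it does not call Kconv and does not claim to reproduce its detection logic. The helper name `decode_as` is hypothetical.

```ruby
# Hypothetical helper for illustration: interpret raw bytes as `encoding`
# and transcode the result to UTF-8.
def decode_as(bytes, encoding)
  bytes.dup.force_encoding(encoding).encode(Encoding::UTF_8)
end

# The raw UTF-8 bytes of "café": the "é" is the two bytes 0xC3 0xA9.
raw = "caf\u00E9".b

correct = decode_as(raw, Encoding::UTF_8)   # bytes read as what they are
misread = decode_as(raw, Encoding::EUC_JP)  # same bytes, wrong assumption

puts correct          # the original text
puts misread          # "caf" followed by a single CJK character
puts misread.length   # still 4 characters: the two bytes of "é" became one
```

Because 0xC3 0xA9 is also a valid two-byte sequence in EUC-JP, the misreading raises no error; it silently yields a different character.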

pessi-v (Author) commented Jul 4, 2024

@hirakiuc

hirakiuc (Owner) commented Jul 5, 2024

Thanks for your report, and this PR. 😃

But first, I don't recommend using this gem in production 🙏🏼
This library was implemented several years ago and has not been well maintained for a long time.

@@ -17,7 +17,6 @@ def fetch(headers = {})
 acceptable_content!(head.headers[:content_type])

 res = send_request(:get, @uri, headers)
-Kconv.toutf8(res.to_str)
Owner

Summary: in my opinion, this behavior (converting string encodings in this gem) should be configurable to support various use cases, rather than simply removing this line.

Read on for the details. 🙏🏼


First, let's check the String value in the OGP spec.
https://ogp.me/#string 👀

As you can see in the official docs, a String value is described as "A sequence of Unicode characters" (Unicode, but not specifically UTF-8).
So I think this gem should follow the String value spec as far as possible.

Based on this, and just for my personal use,
I decided to convert the web content (meta tags) to UTF-8.
(I think this is the root cause of the encoding issues in this gem, and it was my bad decision. 😢)

However, web content (in this context, especially meta tag values in HTML files) can be in various encodings, as you know.
If your PR were merged, users of this library would have to handle OGP string encoding themselves, without any additional information (such as which encoding each website used).

For this reason, I don't think that simply removing the encoding conversion, as this PR does, is the best approach. 🤔

So, as I wrote at the top of this comment,
my opinion is that this behavior (converting string encodings in this gem) should be configurable for various cases.
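
As a rough illustration of what "configurable" could look like, here is a minimal sketch that uses only Ruby core methods. It assumes the conversion is driven by the charset declared in the Content-Type header rather than by guessing, and that callers can switch it off. The names `charset_from` and `to_utf8` and the `convert:` and `fallback:` options are hypothetical; none of them exist in this gem.

```ruby
# Hypothetical helper: extract the charset parameter from a Content-Type
# header value such as 'text/html; charset=ISO-8859-1'. Returns nil if absent.
def charset_from(content_type)
  return nil if content_type.nil?

  match = content_type.match(/charset\s*=\s*"?([^\s;"]+)"?/i)
  match && match[1]
end

# Hypothetical helper: convert `body` to UTF-8 using the declared charset.
# convert: false returns the body untouched, so callers can opt out.
def to_utf8(body, content_type: nil, convert: true, fallback: 'UTF-8')
  return body unless convert

  name = charset_from(content_type) || fallback
  source = begin
    Encoding.find(name)
  rescue ArgumentError
    # Unknown charset label: fall back instead of failing the whole fetch.
    Encoding.find(fallback)
  end

  # dup so the caller's string is not mutated by force_encoding.
  body.dup.force_encoding(source)
      .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

latin1_body = "caf\xE9".b # "café" as ISO-8859-1 bytes
puts to_utf8(latin1_body, content_type: 'text/html; charset=ISO-8859-1')
```

A real implementation would probably also need to consider a charset declared in the HTML itself (for example in a meta tag), which this sketch ignores.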

Owner

For now, I have simply opened a GitHub issue for this encoding problem: #17
