Check if "undeclared" encoding is really not UTF-8 before converting it to UTF-8 #133

Closed
ThomasLandauer opened this issue Aug 7, 2020 · 4 comments

@ThomasLandauer
Contributor

When I have an email body that is in fact UTF-8 but the email lacks any charset declaration, you treat it as ASCII and encode it to UTF-8. That is in sync with RFC 2045:

Default RFC 822 messages without a MIME Content-Type header are taken by this protocol to be plain text in the US-ASCII character set

So $message->getTextContent() gives me something like:

fÃ¶Ã¶bÃ¤r

Thunderbird does the same. However, php-mime-mail-parser's $parser->getMessageBody('text') keeps it as-is:

fööbär

I'm "fixing" this in my code with a check: if I utf8_decode() your output (i.e. "reverse" the conversion you did automatically), is the result still valid UTF-8?

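// Reverse the automatic conversion once, then test the result.
$decoded = utf8_decode($message->getTextContent());
if (mb_check_encoding($decoded, 'UTF-8'))
{
    // Still valid UTF-8 after decoding: the body was already UTF-8.
    $text = $decoded;
}
else
{
    $text = $message->getTextContent();
}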
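
For anyone who wants to try the round trip without a mail library, here is a minimal, self-contained sketch of the check above. The helper name `undoLatin1ToUtf8` is hypothetical and not part of any library mentioned in this thread. It uses `mb_convert_encoding()` in place of `utf8_decode()`, which is deprecated as of PHP 8.2; the two should behave roughly the same for valid input. It assumes the mbstring extension is loaded and that the automatic conversion was ISO-8859-1 to UTF-8.

```php
<?php
// Hypothetical helper for illustration only; not an API of any mail parser.
function undoLatin1ToUtf8(string $text): string
{
    // Reverse an assumed ISO-8859-1 -> UTF-8 conversion
    // (roughly what utf8_decode() does).
    $decoded = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');

    // If the reversed bytes are still valid UTF-8, guess that the original
    // body was UTF-8 and had been encoded a second time.
    return mb_check_encoding($decoded, 'UTF-8') ? $decoded : $text;
}

// "fööbär" as UTF-8 bytes.
$original = "f\u{F6}\u{F6}b\u{E4}r";

// Simulate the automatic conversion applied to an undeclared UTF-8 body.
$doubleEncoded = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// A body that really was ISO-8859-1 and was correctly converted to UTF-8.
$reallyLatin1 = "f\xF6\xF6b\xE4r";
$converted = mb_convert_encoding($reallyLatin1, 'UTF-8', 'ISO-8859-1');

var_dump(undoLatin1ToUtf8($doubleEncoded) === $original); // bool(true)
var_dump(undoLatin1ToUtf8($converted) === $converted);    // bool(true)
```

This is only a heuristic: text that genuinely contains a sequence such as "Ã¶" would be "repaired" incorrectly.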
@zbateson
Owner

zbateson commented Aug 7, 2020

I recommend overriding the default charset in use in that case:

$message->getHtmlPart()->setCharsetOverride('utf-8');
echo $message->getHtmlContent();

You can read more about default charsets here: #110

@ThomasLandauer
Contributor Author

That won't work in this case, because I don't know the charset in advance either ;-)

@zbateson
Owner

Yup, there's no easy fix. If you want to use mb_check_encoding yourself, though, you're free to do so. There's no 'good way' of knowing what a charset is without some indication: all that function does is check that the bytes in the passed string 'pass' as the charset you give it. It doesn't know whether that's the original charset.

@ThomasLandauer
Contributor Author

Yes, indeed. As far as I can see, mb_check_encoding() just checks whether each byte (or byte sequence) is a formally valid character. So mb_check_encoding('fÃ¶Ã¶bÃ¤r', 'UTF-8') returns true as well. The only thing mb_check_encoding() saves you from is totally invalid characters, as mentioned at php-mime-mail-parser/php-mime-mail-parser#329
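
A small sketch of that point, assuming the mbstring extension is available. The variable names are illustrative only. It shows that `mb_check_encoding()` validates byte sequences rather than detecting the original charset: both the correct string and its double-encoded form pass as UTF-8, and only bytes that are malformed as UTF-8 fail.

```php
<?php
// "fööbär" as UTF-8 bytes.
$utf8 = "f\u{F6}\u{F6}b\u{E4}r";

// The same text after a mistaken ISO-8859-1 -> UTF-8 conversion.
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// The same text as raw ISO-8859-1 bytes, which are not well-formed UTF-8.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

var_dump(mb_check_encoding($utf8, 'UTF-8'));          // bool(true)
var_dump(mb_check_encoding($doubleEncoded, 'UTF-8')); // bool(true): garbled but formally valid
var_dump(mb_check_encoding($latin1, 'UTF-8'));        // bool(false): malformed byte sequences

// A single-byte charset has very little to reject, so a check against it
// says almost nothing about where the bytes actually came from.
var_dump(mb_check_encoding($utf8, 'ISO-8859-1'));
```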

So I'm no longer using the code I posted here. Therefore I'm closing this.
