Check if "undeclared" encoding is really not UTF-8 before converting it to UTF-8 #133

ThomasLandauer · 2020-08-07T22:04:36Z

When I have an email body which is in fact UTF-8 but the email lacks any charset declaration, you treat it as ASCII and encode it to UTF-8. This is in line with RFC 2045:

Default RFC 822 messages without a MIME Content-Type header are taken by this protocol to be plain text in the US-ASCII character set

So $message->getTextContent() gives me something like:

fÃ¶Ã¶bÃ¤r

Thunderbird does the same. However, php-mime-mail-parser's $parser->getMessageBody('text') keeps it as-is:

fööbär
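A minimal sketch of how the garbled output above can arise, assuming the undeclared body is read as a single-byte charset (ISO-8859-1 here, which matches the characters shown) and then re-encoded to UTF-8. This uses only PHP's mbstring extension and is not the library's actual code path; the variable names are illustrative.

```php
<?php
// UTF-8 bytes for "fööbär", written with escapes so this file's own encoding does not matter.
$utf8Body = "f\xC3\xB6\xC3\xB6b\xC3\xA4r";

// Assumption: each byte is taken as one ISO-8859-1 character and re-encoded as UTF-8.
// Every byte >= 0x80 becomes a two-byte sequence, so "ö" (C3 B6) turns into "Ã¶".
$converted = mb_convert_encoding($utf8Body, 'UTF-8', 'ISO-8859-1');

echo strlen($utf8Body), " bytes before, ", strlen($converted), " bytes after\n";
echo bin2hex($converted), "\n";
```

The string grows from 9 to 15 bytes because each of the six non-ASCII bytes is doubled.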

I'm "fixing" this in my code by checking: When I utf8_decode() your output (i.e. "reverse" the conversion you did automatically), is it still UTF-8?:

$content = $message->getTextContent();
// Reverse the automatic conversion to UTF-8, then check whether the result is still valid UTF-8.
$decoded = utf8_decode($content);
if (mb_check_encoding($decoded, 'UTF-8'))
{
    $text = $decoded;
}
else
{
    $text = $content;
}


zbateson · 2020-08-07T22:58:01Z

I recommend overriding the default charset in use in that case:

$message->getHtmlPart()->setCharsetOverride('utf-8');
echo $message->getHtmlContent();

You can read more about default charsets here: #110

ThomasLandauer · 2020-08-11T13:35:22Z

That won't work in this case, because I don't know the charset in advance either ;-)

zbateson · 2020-08-11T15:28:47Z

Yup, there's no easy fix. If you want to use mb_check_encoding yourself, though, you're free to do so. There's no 'good way' of knowing that a charset is 'something' without some indication, though: all that function does is check that the bytes in the passed string 'pass' as the charset you give it. It doesn't know whether that's the original charset.

ThomasLandauer · 2020-08-20T16:32:06Z

Yes, indeed. As far as I see it, mb_check_encoding() just checks if each byte (or byte sequence) is a formally valid character. So mb_check_encoding('fÃ¶Ã¶bÃ¤r', 'UTF-8') returns true. The only thing that mb_check_encoding() saves you from is totally invalid characters, as mentioned at php-mime-mail-parser/php-mime-mail-parser#329
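A small sketch of that limitation, using only mbstring and illustrative variable names: the correct and the double-encoded string both pass as UTF-8, while raw ISO-8859-1 bytes do not.

```php
<?php
$correct = "f\xC3\xB6\xC3\xB6b\xC3\xA4r";                        // "fööbär" as UTF-8
$garbled = mb_convert_encoding($correct, 'UTF-8', 'ISO-8859-1'); // "fÃ¶Ã¶bÃ¤r" as UTF-8
$latin1  = "f\xF6\xF6b\xE4r";                                    // "fööbär" as ISO-8859-1

$correctOk = mb_check_encoding($correct, 'UTF-8'); // formally valid UTF-8
$garbledOk = mb_check_encoding($garbled, 'UTF-8'); // also formally valid, although garbled
$latin1Ok  = mb_check_encoding($latin1, 'UTF-8');  // 0xF6 and 0xE4 do not form valid UTF-8 sequences

var_dump($correctOk, $garbledOk, $latin1Ok);
```

So the check can rule out byte strings that cannot be UTF-8, but it cannot tell the intended text from its double-encoded form.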

So I'm no longer using the code I posted here. Therefore I'm closing this.

zbateson added the question label Aug 8, 2020

ThomasLandauer mentioned this issue Aug 11, 2020

Add configuration to set the default charset for content without a specified charset #110

Open

ThomasLandauer closed this as completed Aug 20, 2020
