Add configuration to set the default charset for content without a specified charset #110

Open
johnss opened this issue Feb 21, 2020 · 18 comments

Comments

@johnss

johnss commented Feb 21, 2020

It seems that QP decoding cannot handle multiple equal signs in a row; it only supports a single equal sign.

For example, =E2=80=93 should convert to – but it shows â��

@zbateson
Owner

Hi @johnss --

Is this a mime-encoded quoted printable part, or part of a message body? What encoding is used for the part? Preferably a full example would help me test it/confirm the issue...

All the best

@johnss
Author

johnss commented Feb 22, 2020

It's part of the message body, read via the getHtmlContent() method. The part uses quoted-printable as its content transfer encoding, and the HTML is UTF-8 encoded.

@johnss
Author

johnss commented Feb 22, 2020

I created it with Chrome on Android by saving a page as MHTML. It was originally a saved page from x.com, but I modified it to reproduce this issue.
Chrome shows –, while getHtmlContent() returns â��. Rename the attachment to a .mhtml or .mht extension to view it in Chrome.
x.com.txt

@johnss
Author

johnss commented Feb 22, 2020

Here is the bin2hex result:
bin2hex('–'); // e28093
bin2hex('â��'); // c3a2c280c293
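
The two hex strings above are consistent with the UTF-8 bytes being read as ISO-8859-1 and then converted to UTF-8. A minimal sketch that reproduces both values with PHP built-ins only, assuming the mbstring extension is available (it does not use the library at all):

```php
<?php
// Bytes produced by decoding the quoted-printable sequence from the report.
$bytes = quoted_printable_decode('=E2=80=93');
echo bin2hex($bytes), "\n"; // e28093 (UTF-8 for U+2013, the en dash)

// Misreading those bytes as ISO-8859-1 and converting them to UTF-8
// turns each byte into its own character, which gives the garbled output.
$misread = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
echo bin2hex($misread), "\n"; // c3a2c280c293

// Treating the bytes as UTF-8 (a no-op conversion) keeps the en dash intact.
$correct = mb_convert_encoding($bytes, 'UTF-8', 'UTF-8');
echo bin2hex($correct), "\n"; // e28093
```

This suggests the quoted-printable decoding itself works and that the charset assumed for the decoded bytes is what differs.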

@zbateson
Owner

Hi @johnss,

The html part of the message in your example doesn't correctly define a charset. You can manually override that if you want by calling setCharsetOverride, for example:

$message->getHtmlPart()->setCharsetOverride('utf-8');
echo $message->getHtmlContent();

All the best.

@johnss
Author

johnss commented Feb 24, 2020

Why isn't this mentioned in the docs? Please add it to the documentation.

@johnss
Author

johnss commented Feb 24, 2020

setCharsetOverride is only mentioned in the API docs generated by phpDocumentor, which few people visit, so many developers are not aware that the method exists. Please mention it on pages with higher traffic.

@johnss
Author

johnss commented Feb 25, 2020

What encoding is used when setCharsetOverride is not set? UTF-8 is the de facto standard used by nearly all websites, so why not default to UTF-8?

johnss changed the title from "Quoted Printable encoding error" to "UTF-8 as default charset when charset not defined" Feb 25, 2020

@zbateson
Owner

Hi @johnss,

It's not a bad suggestion -- my understanding is UTF-8 is fully backwards-compatible with ISO-8859-1. In researching this a bit, I couldn't find a reason not to default to UTF-8, but also it surprised me that Thunderbird defaults to ISO-8859-1 given they're fully compatible.

I think the ideal would be to have the default configurable rather than setting an override for a single email... and have the default configured charset UTF-8.

I'd be interested to hear from others more knowledgeable on this -- any reason why we shouldn't default to UTF-8?

@zbateson
Owner

zbateson commented Apr 10, 2020

Looking more closely at this, UTF-8 and ISO-8859-1 are only the same for 0-127 (ASCII). This causes problems if an email contains non-ASCII characters and expects the default to be considered ISO-8859-1 instead of UTF-8. Setting the default to UTF-8 causes tests/_data/emails/m0009 to fail, but not tests/_data/emails/m0008: m0009 is ISO-8859-1 encoded without specifying a charset, while m0008 is UTF-8 encoded. You can also see the difference in the files themselves. They contain the same text, but the UTF-8 variant uses multiple bytes to encode codepoints above 127, whereas the ISO-8859-1 variant uses a single byte.

Instead, an option to change the default could be made available, if you're interested in submitting a pull request.
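
To illustrate the point about the two encodings agreeing only on ASCII, here is a small sketch using PHP built-ins, assuming the mbstring extension is available. The sample string is made up for illustration and is not taken from the m0008 or m0009 test files.

```php
<?php
// "\u{E9}" is U+00E9 (e with acute accent), written as a UTF-8 string.
$utf8 = "caf\u{E9}";
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";   // 636166c3a9 (two bytes for U+00E9)
echo bin2hex($latin1), "\n"; // 636166e9   (one byte for U+00E9)

// The ASCII prefix is byte-for-byte identical in both encodings.
var_dump(substr($utf8, 0, 3) === substr($latin1, 0, 3)); // bool(true)

// The ISO-8859-1 bytes are not valid UTF-8, so decoding them with a UTF-8
// default cannot produce the intended text.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
```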

zbateson changed the title from "UTF-8 as default charset when charset not defined" to "Add configuration to set the default charset for content without a specified charset" Apr 10, 2020

@ThomasLandauer
Contributor

I read the RFC (see #133 (comment)) as saying that if you use non-ASCII characters, you must declare a charset in the Content-Type header. Right?

@zbateson
Owner

Yeah, although there's no harm in expanding that to either ISO-8859-1 or UTF-8, as they're identical for bytes 0-127 (ASCII).

@ThomasLandauer
Contributor

First: I'm not sure ->setCharsetOverride() is actually doing what you had in mind. I understood #133 (comment) to mean that it would only set a default charset (i.e. it only makes a difference if there is no charset declaration in the mail). However, it actually overrides whatever is defined in the mail.

Second:

> any reason why we shouldn't default to UTF-8?

To sum it up, the situation is: The RFC demands that you declare a charset if you use non-ASCII characters. And your question is: If somebody does not stick to this (i.e. no charset declaration), which one should you use as the default?

I wanted to provide some data for this from the mails I'm currently analyzing. (They're mostly German, so probably every single one does contain some non-ASCII characters.) Well, but since ->setCharsetOverride() isn't doing what I thought it would do (see above), there are no results ;-)
If you include a function that really just sets the default charset, I could try again.

@zbateson
Owner

The point is though, that you can check if a charset isn't set, and use setCharsetOverride if it isn't, thereby setting your own default charset using that.

@zbateson
Owner

I don't know why there are no results on your specific case and emails without further details of what you're doing.

@ThomasLandauer
Contributor

> and use setCharsetOverride if it isn't, thereby setting your own default charset using that.

Well, if I override the existing charset, it's not "default" anymore! Default means: If there is no value, use this one.

Do you want me to run this check at all? If yes, please give me the code part I'm missing: Check if there is a charset declared - for the entire message or just for the text/plain part.

@zbateson
Owner

> Well, if I override the existing charset, it's not "default" anymore! Default means: If there is no value, use this one.

We're running in circles a bit here 😛 .

I said "you can check if a charset isn't set, and use setCharsetOverride if it isn't"

You can call $part->getHeaderParameter('content-type', 'charset'); and check if the return value is null.
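
Put together, the check described above could look roughly like the sketch below. The function name applyDefaultCharset is hypothetical; the only library calls it relies on are the two named in this thread, getHeaderParameter and setCharsetOverride, and it assumes a null return means no charset was declared.

```php
<?php
/**
 * Applies $defaultCharset only when the part declares no charset.
 *
 * $part is expected to be a mail-mime-parser part object exposing the two
 * methods named in this thread. The function name is hypothetical.
 * Returns true if the default was applied.
 */
function applyDefaultCharset($part, string $defaultCharset = 'UTF-8'): bool
{
    // A null return is taken to mean that no charset parameter is present.
    if ($part->getHeaderParameter('content-type', 'charset') !== null) {
        return false; // leave declared charsets untouched
    }
    $part->setCharsetOverride($defaultCharset);
    return true;
}
```

For the HTML part from the earlier example, this would be called as applyDefaultCharset($message->getHtmlPart()) before reading $message->getHtmlContent().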

@ThomasLandauer
Contributor

I can now report from the emails I'm analyzing:
Before, 0.36% had text in UTF-8, but without a charset declaration (and were therefore displayed wrong).

With the code from #136 (comment) I now have 0.02% that have text in ISO-8859-1 (or similar) without a charset declaration (and are therefore displayed wrong).

So implementing what this issue asks for (a configuration to let the user set the default charset to e.g. UTF-8) is a good idea IMO, since it reduces the problem cases by more than a factor of 10.

Just for the record: It looks like most (German-speaking) companies that do not declare a charset send text in UTF-8 (rather than ASCII as the RFC says).
