double quotes " can break decoding #159

Open
markusramsak opened this issue Jan 5, 2021 · 6 comments

@markusramsak

The following simplified version of an original message cannot be parsed correctly because the closing quote in the "From:" line comes before the terminating ?= of the encoded word.

Delivered-To: [email protected]
Date: Thu, 10 Sep 2020 09:29:57 -0400
To: <[email protected]>
From: "Amway =?utf-8?q?=C3=96sterreich"?= <[email protected]>
Subject: Amway Newsletter Nr. 18 - 10. September 2020
Message-ID: <[email protected]>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: Quoted-Printable

If I move the closing quote after the ?=, it works.

Delivered-To: [email protected]
Date: Thu, 10 Sep 2020 09:29:57 -0400
To: <[email protected]>
From: "Amway =?utf-8?q?=C3=96sterreich?=" <[email protected]>
Subject: Amway Newsletter Nr. 18 - 10. September 2020
Message-ID: <[email protected]>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: Quoted-Printable

Please fix this so the parser can handle the first form as well.

@zbateson
Owner

zbateson commented Jan 5, 2021

Hi @markusramsak --

A quoted part takes precedence. Specifically, "An 'encoded-word' MUST NOT appear within a 'quoted-string'.", see https://tools.ietf.org/html/rfc2047#section-5

I believe what you're trying to say is the mime-encoded part isn't "decoded", but that's correct behaviour as far as I'm aware. It would be hard to build an exception for what you want without breaking what should be considered valid because the quotes are supposed to take precedence at least as far as I can tell.

Feel free to correct me with relevant examples, including handling by popular mail parsers or clients, or rfcs or other libraries that specifically are handling your situation differently to facilitate a discussion about it.

@markusramsak
Author

I know that it shouldn't happen, but I am the programmer of a mail client with more than 100,000 emails to parse and display, and the only thing I can say is: it happens.
I simplified the email, but the issue is real and occurs in every newsletter email from the company Amway (https://www.amway.at).

Other mail clients like Gmail or Apple Mail can decode this header correctly, and I would like to do the same.

Maybe it is just a matter of replacing "?=[space] with ?="[space], but I don't know if that would break anything.

@zbateson
Owner

zbateson commented Jan 5, 2021

Unfortunately the way the parser works, the 'part looking for quotes' is separate from the 'part looking for mime encoded parts'. It's semantically okay for a mime-encoded part to have a quote in it, it just won't be handled as a 'control character' terminating (or starting) a quoted-part.
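
To make the precedence concrete, here is a rough Python sketch of a "quotes first, encoded words second" pass. It is not the library's actual code (the library is PHP and its internals are not shown in this thread); the names `display_name` and `decode_q_words` are made up for illustration, only Q-encoding is handled, and the address `a@example.com` is a placeholder.

```python
import quopri
import re

# Hypothetical patterns for this sketch only.
QUOTED = re.compile(r'"([^"]*)"')                                # a quoted part
ENCODED_WORD = re.compile(r'=\?([^?\s]+)\?[qQ]\?([^?\s]*)\?=')   # Q-encoded word

def decode_q_words(text):
    """Decode every complete Q-encoded word found in text."""
    def repl(match):
        charset, payload = match.group(1), match.group(2)
        # header=True makes "_" decode to a space, as Q-encoding requires.
        raw = quopri.decodestring(payload.encode("ascii"), header=True)
        return raw.decode(charset)
    return ENCODED_WORD.sub(repl, text)

def display_name(header_value):
    """Find the quoted part first, then look for encoded words inside it."""
    match = QUOTED.search(header_value)
    if match is None:
        return decode_q_words(header_value.split("<")[0].strip())
    return decode_q_words(match.group(1))

broken = '"Amway =?utf-8?q?=C3=96sterreich"?= <a@example.com>'
fixed = '"Amway =?utf-8?q?=C3=96sterreich?=" <a@example.com>'
print(display_name(broken))  # the encoded word is cut off by the quote, so it stays raw
print(display_name(fixed))   # the encoded word is complete, so it is decoded
```

In the broken header the quote ends the quoted part before the ?= terminator is reached, so the second pass never sees a complete encoded word.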

@markusramsak
Author

If it can't be done on your side, then I would implement a step on my side that replaces these misplaced characters in the "From:" line before it is parsed by your parser.
I would call it "preparsing" because it happens before your more complex parsing.

@markusramsak
Author

By the way, you did an excellent job with this library! On average, about 9,995 out of 10,000 emails in my web mail client (backed by your library) can be parsed without any issues.

@zbateson
Owner

zbateson commented Jan 5, 2021

> if it can't be done on your side, then I would implement on my side to replace these wrong characters in the "From:" line before it is parsed by your parser.

I'm not sure that it can't, but it would be an effort -- I'd have to change the precedence of how things are parsed, which would make some valid but extremely unlikely cases invalid, like From: "My =?utf-8?Q?"weird"?= name" <[email protected]>... (i.e. purposely containing what looks like a mime-encoded part in a name) but I can't imagine that would ever be an issue... there may be other things affected too because of how the parsers are built, it would have to be investigated.

If you're able to sanitize for exceptions you know of like that, I think that would be the way to go at least for now... we can leave this open and look when there's time or if it's affecting more people. You could also try emailing the folks at Amway to tell them there's an issue with their emails :) maybe they're using a house-built system that needs to be patched, or maybe it's a huge commercial system that means handling this scenario should be prioritized.
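
The sanitizing step discussed above could look roughly like the Python sketch below. It is only an illustration under assumptions: the function name `move_quote_after_encoded_word` and the regular expression are hypothetical, the address `a@example.com` is a placeholder, and it is meant to run on a single header value before that value is handed to the parser.

```python
import re

# Hypothetical pattern: an encoded word whose closing quote landed just before
# the "?=" terminator, followed by whitespace or the end of the value.
MISPLACED_QUOTE = re.compile(r'(=\?[^?\s]+\?[bBqQ]\?[^?\s]*)"\?=(?=\s|$)')

def move_quote_after_encoded_word(header_value):
    """Rewrite ...text"?= as ...text?=" so the quote closes after the encoded word."""
    return MISPLACED_QUOTE.sub(r'\1?="', header_value)

broken = '"Amway =?utf-8?q?=C3=96sterreich"?= <a@example.com>'
print(move_quote_after_encoded_word(broken))
```

Values that are already well formed pass through unchanged. The trade-off mentioned above applies here too: a display name that deliberately contains something like "My =?utf-8?Q?"weird"?= name" would also be rewritten, so it is probably safest to apply this only to senders known to produce the broken form.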

> by the way you did an excellent job with this library! About 9995 out of 10000 emails can be parsed on average from my web mail client (backed by your library) without any issues.

Excellent, very happy to hear that!
