Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get decoded/unfolded semantically "full" header #122

Open
ThomasLandauer opened this issue Aug 4, 2020 · 32 comments
Open

Get decoded/unfolded semantically "full" header #122

ThomasLandauer opened this issue Aug 4, 2020 · 32 comments

Comments

@ThomasLandauer
Copy link
Contributor

I'm done with Date:, From: is next :-)

I have this:

From: =?ISO-8859-1?Q?f=F6=F6b=E4r?=
 <[email protected]>

And I want to get this:

fööbär <[email protected]>

Currently, there's no easy way to get it, right?

  • echo $message->getHeader('from'); gives:

    From: =?ISO-8859-1?Q?f=F6=F6b=E4r?=
     <[email protected]>
    

    See AbstractHeader::__toString()

  • $message->getHeader('from')->getRawValue() gives:

    =?ISO-8859-1?Q?f=F6=F6b=E4r?=
     <[email protected]>
    
  • $message->getHeaderValue('from') gives:

  • So all that's left is to use something like ->getAddresses()[0]->getName() and ->getAddresses()[0]->getEmail() and reassemble them manually, right?

php-mime-mail-parser/php-mime-mail-parser has a getHeader() which does exactly what I mean.

So I'm suggesting that besides giving the raw header, or the deconstructed parts, you should offer an "intermediate" function that gives the semantically full header (i.e. full header decoded and unfolded).

@zbateson
Copy link
Owner

zbateson commented Aug 4, 2020

Yes, I think your examples are right -- there's also a convenience method in AddressHeader to get the email/name of the first header, so you could $message->getHeader('from')->getPersonName() or ->getEmail() directly if memory serves me right 😝 but that's besides the point you're making hehe.

Again, not entirely against it -- so would it display kind of like an email client shows email addresses? "Blah user@email, Second user@secondemail"?

Perhaps AddressHeader::getDisplayValue or something?

@ThomasLandauer
Copy link
Contributor Author

Perhaps AddressHeader::getDisplayValue or something?

No, sorry, I didn't point that out clear enough: This is a general header issue, not just about From!

So if the email contains:

Subject: =?ISO-8859-1?Q?f=F6=F6b=E4r?=

There should be three functions to get:

  1. Subject: =?ISO-8859-1?Q?f=F6=F6b=E4r?=
  2. =?ISO-8859-1?Q?f=F6=F6b=E4r?=
  3. fööbär

For the naming, I'm proposing a new system:

All functions should follow the pattern ->getHeader()->getSomething()

  • ->getHeader()->getRawLine() for 1. The word "line" indicates that you'll get the full line, including Subject: . This would replace all occurrences of just ->getHeader() in README.md (cause Header is an object, and relying on __toString() is always "risky"). This would elegantly solve Remove header name from ->getHeader() #123 :-)
  • ->getHeader()->getRawContent() for 2. Same as ->getRawValue(), but I'd rename it for consistency - see below.
  • ->getHeader()->getContent() for 3.

  • I would drop (i.e. deprecate) ->getHeaderValue() cause it's inconsistent with this naming system. Besides, its current behavior is inconsistent between Subject and From headers.
  • The current ->getValue() is IMO dangerous/misleading, since for multiple addresses it returns just the first. So this should be dropped completely. ->getHeader()->getContent() should return the entire decoded content (as string). And for each single address (or other part), keep ->getParts()

@ThomasLandauer
Copy link
Contributor Author

Just noticed that RFC 5322 uses the term "body" for what I called "content":

Header fields are lines beginning with a field name, followed by a colon (":"), followed by a field body, and terminated by CRLF.

So maybe ->getBody() insted of ->getContent()? But this could lead to confusion with the email's body...

@zbateson
Copy link
Owner

zbateson commented Aug 5, 2020

I still don't see the value in changing __toString or it's meaning (the "string" representation of the header).

There should be three functions to get:

Subject: =?ISO-8859-1?Q?f=F6=F6b=E4r?=
=?ISO-8859-1?Q?f=F6=F6b=E4r?=
fööbär

But isn't there? __toString, getRawValue, getValue

I get that there's some inconsistency in getHeaderValue/getValue with addresses, but often for AddressHeader's you might only be interested in the first address, so I made it inconsistent so you didn't have to worry about AddressHeader potentially representing more than one address for a 'from' header for instance. I'm not sure what's worse though, both have their drawbacks imo :(

But yeah, that could've been a mistake. I'll give it some though. Maybe a convenient very explicit "getFirstAddress" "getFirstName" or something would've made more sense.

And yeah, if we're to change it we'd have to deprecate getHeaderValue and getValue and change them to getHeaderContent and getContent (or Body).

@ThomasLandauer
Copy link
Contributor Author

But isn't there? __toString, getRawValue, getValue

No, since for a To header, getValue() only returns (a) the first address, and (b) just the email part of that. This behavior is deceptive IMO, since it gives the false impression that the first address is all there is!

In which situations would somebody only be interested in the first address?

@zbateson
Copy link
Owner

zbateson commented Aug 5, 2020

I get what you're saying about AddressHeader being inconsistent with its getValue and acknowledge that. In your example though, Subject, getValue would give result 3.

I can think of quite a few examples where you might only care about a single address... the primary being if you're checking 'From' for an address, is it preferable to go if (count($fromHeader->getAddresses()) > 0 && strcasecmp($from->getAddresses()[0]->getEmail(), '[email protected]')) or is there value in being able to do if (strcasecmp($from->getValue(), '[email protected]')?

@ThomasLandauer
Copy link
Contributor Author

Yeah, sure the second is shorter - but how can you know in advance that the address you're looking for must be the first one?

@zbateson
Copy link
Owner

zbateson commented Aug 5, 2020

I think you'd rarely care if 'from' contained more than one address, I think that's very uncommon.

@ThomasLandauer
Copy link
Contributor Author

ThomasLandauer commented Aug 6, 2020

In your example though, Subject, getValue would give result 3.

Yeah, sorry, I missed that cause I was misled by the "errors" caused by #130

I think you'd rarely care if 'from' contained more than one address, I think that's very uncommon.

I was about to say that, but was afraid you'd come up with some mailing list (or whatever) example ;-)

Anyway: What about To and Cc?
The real point is: If getValue() (or getHeaderValue()) returns the entire content for most headers, it cannot be that it just returns just some part for some headers! That's really deceptive!

So what I'm mainly suggesting is to change the function's behavior. And since this is a BC break, you consequently need to change its name too. And this is a nice opportunity to rethink the entire naming concept.

@zbateson
Copy link
Owner

zbateson commented Aug 6, 2020

I get it and am not disagreeing on this one as well. I feel like 'getValue' is so standard that it might be preferred even if there's a problem with it though... and the deprecation and introduction of 'getContent' also seems at least as equally annoying to me as the problem it's trying to fix. Everyone will have to change their calls from getValue to getContent, and they're all already used to the existing functionality.

@ThomasLandauer
Copy link
Contributor Author

Well, I for myself don't care how you handle the transition :-) Semantic versioning probably demands the deprecation since it's a BC break.

BTW: How could you "send" a deprecation warning message at all? And where/how could users see it?

About the very name: The more I think of it, the more I'm torn towards getBody(). Why? Because it's the term the RFC uses! And nobody was ever fired for sticking to the RFC... ;-)

@zbateson
Copy link
Owner

zbateson commented Aug 6, 2020

The only deprecation I've done so far I've put up on the main readme (and main page of mail-mime-parser.org). I'd also put it up in the release notes for the specific version it changed in.

@ThomasLandauer
Copy link
Contributor Author

Here's the answer I was looking for: https://stackoverflow.com/a/194135/1668200

@zbateson
Copy link
Owner

zbateson commented Aug 6, 2020

Aah, I see what you're getting at -- yeah, I didn't do that. I only added an @deprecated annotation. That should be added to the deprecated functions I have too then 👍 even better in the comment on that one E_USER_DEPRECATED was mentioned as well.

@ThomasLandauer
Copy link
Contributor Author

Two more points about the naming:

  • Just noticed that $message->getHeaderValue('content-type') only returns text/plain if the full header is Content-Type: text/plain;charset=utf-8. So I think you're using getHeaderValue() for something like "give me the most important info from this header". It is OK to have a convenience function like that. But it shouldn't be called "getXValue" Can't think of a better name (yet), cause I don't know what it gives for other headers ;-)
    So are there more headers (besides To, From, and Content-Type) where getHeaderValue() doesn't return everything? If yes, I'd suggest to set up an overview table listing each header and each function and display what it returns. This will make it easier to find better names.

  • Since you're parsing the comments in the headers anyway, what about something like getBodyWithoutComments()?

@zbateson
Copy link
Owner

zbateson commented Aug 10, 2020

so I think you're using getHeaderValue() for something like "give me the most important info from this header"

Yeah, actually I think in that case it's different though too... if you consider the type of header Content-Type is, it's a header with optional additional 'parameters', so the value of the header is "text/plain", but it may have additional parameters like 'charset', which has a default value of 'us-ascii'... I think this is also different from my processing of emails which I agree is confusing.

You can see what I'm getting at here: RFC 1341 (I'm sure this must be mentioned in RFC822 and/or derivatives somewhere but I couldn't find it with a quick glance lol... anyway, specifically:

Content-Type header field value is defined as follows:
Content-Type := type "/" subtype *[";" parameter]

Note that the 'parameter' is optional, and can be different depending on type/subtype -- for multipart, it could be 'boundary', for content-type it could be charset, etc... the 'value' is one thing, and the 'parameters' are another. This is true for other parameterized headers too, like 'content-disposition'... it's 'value' could be 'attachment', and it could optionally have a 'filename' parameter for example.

Since you're parsing the comments in the headers anyway, what about something like getBodyWithoutComments()?

I think most people expect the comments to be removed in the value, they don't need to worry about that being explicitly stated imo.

@ThomasLandauer
Copy link
Contributor Author

the 'value' is one thing, and the 'parameters' are another

Well, another argument for "getBody()" :-)

I think most people ...

What I definitely can say is that most people don't know the difference between "value", "parameter", "comment" etc. for an email header! If I see something like getValue() I read it as "give me the stuff that's in there".

Comments:
A week ago I didn't even know that there is such a thing :-) The only encounter I had so far was the suffixed timezone name in the Date header (e.g. (GMT)) which sometimes causes problems, see php-mime-mail-parser/php-mime-mail-parser#328
If that's the only occurrence of comments in the wild, there's indeed no need for a getBodyWithoutComments(). Anybody interested in this piece of information could use getRawBody().

BTW: Is your decoding engine usable stand-alone? I.e. can I throw =?ISO-8859-1?Q?f=F6=F6b=E4r?= in somewhere?

@zbateson
Copy link
Owner

zbateson commented Aug 11, 2020

What I definitely can say is that most people don't know the difference between "value", "parameter", "comment" etc. for an email header! If I see something like getValue() I read it as "give me the stuff that's in there".

I'd say you're wrong about most people. Consider these examples:

https://javaee.github.io/javamail/docs/api/javax/mail/Header.html#getValue--

https://swiftmailer.symfony.com/docs/headers.html :

// note that value is one part, parameter is another
$headers->addParameterizedHeader(
  'Header-Name', 'header value',
  ['foo' => 'bar']
  );

The only instance of 'getBody' I've seen is here:

http://james.apache.org/mime4j/apidocs/org/apache/james/mime4j/stream/Field.html#getBody()

And in this case it's like my getRawValue.

BTW: Is your decoding engine usable stand-alone? I.e. can I throw =?ISO-8859-1?Q?f=F6=F6b=E4r?= in somewhere?

You should be able to do something like this if you want:

$di = new \ZBateson\Container();
$mlpf = $di->getMimeLiteralPartFactory();
$mlpf->newInstance('=?ISO-8859-1?Q?f=F6=F6b=E4r?=');

You could also create an instance of \ZBateson\MailMimeParser\Header\Part\MimeLiteralPart, pass it a \ZBateson\MbWrapper\MbWrapper instance as first parameter, and encoded word as second parameter. My example here though gives you options with dependency injection (in case future readers are doing that).

@ThomasLandauer
Copy link
Contributor Author

Yeah, sure "getBody" is uncommon. But what's your suggestion?? IMO getValue is ruined now, since you sometimes use it indifferently (just give me whatever is in there), and sometimes you use it specifically (value vs. parameter vs. comment). So I'm guessing it's easier to educate people to use a new function, rather than understand that an existing did change.

Besides: Since you follow the RFCs where ever possible, why not take the names from them too?

@zbateson
Copy link
Owner

I never "just give you whatever's in there"... it's never the equivalent of "getRawValue".

The only instance it's a little different is the way it gives the first email rather than 'all emails and names' in a string.

@zbateson
Copy link
Owner

(unless the header doesn't require any processing of course)

@zbateson
Copy link
Owner

How about I deprecate getValue only in AddressHeader, introduce a getCompleteValue (again just for AddressHeader) in a 1.3... then in 2.0 I deprecate getCompleteValue and restore getValue with the functionality of getCompleteValue.

It might affect a few people upgrading from an older version to the latest, but still less than deprecating all of 'getValue', introducing a new method like 'getBody' or 'getContent' which to me definitely implies the whole header instead of the value of it for something like a parameterized header, and also means most people can just be 'aware' of the change without changing every call to 'getValue' to something else.

I'd say I'm between that or living with the inconsistency, and I'll give it some thought.

@ThomasLandauer
Copy link
Contributor Author

(unless the header doesn't require any processing of course)

This is exactly the problem! To me, a header is a header. I don't know which headers need processing...

How about I deprecate getValue only in AddressHeader

And what about Content-Type?

Seriously: If you want to get a clear, clean nomenclature, please set up a wiki page (is this still possible on GitHub?) with a big table: All headers and all your functions; and in each cell, what the function returns. This will give us an overview which cases we need to cover. This table will ultimately move to README, so it's no lost work but a great documentation for the future :-)

And maybe we should also get Vincent on board, cause you said that was the original idea behind https://gitlab.com/mailcare/parser ? Might take a while till we reach a consensus, but on the other hand: If you two agree on a common interface, this is going to be the de-facto standard for any PHP mail parsing project!

@zbateson
Copy link
Owner

This is exactly the problem! To me, a header is a header. I don't know which headers need processing...

No, you misunderstood me...

Subject: this doesn't need any processing
Subject: =?utf-8?Q=but_this_does?=

I'm always processing headers, some look the same regardless though (first example)... in that case there's no difference between the two.

Seriously: If you want to get a clear, clean nomenclature, please set up a wiki page

Yup, I can set it up for you if you want to undertake that 👍

And maybe we should also get Vincent on board, cause you said that was the original idea behind

That's his project... he's not currently doing much with headers, I think he was more interested in using my header parser (when that was being discussed).

@zbateson
Copy link
Owner

You can have a look at what type of Header object is created for what type of header here:

https://github.com/zbateson/mail-mime-parser/blob/master/src/Header/HeaderFactory.php

The default is a "GenericHeader" which filters out comments and processes quoted parts.

@ThomasLandauer
Copy link
Contributor Author

Yup, I can set it up for you if you want to undertake that

OK, please!

@ThomasLandauer
Copy link
Contributor Author

I think most people expect the comments to be removed in the value

Some data I've just stumbled upon: For the Auto-Submitted header (RFC 3834), adding a comment seems to be quite common: 16% of the emails I'm analyzing are having one. And it indeed adds some value (pun!) to the information:

auto-generated (confirmation)
auto-replied (vacation)

@zbateson
Copy link
Owner

The information in that case is still available via getRawValue if you want it that way of course, but parts are accessible separately too via getParts https://mail-mime-parser.org/api/1.2/classes/ZBateson.MailMimeParser.Header.GenericHeader.html --

That returns all the parsed HeaderPart objects, including any CommentPart objects... although having a 'getComments' defined in AbstractHeader might make that easier.

@zbateson
Copy link
Owner

Hi @ThomasLandauer --

I've had a chance to look at this some more and think about it and the behaviour isn't quite the way you described it.

'getValue' actually consistently returns the value of the first part of a header. Consider:

  • ParameterHeader returns the value of the first part, e.g. 'text/html' for a Content-Type header, not the value of all parts
  • AddressHeader returns the value of the first part (the first part being an AddressPart or AddressGroupPart, the 'value' of an Address being the email address)
  • all other headers only contain a single part.

You can see that this is true in that AddressHeader doesn't override 'getValue', and neither does 'ParameterHeader'. The default implementation of getValue does this in AbstractHeader:

if (!empty($this->parts)) {
    return $this->parts[0]->getValue();
}
return null;

This actually does make sense to me and I think it's still my preferred way... it allows a user to not have to handle knowing if an address contains more than one address in some cases like with a From. If the user expects more than one address, they can specifically handle it.

Having said that, it may be useful to have a convenience method to get all the names/addresses in an email field after decoding, that displays them like Thunderbird would or something. I'm a bit hesitant though because it has to make some decisions for the user -- how should I display this email address for example: "My <weird> Name" <[email protected]> (or an address with mime-encoded non-ascii characters?)... I think what makes sense to me is to return the name always quoted, with non-ascii characters in mime encoded parts included in the quotes as well. It may not make sense or be useful to everyone though.

@fkoyer
Copy link

fkoyer commented May 25, 2022

I know this is an old topic but I am having a similar issue. There seems to be no way to get the "full" decoded and unfolded contents of a header. For example:

Header

This is the header as it appears in the rfc2822 message:

Authentication-Results: cf04.mxguardian.net;
	dkim=pass (2048-bit key)
	header.d=effectwebagency-com.20150623.gappssmtp.com

Desired output

What I would like to retrieve is the full unfolded header without the header name, like so:

cf04.mxguardian.net; dkim=pass (2048-bit key) header.d=effectwebagency-com.20150623.gappssmtp.com

Code

$result = $message->getHeader('Authentication-Results');
echo $result->getValue(),"\n\n";
echo $result->getRawValue(),"\n\n";
echo $result->__toString(),"\n\n";

Output

cf04.mxguardian.net

cf04.mxguardian.net;
        dkim=pass (2048-bit key)
        header.d=effectwebagency-com.20150623.gappssmtp.com

Authentication-Results: cf04.mxguardian.net;
        dkim=pass (2048-bit key)
        header.d=effectwebagency-com.20150623.gappssmtp.com

As you can see, none of these methods returns the complete header on one line. It seems like my best option is to use getRawValue and do the unfolding myself, however, getRawValue doesn't do any MIME decoding. So if I'm working with a header that is MIME encoded then getRawValue doesn't work and I have to use a different workaround.

Can we have a method such as getDecodedAndUnfoldedValue that works with any header?

I'm using version 2.1.0.

Thanks
Kent

@zbateson
Copy link
Owner

zbateson commented May 25, 2022

Hi @fkoyer,

Looks like you just need a quick regex for it: echo preg_replace('/\s+/', ' ', $result->getRawValue());.

Edit: should really read things properly before responding, sorry.

@fkoyer
Copy link

fkoyer commented Jun 1, 2022

Hi @zbateson,

Thanks for the quick response. Yes, I can definitely do the unfolding myself. It would just be handy to have a function that returns the full decoded and unfolded header text so that we don't have to do it every time we access a header. TBH, that's what I expected getValue() to do. It's too late to change that now but another convenience method would be a welcome addition.

Thanks
Kent

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants