CSS font-matching algorithm may introduce fingerprinting issues #1202
The Timed Text Working Group just discussed
The full IRC log of that discussion<nigel> Topic: CSS font-matching algorithm may introduce fingerprinting issues imsc#530<nigel> github: https://github.com/w3c/imsc/issues/530 <nigel> Nigel: Did we actually introduce CSS font matching algorithm? <nigel> .. I see at https://w3c.github.io/imsc/imsc1/spec/ttml-ww-profiles.html#text-font-source <nigel> .. that we introduced: <nigel> .. "A Processor MAY use the [css-fonts-3] §5 font matching algorithm for associating a font with a run of text." <nigel> .. My question is, if this is an option, not a requirement, why wouldn't the CSS handling <nigel> .. of the privacy issue be implied by reference. <nigel> Pierre: Just to point out that in §10.5 we mention the CSS font matching algorithm <nigel> .. is also referenced via a defined term Font Matching Algorithm. <nigel> .. Editorially we should improve that. <nigel> Nigel: Right, and that's in the HRM section. <nigel> .. The HRM considerations are in my view concerned with document validation, and there's <nigel> .. no requirement for the presentation processor to follow any steps in the HRM to <nigel> .. render content. <nigel> .. I would not expect a user-oriented player to execute the steps of the HRM. <nigel> Andreas: +1 <nigel> Nigel: And therefore there's no privacy issue associated with 10.5. <nigel> .. That takes us back to 8.5.3. <nigel> Pierre: To your earlier point Nigel, I don't see what action we can reasonably take. <nigel> .. There are a lot of "mays" and "under discussion" and no proposed resolution. <nigel> -> https://lists.w3.org/Archives/Public/public-tt/2020Mar/0013.html Email that prompted this issue <nigel> Nigel: There are additional questions in the email that are not in the GitHub issue. <nigel> Pierre: We have generic text in TTML2 about loading of resources, I believe. <nigel> Glenn: There are some handwavy statements <nigel> Pierre: About resource fetching? <nigel> .. In the absence of specific concerns we can only offer generic guidance. 
<nigel> Glenn: Exactly. <nigel> .. I don't know what we can practically say. <nigel> Pierre: We can ask about specific issues with the TTML2 text. <nigel> Glenn: Ask for spec-ready text we can drop in. <nigel> Pierre: Exactly, that's what we should do. <nigel> .. We can't tell CSS and HTML how to do fingerprinting mitigation. <nigel> SUMMARY: TTWG thanks @npdoty for raising this. In the context of continuing discussions and without understanding any specific improvements we can currently make, we will proceed with no changes for the time being. <nigel> SUMMARY: Discussion of additional questions raised in the linked email to continue offline. |
We generally try to provide privacy and security guidance even for optional normative text that isn't required (MAY rather than MUST, for example). And we generally try to note privacy issues in all the places they appear, even if they might be mitigated or resolved in the future. It might also be that the fingerprinting risk that does apply with CSS in the Web context doesn't apply with processors of TTML/IMSC, but I haven't been able to determine that as I'm less clear on how these processor implementations work in connection with the Web platform. I think that would be a useful discussion to have (via email or teleconference) and might help us provide better guidance on #1189 as well. |
Answering some of the questions in your email @npdoty :
Can the origin server obtain the rendered text in any way?
There's nothing in TTML or IMSC about this - it would be an implementation feature beyond anything in the specification.
Can it see the height or size of the region?
The origin server, in providing the subtitle document, is defining the size of the region, and the size of the text within it. This does not give complete information about the rendered result, because text layout engines vary on a pixel-by-pixel basis, and because the used fonts may differ. Furthermore, as part of the document processing context, the user may have had the option to specify some overrides to the document-specified formatting. There is nothing in TTML or IMSC that defines any return path to the origin server for such overrides. Again, this would be an implementation-specific behaviour.
Are there conditional requests based on which fonts are available or if a region is overflowed?
No, there are no such conditional requests defined. TTML2 has a
Other fingerprinting opportunities
Going beyond your email, I've been wondering if there are other fingerprinting opportunities - please forgive my ignorance in this general area: I am very far from an expert in this privacy regime. At a real pinch, might it be possible to construct a "pathological" case in which a set of URLs is provided for a font resource, using the available fallback behaviour, and IMSC documents are authored such that the way those fallback URLs are requested reveals some information? This is fairly far-fetched and not well thought through right now. It would be easier to work against specific fingerprinting concerns than generic ones. In general any fingerprinting opportunity some malicious actor might be able to use would almost certainly be much easier to use through some other mechanism!
For example if the document is requested as part of playback of video media in the context of a web page, there are probably many opportunities to fingerprint within that web page already. It is hard to see why anyone would try to use some feature of IMSC document playback in this context. To make this point more concrete, consider a web-based video player: one could hook into some IMSC player feature to send reporting events back to an origin about the user's playback point, but there's no need to be so obtuse - there are plenty of opportunities in video player code to do this already regardless of the presence of subtitles or captions. Likewise, any IMSC player that supports some kind of beyond-the-specification customisation user interface can send reporting data on the usage of that interface directly back to an origin if it has been implemented to do so, whether or not IMSC document playback is actually taking place. |
For IMSC1.2 would it be sufficient to add an editorial note in 8.5.3 pointing to the (currently still open) w3c/csswg-drafts#4055 in CSS? |
@swickr It is still not clear to me that the attack vector indicated at w3c/csswg-drafts#4055 is relevant to IMSC. Specifically, it looks like the attack vector requires a malevolent script accessing the user's font list. Is that correct? If so, IMSC does not specify any such scripting capability and/or any API that would allow the user's font list to be accessed. @nigelmegitt at https://github.com/w3c/imsc/issues/530#issuecomment-601853057 suggests a different kind of attack where a malevolent site generates a large number of specially crafted IMSC documents referencing font resources on the malevolent site, with the objective of determining the user's font by observing which font resource the TTML processor attempts to download from the malevolent site. Is that worth mentioning? If so, this attack could be mentioned in IMSC 1.2 temporarily, and ultimately moved to TTML 2 since it applies to any TTML 2 profile that supports downloadable fonts. |
(In my understanding) fingerprinting relies on points of difference in the user environment that can sort a specific execution environment into groups, such as which language (Accept-Language) is configured in an instance. CSS font fingerprinting uses, as such a point of difference, which local font files are installed and available to web browsers etc., by configuring CSS (+JS if needed) to tell whether a specific font is loadable from HTML content. |
AFAIK we are not sure at this point what is the attack and how to mitigate it, so the best we can probably do today is add an editor's note merely pointing to this issue. See proposed note at w3c/imsc#532. We can then get to the bottom of the issue in the coming weeks. |
@palemieux that is not what I suggested: rather, I suggested that the user's location might be identifiable through this highly circuitous route. As I understand them, the semantics for downloading external font resources are completely independent of the installed fonts. In other words, if some text is styled with a font family that dereferences to an external font resource via a |
@swickr please could you give us more information about how such a note might be helpful? We generally try not to include speculative comments or references to not-concluded conversations in Recs if we can help it. In this case, the whole thread seems to refer to something that is a non-issue with IMSC 1.2 and TTML2, as far as I have been able to tell so far from the discussion. I wonder if anyone is able to describe succinctly what fingerprinting vector is in fact exposed by IMSC 1.2's use of the TTML2 I think this is key because in general, specifying something in a subtitle/caption document does not in itself reveal anything; only the execution of implementations can reveal anything, and in this case I have not been able to locate any processor semantic defined by the specification that could or would reveal anything about installed fonts. I would be happy to have it shown to me though, if there is one! |
We typically try to note security and privacy issues even if those issues also apply to other likely features (like a Web page that uses CSS and has a risk of fingerprinting): it provides guidance to implementers so that they know the trade-offs when implementing and it provides a marker of the problem so that if it's resolved in another spec, the remaining threat or vulnerability is documented. And I'm not sure about the distinction between the spec and the implementation. The privacy issues that we note in HTML or CSS or other Web specs only exist because they are implemented in particular software and the implemented software has typical (or optional or required) implementations that create privacy risks that we think are worth noting and mitigating. Definitions of markup languages can have relevant privacy considerations, even though they just define markup, based on how that markup will be consumed. In the case of CSS font fingerprinting, that's typically not based on just a direct JavaScript call, but on having the browser render some text in a particular font with a particular fallback and then testing the size of the resulting element (that's why I was asking about rendered text, size and conditionality, because those are methods often used in browser fingerprinting). Whether external resources are loaded or not is also a way for a constructed document to send a signal to an external server about the configuration of the user's machine. To the question from @palemieux and @nigelmegitt, I don't know whether specifying a font of a particular name and providing an external source for it would imply that it should be downloaded only if a font of that name is not present locally. It doesn't seem like that, but I'm not sure how to read it exactly. 
(There could be related issues about caching of resources (determining whether the user has viewed this content before based on whether those external resources are fetched or not) that are relevant to any markup of external resources that are cached with HTTP, but those are typically less severe and I don't know that we have a corresponding issue for you to refer to.) |
My concern is that we end up documenting generic vulnerabilities in IMSC. Such vulnerabilities are best described in a generic document -- just as WCAG documents generic accessibility requirements.
This is made possible by programmatic access to the DOM, completely independently of the characteristic of the source document, right? For example, it would apply to a text file or an image. In other words, the vulnerability is not created by source document, but by the platform that allows programmatic access to rendered content? |
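My understanding is that specifying an external source for a particular font name would in effect "hide" any locally-installed font of the same name. (This is certainly the case for the analogous case in HTML/CSS of font families defined via the @font-face rule.) The name then refers only to the external source.
However, it's still possible to "fingerprint" the locally-installed fonts, by a slightly indirect method: the document can specify a font-family list with two names, the first of which is the font name it is interested in probing, and the second is linked to an external source. So to detect whether, say, Zapfino is installed on the user's system, the document says something like tts:fontFamily="Zapfino,MyExternalResource", where MyExternalResource is defined via <font family="MyExternalResource"><source src="..."> to point back to a (non-cacheable) resource with a unique URL (e.g. with an appended fragment identifier used as a key) on the server. If that resource gets requested, then the server knows Zapfino was not installed.
By testing for the presence of a selection of font family names in this way, the server can potentially learn a lot about the user's installed font collection. |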
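To make the probe described above concrete, here is a minimal Python sketch. It is only a simulation under stated assumptions: the processor is assumed to resolve the font-family list lazily (it fetches an external font only when it reaches that name in the list), and the names `fetched_urls`, `probe` and the `probe.example` URLs are hypothetical, not taken from TTML2, IMSC or any implementation. A query parameter stands in for the per-probe key.

```python
from typing import Dict, List, Set


def fetched_urls(font_family: List[str], installed: Set[str],
                 external: Dict[str, str]) -> List[str]:
    """Simulate a hypothetical *lazy* processor resolving a font-family list.

    Walks the list in order and stops at the first usable font. An external
    font is fetched only if the walk reaches its name (assumption).
    """
    requests: List[str] = []
    for name in font_family:
        if name in external:
            # A name with an external source refers only to that source.
            requests.append(external[name])
            break
        if name in installed:
            # A locally installed font satisfies the request; no fetch.
            break
    return requests


def probe(candidates: List[str], installed: Set[str]) -> Dict[str, bool]:
    """Server-side inference: no request for the unique URL => font installed."""
    inferred: Dict[str, bool] = {}
    for name in candidates:
        url = f"https://probe.example/font?key={name}"  # hypothetical unique URL
        seen = fetched_urls([name, "MyExternalResource"], installed,
                            {"MyExternalResource": url})
        inferred[name] = url not in seen
    return inferred


if __name__ == "__main__":
    print(probe(["Zapfino", "Arial"], installed={"Zapfino"}))
```

Under these assumptions the server learns one bit per probed family name, without measuring or reading back any rendered text.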
Thanks @jfkthame that really helps to explain the mechanism for fingerprinting. I'm not clear whether TTML2 and IMSC can suffer from that mechanism, but it certainly seems plausible if not likely. |
(Just to be clear, that's not the only font-related fingerprinting mechanism; I believe the strategy of measuring the rendered size of a string of text in a particular font, and/or containing specific "interesting" Unicode characters, is currently the commonly-used method. But the approach outlined above is particularly interesting in that it does not depend on using APIs to measure or examine the rendered text, so it's immune to some suggested mitigations such as spoofing measurement results.) |
On Fri, Mar 27, 2020 at 4:52 AM jfkthame ***@***.***> wrote:
To the question from @palemieux <https://github.com/palemieux> and
@nigelmegitt <https://github.com/nigelmegitt>, I don't know whether
specifying a font of a particular name and providing an external source for
it would imply that it should be downloaded only if a font of that name is
not present locally. It doesn't seem like that, but I'm not sure how to
read it exactly.
My understanding is that specifying an external source for a particular
font name would in effect "hide" any locally-installed font of the same
name. (This is certainly the case for the analogous case in HTML/CSS of
font families defined via the @font-face rule.) The name then refers
*only* to the external source.
However, it's still possible to "fingerprint" the locally-installed fonts,
by a slightly indirect method: the document can specify a font-family
*list* with two names, the first of which is the font name it is
interested in probing, and the second is linked to an external source.
So to detect whether, say, Zapfino is installed on the user's system, the
document says something like tts:fontFamily="Zapfino,MyExternalResource",
where MyExternalResource is defined via <font
family="MyExternalResource"><source src="..."> to point back to a
(non-cacheable) resource with a unique URL (e.g. with an appended fragment
identifier used as a key) on the server. If that resource gets requested,
then the server knows Zapfino was *not* installed.
This is not a reliable test mechanism. Zapfino may be installed but not
used for a variety of reasons and MyExternalResource subsequently
referenced. For example, Zapfino may not have a glyph that corresponds to a
character being rendered. Or the font selection strategy may require a
contextual character sequence be mapped that is only available in the
external resource but not Zapfino. Or the writing mode may be vertical
mode, and only MyExternalResource supports vertical metrics. I could cite
dozens of other reasons why Zapfino might be ruled out by a client but
still loaded before moving on to the external resource.
… By testing for the presence of a selection of font family names in this
way, the server can potentially learn a lot about the user's installed font
collection.
|
A site using such a mechanism to accomplish installed-font fingerprinting would presumably apply the "test" styling to specific simple content such as a single ASCII character in horizontal writing mode, so that such considerations aren't relevant. In addition, the fact that a fingerprinting mechanism may not be 100% reliable is not sufficient to prevent malicious sites using it, or to protect users. It just needs to work fairly well much of the time in order to be a significant threat. |
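The reliability point can be illustrated with a small Python sketch. It assumes, purely for illustration, that a processor falls through to the external resource whenever the probed font is missing or lacks a glyph for any character in the styled text. The function name `fetches_external` and the coverage data are hypothetical.

```python
from typing import Dict, Set


def fetches_external(text: str, probe_font: str,
                     installed_coverage: Dict[str, Set[str]]) -> bool:
    """Return True if a hypothetical processor would request the external font.

    installed_coverage maps each installed font name to the set of characters
    it has glyphs for (made-up data for this sketch).
    """
    coverage = installed_coverage.get(probe_font)
    if coverage is not None and all(ch in coverage for ch in text):
        return False  # the installed font covers the text; no request is made
    return True  # font missing, or a glyph is missing: fall through to external


if __name__ == "__main__":
    installed = {"Zapfino": set("ABCabc")}
    # Installed font, uncovered character: the request is made anyway, so the
    # server would wrongly infer "not installed".
    print(fetches_external("字", "Zapfino", installed))
    # A single covered ASCII character avoids that ambiguity.
    print(fetches_external("A", "Zapfino", installed))
    # The font is genuinely absent.
    print(fetches_external("A", "Missing", installed))
```

In this model, restricting the test content to a single simple character removes the glyph-coverage source of false "not installed" signals, though other reasons for skipping an installed font are not modelled here.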
This still depends on reliance upon a heuristic that implementations choose to implement lazy fetch algorithms, which is entirely implementation dependent, i.e., outside the realm of the entire set of TTML and IMSC specifications. |
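The dependence on fetch strategy can be sketched in Python as follows. Both strategies are assumptions for illustration: "lazy" fetches an external font only when font-family resolution reaches it, while "eager" fetches every external font the document declares. Neither is stated here to be what any TTML or IMSC implementation does, and `requests_made` and the URL are hypothetical.

```python
from typing import Dict, List, Set


def requests_made(font_family: List[str], installed: Set[str],
                  external: Dict[str, str], lazy: bool) -> List[str]:
    """Return the external font URLs a hypothetical processor would request."""
    if not lazy:
        # Eager: fetch all declared external fonts, whatever is installed.
        return sorted(external.values())
    for name in font_family:
        if name in external:
            return [external[name]]  # resolution reached the external font
        if name in installed:
            return []  # a local font satisfied the request first
    return []


if __name__ == "__main__":
    family = ["Zapfino", "MyExternalResource"]
    ext = {"MyExternalResource": "https://probe.example/font?key=Zapfino"}
    for lazy in (True, False):
        with_font = requests_made(family, {"Zapfino"}, ext, lazy)
        without_font = requests_made(family, set(), ext, lazy)
        # The server can distinguish the two users only if the request
        # patterns differ.
        print("lazy" if lazy else "eager", with_font != without_font)
```

In this model the eager strategy produces the same requests for every user, so the request pattern carries no information about installed fonts. The cost is that fonts which end up unused are still downloaded.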
@npdoty See PR for your review. |
@nigelmegitt re: #1202 (comment), it is not clear to me that the comments of @samuelweiler represent a consensus PING position or represent his personal opinion; in any case, there are many precedents that permit us to decline to process his request for a normative change; we can simply resolve this by stating that the TTWG position is not to satisfy the requested change at this time; nothing in the process forces us to accept the change (in general or in the context of this specific CR); |
@skynavga I agree, that is a possible course of action. As Chair, I am attempting to ensure that we have exhausted all routes to getting to a consensus view, and that includes @samuelweiler 's view regardless of whether it is a PING position or a personal one. If I am satisfied that we have exhausted all routes, then that only leaves the option that you describe. |
@npdoty, @samuelweiler As I understand it, the motivation of PING is to provide privacy and security guidance, in this case on strategies to avoid fingerprinting issues in the context of font downloading. As I understand the discussion in #1203, one of the questions is whether the guiding text is made normative. @npdoty proposed the following:
As the overall goal is to guide implementers in the right direction, could the following be an alternative (added as a Note):
This has essentially the same meaning (it uses the definition of SHOULD NOT in https://tools.ietf.org/html/rfc2119). The only difference is that it does not use normative keywords. But the text may highlight the guiding aspect even better? |
@npdoty , @samuelweiler One additional option could be a more detailed guideline on how to avoid the fingerprinting on MDN (e.g. as a separate page in the IMSC chapter, https://developer.mozilla.org/en-US/docs/Related/IMSC). This would have the advantage that solutions can be updated more frequently, security and TTML experts could work collaboratively on it and (at least in my opinion) the reach to implementers would possibly be better than in the specification itself. |
@TairT re: #1202 (comment), I can accept your proposed language provided that: (1) change "It is strongly encouraged to NOT" to read "It is recommended that the document processing context not", (2) change "the case was" to "the case is", and (3) appendix P remains non-normative. I should point out that we have precedent (in five notes) for the language "it is recommended" in other non-normative contexts in the specification text. |
Note that "document processing context" here should be linked to the terminology section, i.e.,
|
The Timed Text Working Group just discussed
The full IRC log of that discussion<nigel> Topic: CSS font-matching algorithm may introduce fingerprinting issues #1202 (PING review)<nigel> github: https://github.com//issues/1202 <nigel> Nigel: Some activity to report: <nigel> .. 1. Sam got back to me earlier today or late yesterday proposing times for a joint meeting. <nigel> .. 2. Andreas proposed an alternative, stronger-sounding wording, which Glenn thought <nigel> .. could work modulo a couple of editorial tweaks. <nigel> .. Sam proposed 1:45pm Eastern. That's a little late for me, he suggested the earliest <nigel> .. possible time would be 1:30pm Eastern, but next week might work too. <nigel> .. For a half hour call. <nigel> .. I will respond to explore the options for a suitable time. Possibly it will be next week. <nigel> .. I will propose a doodle, since several people may want to attend. <nigel> .. Hopefully this will allow us to understand each others' objectives and constraints and <nigel> .. work towards a consensus solution. <nigel> .. Thank you Andreas for your proposals too. They look good to me also. <nigel> Andreas: No response to my comments, other than from Glenn. <nigel> Nigel: Good, let's hope that we have a path out of this. <nigel> SUMMARY: @nigelmegitt to respond to Sam regarding a joint meeting, to try to arrange it. |
The Timed Text Working Group just discussed
The full IRC log of that discussion<nigel> Topic: CSS font-matching algorithm may introduce fingerprinting issues #1202<nigel> github: https://github.com//issues/1202 <nigel> Nigel: I finally got round to setting up a doodle for this, not everyone has been able to <nigel> .. respond yet. <nigel> Pierre: Unfortunately I cannot make the two current most likely dates. It looks like Sam has the most restricted availability. <nigel> Andreas: I agree with Pierre, Sam's availability is most restricted, so maybe we should ask <nigel> .. him for some proposed slots in the next two weeks? <nigel> Nigel: Good idea, I will. <nigel> SUMMARY: @nigelmegitt to ask @samuelweiler for additional proposed slots. <nigel> Andreas: I wonder if our meeting would be an option too? <nigel> Pierre: Regrets from me for Thursday 23rd July, most likely. I'd be available following the meeting. <nigel> Nigel: That's an option I could add. |
Discussed on a call on 2020-07-27, minutes at https://www.w3.org/2020/07/27-tt-minutes.html Chair's summary, based also on @plehegar 's statements at the end:
|
I've drafted a new PR (#1210) that attempts to address the comments from PING, but without going as far as making the language normative. Nonetheless, I have included language "should consider not", which, in the present context (Appendix P), has a non-normative status. I would be willing to go as far as changing this to "should not" if folks prefer that. N.B. As I mentioned on today's call, we have precedent for using the language "should not" in non-normative text, so doing so would not introduce new precedent. |
Why the worry about precedents? |
@samuelweiler because we are a WG with 17 years of history which includes a history of established consensus about how to write specifications, what should and should not go into specifications, how testing is approached and a myriad of other details the sum of which form the basis for what traditional standards development organizations, such as ITU, ISO, ANSI, and others consider fair and best practice; in other words, it's our body of convention; the PING, the IETF, other SDOs, as well as individual editors, have their own conventions... you will find many distinct conventions within the W3C; for example, the HTML WG was comfortable publishing a spec (HTML5) that was largely untested and perhaps untestable in a significant way; however, the TTWG has not been comfortable in doing so, as was mentioned by @nigelmegitt in our recent call: that represents a difference of convention, or, a difference in the role of precedent as it were |
Draft language to address font fingerprinting mitigation (#1202).
Review of TTML2 2nd Edition noted many potential fingerprinting vectors: #1189
(Whether those issues present a privacy risk depends on a clearer understanding of what information is revealed by content processors to whom.)
Addition of external font loading and the CSS font-matching algorithm could introduce those fingerprinting issues to IMSC 1.2.
Mitigations for fingerprinting in CSS are under discussion now in CSSWG and PING.
More info in email: https://lists.w3.org/Archives/Public/public-privacy/2020JanMar/0055.html