What's in a name (also "reference", etc)? #448
Comments
With regard to your second point, are
expected to be treated differently by AT? (Offhand, I'd expect the same treatment) |
While I prefer permissive to restrictive, I agree that reserving some characters for future use is a good idea. Candidates might include the categories: keyboard-top-row, punctuation, fences, quotes, to the extent that those categories make sense and that they wouldn't reasonably be needed in normal mathematical phrases. |
I'd split the issue into two sub-components:
|
<mo>⍅</mo>
<mo intent="⍅">x</mo>
expected to be treated differently by AT? (Offhand, I'd expect the same treatment)
On the call I attempted to get clarification on that point.
My interpretation of the answer is that, unless there is a specific
indication that AT is allowed to ignore the intent, it will always use
the intent and ignore the actual content.
A similar example which is likely to arise is
<mo>×</mo>
<mo intent="x">*</mo>
Is intent="×" different than intent="times"?
Maybe the first is preferred, for internationalization.
|
I thought we could assume (even if it is not currently true) that core math blocks such as Mathematical Operators 22xx and perhaps Miscellaneous Technical 23xx would be "self voicing" with AT choosing a reading based on, but not necessarily equal to, the unicode name. Otherwise more or less every |
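A rough illustration of what "self voicing" from the Unicode name could look like, as a minimal sketch only. The function name `fallback_reading` is hypothetical, and real AT would be expected to choose a reading based on, but not necessarily equal to, the Unicode name; this sketch simply lowercases the name.

```python
import unicodedata


def fallback_reading(ch: str) -> str:
    """Derive a rough spoken fallback from the Unicode character name.

    Hypothetical helper: an actual AT would map names to better phrases.
    """
    try:
        name = unicodedata.name(ch)
    except ValueError:
        # Raised for code points that have no name (e.g. control characters).
        return "unknown character"
    return name.lower()


if __name__ == "__main__":
    for ch in "×∞⊞":
        print(ch, "->", fallback_reading(ch))
```

The point of the sketch is only that a name-based default exists for every named code point, whether or not the result is the desired speech.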
On the Edit: Specifically, the example appears to me as a good reason to avoid "unusual" uses of Unicode in places where one might expect letters in names. |
Yes, sorry: I meant the times symbol. I thought it was, but did not see
that I had copied the wrong character.
…On Fri, 17 Mar 2023, David Carlisle wrote:
Is intent="x" different than intent="times"?
Did you mean to use an x there, not × ?
As written, they are completely unrelated: the first would force a reading of "ex"; the second is presumably a core function, so will
be read however the system reads times.
|
It's one thing to say users should avoid exotic characters and not expect We can not simultaneously say that that default reading of Note this discussion relates mostly to the template variants. In the functional variant in the current spec we never so far allowed such characters in concept-or-literal. Although that does complicate slightly the "implied intent" for |
Let me try this more explicitly: MathML's token elements can take pretty much any Unicode as content; usually a letter or math operator, but possibly an emoji or anything. I assume that an AT will produce some speech for MathML with funny characters but without Wouldn't we expect the same speech for |
As long as we don't call these |
we don't have any currently open grammar proposal with a "concept" clause. The functional version from current spec has for 446 or the current spec I'd allow more or less anything in the grammar but restrict core and open concept lists to ncname so if you use symbols (just as if you use |
We probably will to only enter concepts that are restricted to (I see that, while I was typing, @davidcarlisle has edited his comment to say essentially the same thing, I think) |
Not sure why we have turned legalistic, but I assume the "current spec" being referred to is at:
I have mentioned this before, but - in my opinion - the Intent syntax does not need to cover the full realm of the allowed text content of MathML's token nodes, as it has the opportunity to be a partial annotation, rather than a complete replacement for the presentation tree. All of these are workable: <mo>⍅</mo>
<mo intent="leftwards-vane">⍅</mo>
<mo intent="_bar_arrow_pointing_left">⍅</mo> but using the Unicode character directly in the intent string has to be teased apart as a "character literal" / "character data" case that is voicing as per the Unicode character name (I think also referred to as "self-voicing"). If the group insists the full Unicode spec should be grammatical, that is fine by me - but let us try to separate it from |
not legalistic, but we need some name for the various grammars. After yesterday's call we reduced the choices a bit, and in #450 I suggest merging the open PR and getting down to 2. In #446 the names that could be looked up as concepts are in the current spec, literals and concepts are in the same grammatical construct and can't really be separated, currently NCNAME, but we could make it less restricted while keeping the restriction on dictionary entries to be NCNAME, just as in current examples |
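To make the "permissive grammar, NCNAME-only dictionary" idea concrete, here is a minimal sketch. It assumes an ASCII-only approximation of XML NCName (the real production also admits many non-ASCII letters), and the function name `is_dictionary_candidate` is hypothetical.

```python
import re

# ASCII-only approximation of XML NCName: starts with a letter or underscore,
# continues with letters, digits, underscore, period, or hyphen.
# Assumption: non-ASCII name characters are deliberately ignored here.
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_dictionary_candidate(name: str) -> bool:
    """Return True if `name` could appear in a concept dictionary
    under the NCNAME restriction (ASCII approximation)."""
    return NCNAME_ASCII.fullmatch(name) is not None


if __name__ == "__main__":
    for name in ["times", "leftwards-vane", "⍅", "#x"]:
        print(repr(name), is_dictionary_candidate(name))
```

Under this reading, a symbol such as ⍅ could still be grammatical in an intent value while never matching a dictionary entry.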
Re:
My understanding is that a literal is in fact a literal -- this is what should be spoken/sent to the speech engine. Hence, I see a difference between the content and literals. Content such as "⍅" and "/" are converted to words. Literals are passed to the speech engine unchanged (numbers might be adjusted for locale or the speech engine is told the locale). If literals are to be interpreted besides the already documented conversion of "-" and "_" to spaces, then that needs to be documented. Note that whereas "/" or "|" might have context AT can use to determine how to speak them, short of AI/natural language processing that is math-aware, there's no hope of using context. |
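The documented conversion mentioned here can be sketched as follows. This assumes the leading-underscore convention for literals seen in examples such as `_bar_arrow_pointing_left`; the function name `literal_to_speech_text` is hypothetical, and locale handling for numbers is not attempted.

```python
def literal_to_speech_text(literal: str) -> str:
    """Turn an underscore-prefixed literal into the words handed to a
    speech engine, converting "-" and "_" to spaces and nothing else."""
    if not literal.startswith("_"):
        # Assumption: only underscore-prefixed names are treated as literals.
        raise ValueError("not a literal: expected a leading underscore")
    words = literal.replace("-", " ").replace("_", " ").split()
    return " ".join(words)


if __name__ == "__main__":
    print(literal_to_speech_text("_bar_arrow_pointing_left"))
```

Any interpretation beyond this substitution (for example, expanding a symbol inside a literal into words) would be an extra rule that the spec would have to state.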
I can see that from mathCAT's perspective, sitting between the markup and an existing speech engine, but I don't think the spec layers things in that way. If I enter As such I'd expect That said, for the functional version in the spec, it's not currently allowed. If we wanted to allow it without allowing arbitrary unicode literals we could (as @dginev indicated above) allow single symbols while keeping them as a separate grammatical construct. So
Then a |
The only way to know when something starts and ends is spaces and a few delimiters, so it is hard to know how to break up arbitrary strings of characters without something like the |
A reality check... AT will send characters to speech engines. What happens with them varies on the speech engine, but they tend to only know a few. As a test, I tried
Try them yourself at NaturalReader and TTSReader. TTSReader does read ∞. There are other online TTS engines you can try. Those were the first two that showed up in a search I tried. AT may have some characters it knows about and converts them before sending them to the TTS engine. There are scripts/addons people develop to do the conversions for known problems with various chars/speech engines. Obviously, a math part of AT should be able to add words for the math parts of Unicode, but what's a math symbol? Certainly not 🐇 or an emoticon. MathCAT has about 4,000 math-related Unicode chars it knows about. SRE has about 2,700. I can't speak to JAWS or VoiceOver, but I suspect significantly fewer. In real life, not counting alphabets and digits, many fewer than 100 math characters are used in textbooks. For grade school algebra, it is around 10. So the lack of support of esoteric characters may be disappointing, but very few people will ever encounter the problem. For arXiv, I imagine there could be a problem. If someone (@dginev, hint, hint) does an arXiv search for all the Unicode chars used, I can extend MathCAT to cover them, although I am dubious that the speech will be the desired speech, since what I use will be based on the Unicode description. How does this relate to the intent character discussion? In my view |
Yes, not surprising in a way, but still I'd rather specify you may use math symbols, then warn they don't work with current speech engines. When we started MathML, we aggressively specified Unicode math element content, despite the fact Unicode math blocks had not been specified, no fonts existed that matched the proposals and basically nothing other than But time passes and things improve.... I can now drop ⊞ in this web page with a reasonable expectation you have a font that supports U+229E. But I think you have persuaded me that allowing everything except |
It's good to get some of these hidden assumptions & expectations out in the open, independent of which way we decide the details; probably more of them need to find their way into the spec to make our choices understandable. I've understood the task to be more about supporting speech, and disambiguation, rather than guaranteeing speech. If anything, the template-style approaches (like #446) emphasize the guarantee. Conversely, the function-style approaches (like the current spec) emphasize disambiguation while leaving the AT lots of room to adapt its output. In fact, most of our discussions have left me feeling like guaranteeing speech was undesirable, even if possible. Moreover, the "best" speech doesn't seem well enough defined for us to be specifying it; as good as MathCAT is, guaranteeing speech at this point would forbid future improvement. I've been thinking of Finally, @davidcarlisle's point about Unicode fonts is exactly relevant: whether or not we extend literal, weird, perhaps new, Unicode will end up in either the MathML or intent, and you'll get different results under different circumstances. |
I assume you only mean math mode chars - that list I haven't compiled yet, but I have an old (03.2021) list that has a shortlist of 561 characters used in arXiv (after latexml processing, normalized to plain-text), with relative frequency counts. It's a very informal list (includes both text and math content, and is limited by latexml's implemented coverage), but should be more representative than random: If/when they are all supported in MathCAT, I can do a more formal study of the latest data to uncover more. |
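A character survey of this kind could be approximated along the following lines. This is only a sketch under stated assumptions: it counts non-ASCII characters inside MathML token elements of a single document, the helper name `token_char_counts` is hypothetical, and it says nothing about how the actual arXiv list was produced.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# MathML token elements whose text content is inspected (assumed subset).
TOKEN_TAGS = {"mi", "mo", "mn", "mtext"}


def token_char_counts(mathml: str) -> Counter:
    """Count non-ASCII, non-space characters found in token elements."""
    counts = Counter()
    root = ET.fromstring(mathml)
    for el in root.iter():
        tag = el.tag.rsplit("}", 1)[-1]  # drop any "{namespace}" prefix
        if tag in TOKEN_TAGS and el.text:
            counts.update(
                ch for ch in el.text if not ch.isascii() and not ch.isspace()
            )
    return counts


if __name__ == "__main__":
    sample = "<math><mi>x</mi><mo>×</mo><mn>2</mn><mo>⊞</mo><mo>×</mo></math>"
    for ch, n in token_char_counts(sample).most_common():
        print(f"U+{ord(ch):04X} {ch} {n}")
```

Summing such counters over a corpus would give relative frequencies that could then be checked against the characters an AT already knows.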
@dginev: yes, I mean math mode, but probably not Asian chars that might be identifiers in an Asian contribution in arXiv. They are probably only useful for someone using an Asian speech engine (for that language) and it would know what to do with them. MathCAT has definitions for a little over 5,000 Unicode chars (common chars, uncommon chars). I tried ~10 of the less common chars in your list and they were all in MathCAT. If you get a complete list, I'll write some code to make sure they are all in MathCAT. |
Regardless of whether we go with the template syntax or the function syntax, there is an ongoing discussion about what characters are allowed in a name. This issue is here to focus on that topic and pulls in a thread from #446 which starts at this comment.
Note: the MathML full meeting today agreed that we should pull out number as a specific terminal that uses . as the decimal separator, so that is not in question. The discussion is focused on concept-or-literal, reference, property, literal, and name (depending on the version of the spec).
To maybe summarize some comments:
xml:id (without the restriction of them being unique)
#, @, etc) are characters that can't be used for some future extension of the grammar
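As a hedged illustration of the number terminal agreed on above, the following sketch recognizes digit strings with an optional fractional part separated by a period. The exact production (for example, whether a sign or a leading period is allowed) is an assumption here, and the names `NUMBER` and `is_number` are hypothetical.

```python
import re

# Assumed shape of the number terminal: digits, optionally followed by
# "." and more digits. The period is the only decimal separator.
NUMBER = re.compile(r"[0-9]+(?:\.[0-9]+)?")


def is_number(token: str) -> bool:
    """Return True if the whole token matches the assumed number terminal."""
    return NUMBER.fullmatch(token) is not None


if __name__ == "__main__":
    for token in ["42", "3.14", "3,14", "."]:
        print(repr(token), is_number(token))
```

Keeping number as its own terminal means the period never has to be considered when deciding which characters a name may contain.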