What's in a name (also "reference", etc)? #448

Open
NSoiffer opened this issue Mar 16, 2023 · 22 comments
Labels
intent Issues involving the proposed "intent" attr

Comments

@NSoiffer
Contributor

Regardless of whether we go with the template syntax or the function syntax, there is an ongoing discussion about what characters are allowed in a name. This issue is here to focus on that topic and pulls in a thread from #446 which starts at this comment.

Note: the MathML full meeting today agreed that we should pull out number as a specific terminal that uses . as the decimal separator, so that is not in question. The discussion is focused on concept-or-literal, reference, property, literal, and name (depending on the version of the spec).

To maybe summarize some comments:

  • being freer with what is allowed means fewer errors that can be made
  • for at least literals, speech engines won't know what to do with non letters/digits (e.g., U+2345 "⍅") so results could be unexpected
  • reference names should be like an xml:id (without the restriction of them being unique)
  • Unicode defines a regexp for "identifier names"
  • any of the characters allowed (#, @, etc) are characters that can't be used for some future extension of the grammar
@NSoiffer added the intent label (Issues involving the proposed "intent" attr) on Mar 16, 2023
@brucemiller
Contributor

With regard to your second point, are

<mo>⍅</mo>
<mo intent="⍅">x</mo>

expected to be treated differently by AT? (Offhand, I'd expect the same treatment)

@brucemiller
Contributor

While I prefer permissive to restrictive, I agree that reserving some characters for future use is a good idea. Candidates might include the categories: keyboard-top-row, punctuation, fences, quotes, to the extent that those categories make sense and that they wouldn't reasonably be needed in normal mathematical phrases.

@dginev
Contributor

dginev commented Mar 17, 2023

I'd split the issue into two sub-components:

  1. property and concept are the two categories that need name-specific considerations (:system-of-equations, binomial-coefficient). I have always been fond of dash-separated names for that, as in

    IntentName  := Letter (Letter | '-' Letter)*
  2. The other categories Neil enumerated may be closer to system data than natural names (such as xml:id or number formats). Bruce pointed out that reference/arg values fit closer to ids than names.
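
The dash-separated rule in item 1 can be checked mechanically. A minimal Python sketch, assuming Letter means any Unicode letter (as reported by str.isalpha) and using a hypothetical helper name, is_intent_name, that appears in no spec draft:

```python
def is_intent_name(s: str) -> bool:
    """Check s against: IntentName := Letter (Letter | '-' Letter)*

    Assumption: "Letter" is any Unicode letter, i.e. str.isalpha() is True.
    Splitting on '-' must leave only non-empty, all-letter pieces, which
    rules out leading, trailing, and doubled dashes.
    """
    return all(part.isalpha() for part in s.split("-"))


if __name__ == "__main__":
    for candidate in ["binomial-coefficient", "system-of-equations", "_foo", "a--b", "x2"]:
        print(candidate, is_intent_name(candidate))
```

Under this reading, names with digits or underscores fall outside IntentName, which matches the idea that such strings belong to the other categories.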

@davidfarmer
Contributor

davidfarmer commented Mar 17, 2023 via email

@davidcarlisle
Collaborator

for at least literals, speech engines won't know what to do with non letters/digits (e.g., U+2345 "⍅") so results could be unexpected

I thought we could assume (even if it is not currently true) that core math blocks such as Mathematical Operators 22xx and perhaps Miscellaneous Technical 23xx would be "self voicing" with AT choosing a reading based on, but not necessarily equal to, the unicode name. Otherwise more or less every <mo> is going to need an intent to be read at all.
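
A reading "based on, but not necessarily equal to, the unicode name" could start from something like the following Python sketch. The function name fallback_reading and the lowercasing and hyphen handling are assumptions made for illustration, not a description of what any AT does:

```python
import unicodedata


def fallback_reading(ch: str) -> str:
    """Derive a default reading for one character from its Unicode name.

    Assumption: lowercasing the name and turning hyphens into spaces is a
    good enough starting point; a real AT would likely refine the wording.
    """
    name = unicodedata.name(ch, "")  # "" when the character has no name
    if not name:
        return f"character U+{ord(ch):04X}"
    return name.lower().replace("-", " ")


if __name__ == "__main__":
    for ch in "→∞≤⍅":
        print(ch, "->", fallback_reading(ch))
```

The output for less common characters is only as good as the Unicode name, which is why the reading is "based on" the name rather than equal to it.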

@dginev
Contributor

dginev commented Mar 17, 2023

On the intent="⍅", do we know if the Unicode situation is expected/desired to be dramatically improved in the broader AT engines @NSoiffer ? I remember a 2019 example that illustrated rather nicely how rocky things can get with text+audio here.

Edit: Specifically, the example appears to me as a good reason to avoid "unusual" uses of Unicode in places where one might expect letters in names.

@davidfarmer
Contributor

davidfarmer commented Mar 17, 2023 via email

@davidcarlisle
Collaborator

davidcarlisle commented Mar 17, 2023

Edit: Specifically, the example appears to me as a good reason to avoid "unusual" uses of Unicode in places where one might expect letters in names.

It's one thing to say users should avoid exotic characters and not expect <mo>&#x1FACF;</mo> (added Unicode 15, last year) gets read as "donkey", but another for the grammar to ban them.

We cannot simultaneously say that the default reading of <mo> is its content, but that we cannot use most characters that are used in mo as string templates because they cannot be read.

Note this discussion relates mostly to the template variants. In the functional variant in the current spec we never so far allowed such characters in concept-or-literal. Although that does complicate slightly the "implied intent" for <mo>⍅</mo> as it can't simply be the content and has to be something like intent="leftwards-vane" or intent="are-you-really-using-APL"

@brucemiller
Contributor

Let me try this more explicitly: MathML's token elements can take pretty much any Unicode as content; usually a letter or math operator, but possibly an emoji or anything. I assume that an AT will produce some speech for MathML with funny characters but without intent, eg. <mo>⍅</mo>. I don't particularly care what speech.

Wouldn't we expect the same speech for <mo intent="⍅">x</mo>, whatever it is? If so, I don't see why we need to restrict it.

@dginev
Contributor

dginev commented Mar 17, 2023

Wouldn't we expect the same speech for <mo intent="⍅">x</mo>, whatever it is? If so, I don't see why we need to restrict it.

As long as we don't call these concepts, yes. They appear to fall closer to the "data" realm, i.e. "character literals".

@davidcarlisle
Collaborator

davidcarlisle commented Mar 17, 2023

As long as we don't call these concepts, yes. They appear to fall closer to the "data" realm, i.e. "character literals".

we don't have any currently open grammar proposal with a "concept" clause.

The functional version from the current spec has concept-or-literal, which is NCNAME and so does not allow math symbols (which makes it work better as a concept than a literal), while 446 has literal in various versions either allowing or not allowing symbols. Neither of the current MathCAT implementations allows symbols in intent.

for 446 or the current spec I'd allow more or less anything in the grammar but restrict core and open concept lists to ncname so if you use symbols (just as if you use _) you know it is a literal.

@brucemiller
Contributor

We probably will want to enter only concepts that are restricted to NCNAME into a Core dictionary, or even the Open dictionary. But (in the current spec) anything not recognized as a concept is effectively a literal. And since the AT is going to have to deal with funny characters anyway (those in token content that aren't overridden by intent), I don't see much benefit in forbidding them from concept-or-literal (in its various forms).

(I see that, while I was typing, @davidcarlisle has edited his comment to say essentially the same thing, I think)

@dginev
Contributor

dginev commented Mar 17, 2023

Not sure why we have turned legalistic, but I assume the "current spec" being referred to is at:
https://w3c.github.io/mathml/#mixing_intent_grammar

and other characters that don't conform to NCName are not grammatical in that proposal. In which case they would be delegated to some (I think currently unspecified?) "error recovery" procedure.

I have mentioned this before, but - in my opinion - the Intent syntax does not need to cover the full realm of the allowed text content of MathML's token nodes, as it has the opportunity to be a partial annotation, rather than a complete replacement for the presentation tree.

All of these are workable:

<mo>⍅</mo>
<mo intent="leftwards-vane">⍅</mo>
<mo intent="_bar_arrow_pointing_left">⍅</mo>

but using the Unicode character directly in the intent string has to be teased apart as a "character literal" / "character data" case that is voicing as per the Unicode character name (I think also referred to as "self-voicing").

If the group insists the full Unicode spec should be grammatical, that is fine by me - but let us try to separate it from concept, property and the underscore speech literals. Maybe character literals or character data fit better. I am implying adding one more case with the same kind of treatment that the current grammar gives to number.

@davidcarlisle
Collaborator

davidcarlisle commented Mar 17, 2023

Not sure why we have turned legalistic,

not legalistic, but we need some name for the various grammars. After yesterday's call we reduced the choices a bit, and in #450 I suggest merging the open PR and getting down to 2.

In #446 the names that could be looked up as concepts are properties, which are a separate grammatical class (and in the original proposal there restricted to NCNAME). literal, which gets spoken, was unrestricted in @brucemiller's proposal but is just letters and - in mathcat currently

in the current spec, literals and concepts are in the same grammatical construct and can't really be separated, currently NCNAME, but we could make it less restricted while keeping the restriction on dictionary entries to be NCNAME, just as in current examples _foo is a valid concept-or-literal but guaranteed not to be a known concept

@NSoiffer
Contributor Author

Re:

Are

<mo>⍅</mo>
<mo intent="⍅">x</mo>

expected to be treated differently by AT? (Offhand, I'd expect the same treatment)

My understanding is that a literal is in fact a literal -- this is what should be spoken/sent to the speech engine. Hence, I see a difference between the content and literals. Content such as "⍅" and "/" are converted to words. Literals are passed to the speech engine unchanged (numbers might be adjusted for locale or the speech engine is told the locale).

If literals are to be interpreted besides the already documented conversion of "-" and "_" to spaces, then that needs to be documented. Note that whereas "/" or "|" might have context AT can use to determine how to speak them, short of AI/natural language processing that is math-aware, there's no hope of using context.
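
For illustration, the documented conversion amounts to something like this Python sketch. The function name is hypothetical, and collapsing the resulting runs of spaces is an added assumption; everything else in the literal passes through unchanged:

```python
def literal_to_speech_text(literal: str) -> str:
    """Turn an intent literal into the text handed to the speech engine.

    Only '-' and '_' are converted (to spaces). Collapsing repeated or
    leading/trailing spaces afterwards is an assumption of this sketch.
    No other character is interpreted or replaced.
    """
    spaced = literal.replace("-", " ").replace("_", " ")
    return " ".join(spaced.split())


if __name__ == "__main__":
    for lit in ["_bar_arrow_pointing_left", "leftwards-vane", "⍅"]:
        print(repr(lit), "->", repr(literal_to_speech_text(lit)))
```

Under this model a literal such as "⍅" reaches the speech engine as the bare character, which is where the difference from token content shows up.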

@davidcarlisle
Collaborator

@NSoiffer

Content such as "⍅" and "/" are converted to words. Literals are passed to the speech engine unchanged

I can see that from mathCAT's perspective, sitting between the markup and an existing speech engine, but I don't think the spec layers things in that way.

If I enter <mi>a</mi><mo>→</mo><mi>b</mi> and it gets read as a right arrow b I'm happy and don't really expect to know if mathcat picked up the → and sent right arrow to the speech engine, or if → got sent to the speech engine.

As such I'd expect <mo>→</mo>, <mo intent="→">→</mo> and <mo intent="right arrow">→</mo> all to work.
If in practice <mo intent="→">→</mo> doesn't work, we should warn users to avoid that, but I don't think we should prevent it working.

That said, for the functional version in the spec, it's not currently allowed. If we wanted to allow it without allowing arbitrary unicode literals we could (as @dginev indicated above) allow single symbols while keeping them as a separate grammatical construct.

So

term               := concept-or-literal | number | symbol | reference
concept-or-literal := NCName
number             := '-'? \d+ ( '.' \d+ )?
symbol             := [\p{Sm}]

Then a symbol is like a literal except it's a single Unicode Math character (exact characters classes to be determined) and you generate speech from a symbol as you would from element content.
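
A rough Python sketch of how a single term might be classified under this grammar. Several things here are assumptions rather than anything settled in the thread: NCName is approximated by an ASCII-only pattern, number uses ASCII digits, reference is taken to be '$' followed by a name, and symbol is exactly one character of Unicode general category Sm.

```python
import re
import unicodedata
from typing import Optional

# ASCII digits only; the grammar's \d is read narrowly here (assumption).
NUMBER = re.compile(r"-?[0-9]+(?:\.[0-9]+)?")
# ASCII-only approximation of NCName; real NCName allows many more characters.
NCNAME_APPROX = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def classify_term(term: str) -> Optional[str]:
    """Return the grammar class of a term, or None if it matches nothing."""
    if NUMBER.fullmatch(term):
        return "number"
    # Assumption: a reference is written as '$' followed by a name.
    if term.startswith("$") and NCNAME_APPROX.fullmatch(term[1:]):
        return "reference"
    if NCNAME_APPROX.fullmatch(term):
        return "concept-or-literal"
    # symbol: a single character whose general category is Sm (math symbol).
    if len(term) == 1 and unicodedata.category(term) == "Sm":
        return "symbol"
    return None


if __name__ == "__main__":
    for t in ["binomial-coefficient", "_foo", "-3.5", "$x", "→", "a@>=b"]:
        print(repr(t), "->", classify_term(t))
```

A mixed string such as a@>=b matches none of the classes and comes back as None. Note that a category-Sm-only rule excludes characters classified as So (Symbol, other), which U+2345 "⍅" appears to be, so the exact character classes matter.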

@NSoiffer
Contributor Author

The only way to know where something starts and ends is spaces and a few delimiters, so it is hard to know how to break up arbitrary strings of characters without something like the symbol non-terminal you introduce above. So if we want to allow them in intent, then I'm in favor of making it clear via the grammar that a@>=b etc., is not legal, or is at least undefined as to what should happen.

@NSoiffer
Contributor Author

A reality check...

AT will send characters to speech engines. What happens with them varies by speech engine, but engines tend to know only a few. As a test, I tried

This apl symbol ⍅ doesn't read.
This greek one does:  ω.
First inequality ≤ reads but the second doesn't:  ≦.
Infinity doesn't work: ∞ Neither does angle  ∠.

Try them yourself at NaturalReader and TTSReader. TTSReader does read ∞. There are other online TTS engines you can try. Those were the first two that showed up in a search I tried.

AT may have some characters it knows about and converts them before sending them to the TTS engine. There are scripts/addons people develop to do the conversions for known problems with various chars/speech engines.

Obviously, a math part of AT should be able to add words for the math parts of Unicode, but what's a math symbol? Certainly not 🐇 or an emoticon. MathCAT has about 4,000 math-related Unicode chars it knows about. SRE has about 2,700. I can't speak to JAWS or VoiceOver, but I suspect significantly fewer.

In real life, not counting alphabets and digits, many fewer than 100 math characters are used in textbooks. For grade school algebra, it is around 10. So the lack of support of esoteric characters may be disappointing, but very few people will ever encounter the problem. For arXiv, I imagine there could be a problem. If someone (@dginev , hint, hint) does an arXiv search for all the Unicode chars used, I can extend MathCAT to cover them, although I am dubious the speech will be the desired speech since what I use will be based on the Unicode description.

How does this relate to the intent character discussion? In my view intent serves as a way to guarantee speech. Allowing symbols in the intent rather than requiring words removes that guarantee. To be fair though, there is no guarantee the words are pronounced well -- I get a good laugh many times a day at the mispronunciations such as "spo tee fee" for "spotify".

@davidcarlisle
Collaborator

@NSoiffer

A reality check...

Yes not surprising in a way, but still I'd rather specify you may use math symbols, then warn they don't work with current speech engines.

When we started MathML, we aggressively specified Unicode math element content, despite the fact that Unicode math blocks had not been specified, no fonts existed that matched the proposals and basically nothing other than + and - worked unless you used custom mapping to legacy 8 bit encodings from tex or mathematica or adobe symbol or whatever.

But time passes and things improve.... I can now drop ⊞ in this web page with a reasonable expectation you have a font that supports U+229E.

but I think you have persuaded me that allowing everything except (),: is probably reckless, but I think having a separate symbol clause that just allows math symbols (class Sm for example) and can have a suitable warning could work. If we merge the PR to add properties I'll make a new PR to review adding symbol.

@brucemiller
Contributor

It's good to get some of these hidden assumptions & expectations out in the open, independent of which way we decide the details; probably more of them need to find their way into the spec to make our choices understandable.

I've understood the task to be more about supporting speech, and disambiguation, rather than guaranteeing speech. If anything the template-style approaches (like #446) emphasize the guarantee. Conversely, the function-style approaches (like the current spec) emphasize disambiguation while leaving the AT lots of room to adapt its output. In fact, most of our discussions have left me feeling like guaranteeing speech was undesirable, even if possible. Moreover, the "Best" speech doesn't seem well enough defined for us to be specifying it; as good as MathCat is, guaranteeing speech at this point would forbid future improvement.

I've been thinking of literal from the point of view of the intent grammar, rather than necessarily exactly what is passed to a "Speech Engine". Is the purpose of intent to support (guarantee, or ...) speech, or is it an interface to a speech engine? If the latter, do we need to specify what that is? I have been thinking of "speech engine" as an internal concern of the AT, important though it may be.

Finally, @davidcarlisle's point about Unicode fonts is exactly relevant: whether or not we extend literal, weird, perhaps new, unicode will end up in either the MathML or intent, and you'll get different results under different circumstances.

@dginev
Contributor

dginev commented Jul 26, 2023

If someone (@dginev , hint, hint) does an arXiv search for all the Unicode chars used, I can extend MathCAT to cover them, although I am dubious the speech will be the desired speech since what I use will be based on the Unicode description.

I assume you only mean math mode chars - that list I haven't compiled yet, but I have an old (03.2021) list that has a shortlist of 561 characters used in arXiv (after latexml processing, normalized to plain-text), with relative frequency counts. It's a very informal list (includes both text and math content, and is limited by latexml's implemented coverage), but should be more representative than random:

arxmliv_chars.txt

If/when they are all supported in MathCAT, I can do a more formal study of the latest data to uncover more.

@NSoiffer
Contributor Author

@dginev: yes, I mean math mode, but probably not Asian chars that might be identifiers in an Asian contribution in arXiv. They are probably only useful for someone using an Asian speech engine (for that language) and it would know what to do with them.

MathCAT has definitions for a little over 5,000 Unicode chars (common chars, uncommon chars)

I tried ~10 of the less common chars in your list and they were all in MathCAT. If you get a complete list, I'll write some code to make sure they are all in MathCAT.
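
The kind of check described could look like the Python sketch below. It assumes both the character list and the supported set are already available as in-memory collections of single characters; the function name and the sample data are hypothetical, and no particular file format for arxmliv_chars.txt or MathCAT's definitions is implied.

```python
from typing import Iterable, List


def missing_chars(used: Iterable[str], supported: Iterable[str]) -> List[str]:
    """Return the characters in `used` that are absent from `supported`,
    sorted by code point so the report is stable."""
    return sorted(set(used) - set(supported), key=ord)


if __name__ == "__main__":
    # Hypothetical sample data, not taken from arXiv or MathCAT.
    used_in_corpus = ["a", "∞", "≤", "⍅"]
    known_to_engine = {"a", "∞", "≤"}
    for ch in missing_chars(used_in_corpus, known_to_engine):
        print(f"U+{ord(ch):04X} {ch}")
```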


5 participants