-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Spec] Expression grammar formalization #73
Comments
Regarding array expressions mentioned in the ticket: The only valid logical operation I could think of for arrays is the ternary operator. There are limited instances where rows are doubled so that behavior for the array length can change, usually based on version. This isn't actually getting addressed directly in this ticket, but it illustrates my point: <add name="Triangles" ... arr1="Num Triangles" ver2="10.0.1.2" />
<add name="Triangles" ... arr1="Num Triangles" ver1="10.0.1.3" cond="Has Triangles" /> Without using equality or comparison operators,
It could also be |
I missed that |
I have added example WIP grammars to the main ticket. They were collectively too long so I put them on Gist. |
I have added information to the main ticket regarding floating point literals and their use in expressions. Mostly, this floating point comparison required for <add name="Backlight Power" type="float" cond="Rimlight Power == 0x7F7FFFFF" ... /> I believe floating point literals should possibly be removed or explicitly disallowed in the grammar. As for the above condition, it makes for a very problematic edge case. The only thing I can possibly think of is this: <add name="Backlight Power" type="float" cond="(Rimlight Power ^ 0x7F7FFFFF) == 0" ... /> Except in C++, etc. bitwise operations are simply not defined on floating point values. This does at least carry more connotation that you are looking at the bits and not the actual floating point value. We could also special-case the value
And allow that as the only floating point comparison. As far as precision goes, numerically the string seems to be able to vary from |
Found a slight issue with my identifier regex involving hex literals. It should only be matching the text, not the integers. The solution was to make sure that hex literal parsing comes before identifier parsing, but I am also using the regex directly to extract identifiers from the expression string for some purposes. I will explore whether I should explicitly exclude matching of hex integers from the identifier regex. |
I just went ahead and put a Note: |
For niftools#76. `calc` does not need to be supported and is only used for pre-serialization preparation of the data. This can also be done by hand. Some revisions also need to be made for niftools#70 and niftools#73 to account for another attribute with its own expression grammar and tokens. However, the tokens introduced for `calc` will not be added to the tokens in the XML and will need to be explicitly supported if supporting `calc`. They are: 1D array size (uses regex): `#LEN[(.*?)]#` 2D array size (uses regex): `#LEN2[(.*?)]#` Ternary `?`: `#THEN#` Ternary `:`: `#ELSE#` Also removes any unnecessary `calculated` without replacement. Additionally, `calc` is used to limit array size for the time being, though it's possible this should be done with its own attribute, such as `maxlen`.
For the exact comparison with floats, I don't think it makes sense to write the hex bytes, since there is no indication that it is a float (aside from the data type of the field, but I don't think there should be special names for any types). The most logical choice to me to exactly represent floats would be the hexadecimal literals as explained here: https://en.wikipedia.org/wiki/IEEE_754#Character_representation In order to make a distinction between the various float precisions, that would have to be implied by the precision of the hexadecimal number. |
The repo needs files which include grammars for the attributes that require evaluation of their strings.
These attributes are:
<add>
:arr1
arr2
cond
vercond
arg
(?)For each of these, the grammar would answer:
There are additional things to formalize outside of the grammar itself, such as having multiple grammars specialized for the various attributes. For example, logical operations are not valid in
arr1
andarr2
, or are they? Since we haven't formalized this, there is no actual answer. Right now the only operations that makesarr1
andarr2
expressions are arithmetic though extremely limited (only 1-2 uses iirc).I will elaborate on this in the comments, since discussing it at the top of the ticket will be likely overwhelming. :)
Identifier
The rules for what a valid identifier is are fairly simple. It's whatever makes a valid
name
for any of the XML tags, plus a few reserved keywords like the ARG and TEMPLATE tokens. However the latter does not apply to all of the attributes equally.Identifier Regex
The current regex matches any sequence of alphanumeric (plus
:
) words separated by a single space.BSConnectPoint::Parents
. Colon is also used in some name attributes already.-
is an arithmetic operator.Rigorous
Edit: I have added
\b
to the start as it is more explicit about requiring alpha at the start. Without it it would incorrectly match0x10000
.Disallows:
A-Z
,a-z
,0-9
,_
and:
.Simplified
Almost exactly half the steps in this case, but the output would have to be trimmed.
Allows:
Number
Version Regex
Gamebryo versions are considered a number because they are merely a way of representing each component bit shifted into an analogous number, best seen in hex, e.g.
20.2.0.7
=>0x14020007
.It gets complicated because to support versions 2.3, 3.0, 3.03, and 3.1 would require a regex that can also match floats. However, none of these versions are currently used in
vercond
. They are only referred to inver1
andver2
which are not expressions.Simplest 4-Component or 2+ Component
Allows:
3.0.1
(BAD)99.99.99.999
99.99.99.999.99.99.99.999
Somewhat Rigorous
Disallows:
3.0.1
Allows:
99.99.99.999
99.99.99.999.99.99.99.999
Rigorous
The number of steps (in Python at least) blows up with both a negative lookbehind and lookahead. In PCRE (php) it is only ~500 steps. Uncached, both take about 300ms, but even cached, Python takes ~5ms which is quite slow.
Disallows:
Pedantic
I'm not patient enough to write this one, but it would include minimizing the value ranges of each component to what exists in reality, kind of like limiting IPs to 255.255.255.255 among other limitations.
Numeric Regexes
Decimal
Any series of 1 or more digits that are not directly succeeded by a dot (in order to not match floats).
Hexadecimal
Binary
Float
Simple
This version of the regex could erroneously match 4-component Version identifiers.
Rigorous
This still doesn't apply to the rarer formats
0.-0. 0.f -0.f 1e10 1e-5
, but no one really writes them that way anyway.Other Identifiers
TEMPLATE and ARG
These are currently special key words and should formally be considered a token. Their use outside of the intended areas (which also aren't currently defined) should error. These two tokens are also a good example of why
vercond
at least needs its own separate grammar (aside from arithmetic making no sense there).TEMPLATE is not actually used in any attributes with expressions, only
type
andtemplate
, but it technically could be. Should it?cond="TEMPLATE == NiProperty"
i.e. template specialization.TEMPLATE and ARG should probably take on the token delimiter syntax proposed in [Spec] Token definitions #70 . It would make things consistent and all strings could be split on tokens via one single regex
Operators
For ease of reading I will make tables of what is in use and what operators do and do not need to be implemented that aren't already being used.
Arithmetic
Addition, Multiplication, and Division are all used. Subtraction used to be but was actually unnecessary for that field.
+
*
/
-
%
++
--
=
Logical and Bitwise
&&
||
!
&
|
<<
>>
^
~
Member Access
With the introduction of
BSVertexDesc
, a compound that is attempting to mimic a bitfield that spans a 64-bit integer, a child ofBSVertexDesc
needed to be passed into another compound via the shared parent.Most programming languages use
.
for this, but with our identifiers having spaces, it just looks awkward. (Also, in this particular case, the real answer is to actually introduce a real bitfield type, which is relevant to #3 and the Flags type.)However, a member access operator is still potentially useful and should be formalized. It could potentially be used nearly everywhere because
User Version 2
is actually incorrect. The header should look something like this:As this is exactly what happens in their engine (minus the conditions on the header version of course, since it's assumed).
Then anywhere there is
User Version 2
it would be replaced withBSStream\Version
.Global Variables
This brings up the next issue, which is globally available variables. Right now, all
vercond
assume access to the following identifiers in some way or another:Version
User Version
User Version 2
(i.e.BSStream\Version
)Since they have special meaning, they should probably be tokenized (see #70), and any parser should see the token and replace it with the correct value. This is of course only for
vercond
--furthering the need for two grammars.These variables are referenced in
cond
but only in the Header compound since they are actually local variables in that case. Which brings up the next thing in need of specification: Do we make the version values explicitly unavailable tocond
? I could contrive uses for having them incond
(as I mentioned I will illustrate in a comment below), but these mostly involve using ternary expressions, something that we also haven't specified. The important thing is thatcond
is meant for dynamically changing its size and read/write status based on the values of its antecedent siblings.Therefore the assumption will be that these global values are only available in the grammar specific to
vercond
, as special keywords.Grammar Format/Notation
I have already nearly finished writing the grammars for both
cond
andvercond
using a variant of PEG (Packrat) notation. Specifically, this variant can be parsed by Arpeggio for use by nifxml.py.We do not have to decide on only one format however. A grammars folder has been added to the repo root and any formats we would like to support can be added to it for each grammar. If the files are not actually parsed directly by a library, etc. then at least the grammar is there for reference.
Grammar Requirements for Each Attribute
Operators
cond
vercond
arr*
arg
!
)-
)(
)
꙳: In the
BSVertexDesc
case, some masking and bit shifting on a plain int64 inside ofarg
would have been enough to pass the required information to the compound. So, bitwise could be nice to introduce forarg
.Operands
cond
vercond
arr*
arg
#ARG#
꙳#TEMPLATE#
꙳꙳: Any columns with an ❌ for this identifier/token means that the grammar would implement these as reserved keywords, but do nothing with them (or error, etc.).
As can be seen by comparing the columns, each attribute needs its own specific grammar
even excluding the possible support of bitwise ops inarray and args still differ in allowed operands.arg
,Edit: It was overlooked that
arr1
already uses bitwise operators for UV Sets.Example Grammars (WIP)
Still largely WIP, and I haven't implemented some operators yet, like modulus.
Cond (Gist)
Currently has the entire superset of all the grammars, i.e. nothing has been implemented that is not allowed in it that is allowed in other grammars.
Version
andUser Version
appear incond
only in theHeader
compound. These keywords should be reserved and unavailable for use incond
, but that requires morphing them into tokens such as#VER#
and#USER#
for use invercond
.There's a very odd case of floating point comparison required for
BSLightingShaderProperty
for FO4:To assure an exact comparison, the RHS is written as essentially hex bytes, and it is assumed that the expression parser convert the float to hex bytes as well for the LHS.
Vercond (Gist)
As discussed, only has relational and logical operators. No member access operator, bitwise, or arithmetic operators.
Current Issues:
Version
,User Version
, etc. by reserving as keywords.#VER#
and#USER#
and that way they are unambiguous.VXX_X_X_X
. This allows any language, interpreted or generated, to skip converting theXX.X.X.X
format to an integer and use the enumeration value directly.Arr (Gist)
No relational or logical operators. No floating point numbers. Has member access operators, bitwise, and arithmetic.
Arg
Haven't approached this with any grammar yet. Still not totally sure if it will be necessary.
Note
This ticket is WIP and I will be updating it more this evening, and adding the examples of the PEG grammar once I finalize some things. I do have Arpeggio working fully with the grammars I've defined, and it will be replacing the expression parsing in nifxml.py.
I have yet to elaborate on:
The newSee checklists above.<version>
ID attribute (Issue [Spec] Version tag changes #69), and it becoming its own identifier format in addition to--or maybe replacing--theXX.X.X.X
format used now. Essentially, the version has a unique identifier and can be referenced directly now, even for version ranges, which is simply all the<version>
between two specified IDs.See checklists above.Version
vs#VERSION#
(or#VER#
, etc.) in expressions. That is, we need to make Version, User Version, and BSStream Version reserved keywords or tokens, and also disallow them incond
because it can't have globals. But in theHeader
compound, they are local and need to be used as locals, and so can't be reserved. Thus all theVersion
etc. invercond
probably need to be tokenized and I will elaborate on that later.The text was updated successfully, but these errors were encountered: