-
Notifications
You must be signed in to change notification settings - Fork 38
Compression ContextualPredictionModel
This document describes the contextual prediction module used to produce probability distributions to feed to an entropy coder.
Contextual prediction determines the probability of a given value ocurring at a given location in the tree, by considering the ancestors at that location.
This is a static prediction model - no adaptive mechanisms are used. The format specification provides the complete static model derived from a corpus of source code.
The K-level context
for a value in the tree refers to
the length-K path suffix associated with that value.
The alphabet at a particular encoding step refers to the set of all allowed values at that location (e.g. node type, string-table reference, etc.)
For example:
Script {
statements: [
ExpressionStatement {
expression: BinaryExpression { // (A)
operator: BinaryOperator.Add,
left: IdentifierExpression {
name: ident("foo")
},
right: StaticMemberExpression {
object: IdentifierExpression {
name: ident("bar")
},
property: prop("zoom")
}
}
}
]
}
To encode the type-id of a node occurring under
ExpressionStatement.expression
(see line (A)
) -
the alphabet used would be derived from the type-bounds of
ExpressionStatement.expression
- which is the union
of BinaryExpression | UnaryExpression | ...
.
These type bounds, when normalized, yield a finite set of types, which is used directly as the alphabet.
String references, unsigned integers, signed integers, and double-precision number values are not encoded using this approach.
They derive from large or unconstrained alphabets, and the containing context predicts them poorly.
The prediction tables for this format are generated from analysis of a corpus of source-code, represented as a collection of Javascript files.
The analysis traverses all the files, for each candidate value computes the context for it, and records the count of the number of occurrences of that value within that context.
The computed probability for a given value V
within a
context C
would be count(V) / sum(count(w) for w in C)
-
which is the relative weight of V
across the sum of all
weights over all values that were observed under context C
.