Compression ContextualPredictionModel

This document describes the contextual prediction module used to produce probability distributions to feed to an entropy coder.

Contextual prediction determines the probability of a given value occurring at a given location in the tree by considering the ancestors of that location.

This is a static prediction model: no adaptive mechanisms are used. The format specification provides the complete static model, derived from a corpus of source code.

Definition of Context

The K-level context for a value in the tree is the length-K suffix of the path leading from the root to that value.
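
As an illustration only, here is a minimal Python sketch (using hypothetical names such as PathEntry and k_level_context) of how a K-level context could be taken as the suffix of the path leading to a value:

from typing import NamedTuple, Tuple

# Hypothetical representation of one step on the path from the root:
# the ancestor node's type and the field (or list position) through
# which the next node is reached.
class PathEntry(NamedTuple):
    node_type: str   # e.g. "ExpressionStatement"
    field: str       # e.g. "expression"

def k_level_context(path: Tuple[PathEntry, ...], k: int) -> Tuple[PathEntry, ...]:
    # The length-K suffix of the path; shorter paths are used whole.
    return path[-k:] if k > 0 else ()

# The context of the BinaryExpression node marked (A) in the example
# below, with K = 2.
path = (
    PathEntry("Script", "statements"),
    PathEntry("ExpressionStatement", "expression"),
)
print(k_level_context(path, 2))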

Definition of Alphabets

The alphabet at a particular encoding step is the set of all values allowed at that location (e.g. node types, string-table references).

For example:

Script {
  statements: [
    ExpressionStatement {
      expression: BinaryExpression { // (A)
        operator: BinaryOperator.Add,
        left: IdentifierExpression {
          name: ident("foo")
        },
        right: StaticMemberExpression {
          object: IdentifierExpression {
            name: ident("bar")
          },
          property: prop("zoom")
        }
      }
    }
  ]
}

To encode the type-id of a node occurring under ExpressionStatement.expression (see the node marked (A)), the alphabet used is derived from the type bounds of ExpressionStatement.expression, which is the union BinaryExpression | UnaryExpression | ....

These type bounds, when normalized, yield a finite set of types, which is used directly as the alphabet.
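
As a sketch of this step, assuming a hypothetical TYPE_BOUNDS table that maps each (parent type, field) pair to its normalized type bound (the real table is defined by the format specification and is not shown here):

from typing import Dict, FrozenSet, Tuple

# Hypothetical, heavily abridged table: (parent type, field) -> the
# normalized, finite set of node types allowed at that position.
TYPE_BOUNDS: Dict[Tuple[str, str], FrozenSet[str]] = {
    ("ExpressionStatement", "expression"): frozenset({
        "BinaryExpression", "UnaryExpression", "IdentifierExpression",
        "StaticMemberExpression",  # ... remaining expression types elided
    }),
}

def alphabet_at(parent_type: str, field: str) -> FrozenSet[str]:
    # The alphabet for a type-id encoding step is the normalized
    # type bound of the field being encoded.
    return TYPE_BOUNDS[(parent_type, field)]

# The node marked (A) is encoded against the alphabet of
# ExpressionStatement.expression.
print(sorted(alphabet_at("ExpressionStatement", "expression")))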

String References and Numbers

String references, unsigned integers, signed integers, and double-precision number values are not encoded using this approach.

They are drawn from large or unconstrained alphabets, and the containing context predicts them poorly.

Generation of Probabilities

The prediction tables for this format are generated from analysis of a corpus of source code, represented as a collection of JavaScript files.

The analysis traverses all the files, computes the context for each candidate value, and records the number of occurrences of that value within that context.
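
A minimal sketch of this counting pass, assuming the corpus walker yields (context, value) pairs (the walker itself, and collect_counts, are hypothetical):

from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Tuple

def collect_counts(
    observations: Iterable[Tuple[Hashable, Hashable]],
) -> Dict[Hashable, Counter]:
    # counts[context][value] = number of times `value` was observed
    # under `context` across the whole corpus.
    counts: Dict[Hashable, Counter] = defaultdict(Counter)
    for context, value in observations:
        counts[context][value] += 1
    return counts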

The computed probability for a given value V within a context C is count(C, V) / sum(count(C, W) for all values W observed under C), i.e. the relative weight of V among all values observed under context C.
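
Expressed as code (continuing the hypothetical collect_counts sketch above):

from collections import Counter
from typing import Dict, Hashable

def probabilities_for_context(counts: Counter) -> Dict[Hashable, float]:
    # P(V | C) = count(C, V) / sum of count(C, W) over all W seen under C.
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

# Example: two node types observed under some context.
ctx_counts = Counter({"BinaryExpression": 30, "IdentifierExpression": 10})
print(probabilities_for_context(ctx_counts))
# {'BinaryExpression': 0.75, 'IdentifierExpression': 0.25}

The resulting per-context distributions are what the static model stores and feeds to the entropy coder.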