Optimization of ATN serialization format #3494

KvanTTT · 2022-01-16T12:15:24Z

KvanTTT
Jan 16, 2022

Advanced encoding format

With the current ATN serialization format, every int value takes at least 2 bytes (16 bits). That is not optimal, because most values are small and don't need all 16 bits. I suggest using a variable number of bytes per int value: 1 at minimum, 5 at maximum (see the table). This allows much better compression. The encoding format may look like the following (it seems to cover everything we need for ATN):

encoding	bytes count	type
0xxxxxxx	1	uint (7 bit)
100xxxxx xxxxxxxx	2	uint (13 bit)
101xxxxx xxxxxxxx xxxxxxxx	3	uint (21 bit)
11000000 xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx	5	uint (32 bit)
11111111	1	-1 (0xFFFF)

This scheme decreases the size of the output data, since most values are small. At the same time, it supports big ATNs without the 16-bit restriction on int values. Decoding is also fast, because the first byte alone determines how many bytes follow.

This format is similar to the MessagePack binary format, but it's simpler, since it needs only an int type and one special value for an invalid number (-1).
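A minimal sketch of the byte-oriented scheme from the table, in Java. The class and method names (`AtnVarInt`, `write`, `encode`, `Reader`) are hypothetical, and the big-endian order of the payload bits is an assumption, since the proposal does not specify it. The value -1 is treated as the int -1 rather than as the 16-bit 0xFFFF.

```java
import java.io.ByteArrayOutputStream;

final class AtnVarInt {
    /** Appends one value to out using the byte-oriented scheme from the table. */
    static void write(ByteArrayOutputStream out, int value) {
        if (value == -1) {
            out.write(0xFF);                    // 11111111: dedicated code for -1
        } else if ((value & ~0x7F) == 0) {
            out.write(value);                   // 0xxxxxxx: 7-bit payload
        } else if ((value & ~0x1FFF) == 0) {
            out.write(0x80 | (value >>> 8));    // 100xxxxx: top 5 of 13 bits
            out.write(value);                   // write(int) keeps only the low 8 bits
        } else if ((value & ~0x1FFFFF) == 0) {
            out.write(0xA0 | (value >>> 16));   // 101xxxxx: top 5 of 21 bits
            out.write(value >>> 8);
            out.write(value);
        } else {
            out.write(0xC0);                    // 11000000: marker, then all 32 bits
            out.write(value >>> 24);
            out.write(value >>> 16);
            out.write(value >>> 8);
            out.write(value);
        }
    }

    /** Convenience helper: encodes a whole sequence of values. */
    static byte[] encode(int... values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int v : values) write(out, v);
        return out.toByteArray();
    }

    /** Sequential decoder over an encoded byte array. */
    static final class Reader {
        private final byte[] data;
        private int pos;

        Reader(byte[] data) { this.data = data; }

        boolean hasNext() { return pos < data.length; }

        private int next() { return data[pos++] & 0xFF; }

        int read() {
            int b = next();
            if (b < 0x80) return b;
            if (b == 0xFF) return -1;
            if ((b & 0xE0) == 0x80) return ((b & 0x1F) << 8) | next();
            if ((b & 0xE0) == 0xA0) return ((b & 0x1F) << 16) | (next() << 8) | next();
            if (b == 0xC0) return (next() << 24) | (next() << 16) | (next() << 8) | next();
            throw new IllegalArgumentException("unknown prefix byte: " + b);
        }
    }
}
```

With this sketch, values up to 127 and -1 take one byte, values up to 8191 take two, values up to 2097151 take three, and everything else takes five.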

Array of long or base64 string instead of the current string encoding

With the old scheme, the serializer and deserializer have to add or subtract a strange offset (2). As I understand it, this is so that the frequent value 0xFFFF becomes 0x1 and takes only 1 byte of storage. But it impairs clarity and causes bugs like #1925. Binary data should not depend on its string representation.

With the new encoding format, there is no need for such optimization because 0xFFFF is serialized to 1 byte as well (0b11111111).
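As a rough illustration of the offset trick discussed above, the following Java sketch assumes the offset is applied modulo 2^16 and that the serialized string is stored in the class file as modified UTF-8, where U+0001..U+007F take one byte, U+0000 and U+0080..U+07FF take two, and all other chars take three. The names `OffsetHack`, `encode`, `decode` and `classFileBytes` are hypothetical and are not taken from the ANTLR code base.

```java
final class OffsetHack {
    static final int OFFSET = 2;

    /** Shifts a 16-bit value so that 0xFFFF maps to 1 and 0 maps to 2. */
    static char encode(int value) {
        return (char) ((value + OFFSET) & 0xFFFF);
    }

    /** Reverses encode(). */
    static int decode(char c) {
        return (c - OFFSET) & 0xFFFF;
    }

    /** Number of bytes one char occupies in a class-file string constant. */
    static int classFileBytes(char c) {
        if (c >= 0x0001 && c <= 0x007F) return 1;
        if (c <= 0x07FF) return 2;   // includes U+0000, which is stored as two bytes
        return 3;
    }
}
```

Under these assumptions, 0xFFFF costs three bytes when stored directly but one byte after the shift, and 0 costs two bytes directly but one after the shift. The price is that every reader of the data has to know about the offset.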

Since this optimization is no longer needed, I suggest using a more standard string encoding such as base64, because it looks more natural and doesn't require inc/dec hacks. All popular languages support base64 encoding. Alternatively, an array of long can be used, since it has the most compact source representation compared to smaller int types.

KvanTTT · 2022-01-17T18:05:47Z

KvanTTT
Jan 17, 2022
Author

There is a more compact scheme that operates on half-bytes instead of bytes. It's almost the same, but the most frequent values (-1, 0, 1) take 0.5 bytes instead of 1. Data size is especially important for targets whose code is sent over the internet (JavaScript and especially WASM).

encoding	bytes count	type
0xxxxxxx	1	uint (7 bit, 2..129)
1000xxxx xxxxxxxx	2	uint (12 bit)
1001xxxx xxxxxxxx xxxxxxxx	3	uint (20 bit)
1010xxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxx	4.5	uint (32 bit)
1011	0.5	reserved
1100	0.5	0
1101	0.5	1
1110	0.5	reserved
1111	0.5	-1 (0xFFFF)

Reserved values can also be used for frequent data.
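The half-byte scheme can be sketched in Java as follows. To keep the example short, the nibble stream is represented as a string of hex digits (one digit per half-byte) instead of packed bytes. The names `NibbleVarInt`, `write`, `encode` and `Reader` are hypothetical. Two details are assumptions, because the table does not spell them out: the 7-bit form stores `value - 2` (to match the stated range 2..129), and the longer forms store the value without any offset.

```java
final class NibbleVarInt {
    /** Appends count nibbles of bits, most significant first, as hex digits. */
    private static void append(StringBuilder out, long bits, int count) {
        for (int i = count - 1; i >= 0; i--) {
            out.append(Character.forDigit((int) ((bits >>> (4 * i)) & 0xF), 16));
        }
    }

    static void write(StringBuilder out, int value) {
        if (value == -1) {
            append(out, 0xF, 1);                 // 1111
        } else if (value == 0) {
            append(out, 0xC, 1);                 // 1100
        } else if (value == 1) {
            append(out, 0xD, 1);                 // 1101
        } else if (value >= 2 && value <= 129) {
            append(out, value - 2, 2);           // 0xxxxxxx, shifted to cover 2..129
        } else if (value >= 0 && value < (1 << 12)) {
            append(out, 0x8, 1);                 // 1000 + 12 bits
            append(out, value, 3);
        } else if (value >= 0 && value < (1 << 20)) {
            append(out, 0x9, 1);                 // 1001 + 20 bits
            append(out, value, 5);
        } else {
            append(out, 0xA, 1);                 // 1010 + 32 bits
            append(out, value & 0xFFFFFFFFL, 8);
        }
    }

    static String encode(int... values) {
        StringBuilder out = new StringBuilder();
        for (int v : values) write(out, v);
        return out.toString();
    }

    /** Sequential decoder over a string of hex digits. */
    static final class Reader {
        private final String nibbles;
        private int pos;

        Reader(String nibbles) { this.nibbles = nibbles; }

        boolean hasNext() { return pos < nibbles.length(); }

        private int nibble() { return Character.digit(nibbles.charAt(pos++), 16); }

        private int take(int count) {
            int v = 0;
            for (int i = 0; i < count; i++) v = (v << 4) | nibble();
            return v;
        }

        int read() {
            int first = nibble();
            if (first < 0x8) return ((first << 4) | nibble()) + 2;
            switch (first) {
                case 0x8: return take(3);
                case 0x9: return take(5);
                case 0xA: return take(8);
                case 0xC: return 0;
                case 0xD: return 1;
                case 0xF: return -1;
                default: throw new IllegalArgumentException("reserved prefix: " + first);
            }
        }
    }
}
```

A real implementation would also have to pack two nibbles per byte and decide how to pad a stream with an odd number of nibbles; one of the reserved codes could serve as padding, but that is a design choice the proposal leaves open.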

Also, Huffman coding can be used to decrease the ATN size further, but it requires a table and has a more complicated implementation.

0 replies

parrt · 2022-02-24T01:09:12Z

parrt
Feb 24, 2022
Maintainer

Thanks for documenting this idea. It could prove useful in a future version.

0 replies

parrt · 2022-02-24T01:12:01Z

parrt
Feb 24, 2022
Maintainer

What if we did something crazy like make an array of ints, compress it, then save 2-byte pairs as chars for Java target? Expensive to decompress at runtime I guess.

2 replies

KvanTTT Feb 24, 2022
Author

Ints can be compressed/decompressed during encoding/decoding.

parrt Feb 26, 2022
Maintainer

Well, I'm talking about global compression of the array, looking at the entire thing, like 7z or zip.
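The idea of compressing the whole array and storing byte pairs as chars could look roughly like this in Java. This is only a sketch under stated assumptions: it uses the JDK's `java.util.zip.Deflater` and `Inflater` as a stand-in for "7z or zip", stores ints big-endian, and pads an odd-length compressed stream with one zero byte. The names `CompressedAtn`, `pack` and `unpack` are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class CompressedAtn {
    /** Compresses the whole int array and packs two compressed bytes per char. */
    static String pack(int[] atn) {
        ByteBuffer raw = ByteBuffer.allocate(atn.length * Integer.BYTES);
        raw.asIntBuffer().put(atn);              // big-endian by default

        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw.array());
        deflater.finish();
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        while (!deflater.finished()) {
            compressed.write(chunk, 0, deflater.deflate(chunk));
        }
        deflater.end();

        byte[] z = compressed.toByteArray();
        StringBuilder chars = new StringBuilder();
        for (int i = 0; i < z.length; i += 2) {
            int hi = z[i] & 0xFF;
            int lo = i + 1 < z.length ? z[i + 1] & 0xFF : 0;   // zero padding if odd
            chars.append((char) ((hi << 8) | lo));
        }
        return chars.toString();
    }

    /** Reverses pack(); this is the work that would happen at runtime. */
    static int[] unpack(String packed) {
        byte[] z = new byte[packed.length() * 2];
        for (int i = 0; i < packed.length(); i++) {
            char c = packed.charAt(i);
            z[2 * i] = (byte) (c >>> 8);
            z[2 * i + 1] = (byte) c;
        }
        Inflater inflater = new Inflater();
        inflater.setInput(z);
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        try {
            while (!inflater.finished()) {       // stops at end of stream, ignoring padding
                int n = inflater.inflate(chunk);
                raw.write(chunk, 0, n);
                if (n == 0 && !inflater.finished()) {
                    throw new IllegalArgumentException("truncated or unsupported stream");
                }
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt data", e);
        } finally {
            inflater.end();
        }
        IntBuffer ints = ByteBuffer.wrap(raw.toByteArray()).asIntBuffer();
        int[] atn = new int[ints.remaining()];
        ints.get(atn);
        return atn;
    }
}
```

The resulting chars cover the whole 16-bit range, so writing them into generated source would still need an escaping step, and `unpack` is the decompression cost at class-load time that the comment above worries about.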

ericvergnaud · 2022-02-24T08:14:16Z

ericvergnaud
Feb 24, 2022
Maintainer

Have we re-tried an array of ints?
Many things have improved over the years (think of the automatic use of StringBuilder), so it could be that it's now an optimal solution?

1 reply

KvanTTT Feb 24, 2022
Author

The main problem is the int range: all ints must be within [0..65535]. That is not optimal, since it does not cover the whole int range and does not support big numbers. Also, 0xFFFF is encoded incorrectly: #3555. StringBuilder is about other things.

parrt · 2022-02-26T19:04:12Z

parrt
Feb 26, 2022
Maintainer

@KvanTTT I think that @ericvergnaud was talking about 32-bit ints. I would dearly love to get rid of this whole encoding-as-strings crap. In fact, no target should do this unless it has to. C++ and Go have these beautiful little integer arrays. Let me go do an experiment, but I'm almost certain that they have not changed the class file format such that they can do static arrays without creating massive constructor functions. These functions take like five bytecodes per integer to initialize the array... kind of slow anyway.

2 replies

KvanTTT Feb 26, 2022
Author

Ok, I have no objection to using natural long/int/ushort arrays instead of strings. Arrays of longs have the smallest size in source code if it matters.

parrt wrote: "C++ and Go have these beautiful little integer arrays."

A small correction: they use ushort (16 bit) arrays, not int (32 bit) arrays.

parrt Feb 26, 2022
Maintainer

oh! well, presumably they could change that to whatever they wanted like int32.

parrt · 2022-02-26T19:12:45Z

parrt
Feb 26, 2022
Maintainer

Dang. Yep, same same.

critter:/tmp $ cat T.java
public class T {
	public static final int[] a = { 1, 2, 3, 4, 5 };
}
critter:/tmp $ javap -c T
Compiled from "T.java"
public class T {
  public static final int[] a;

  public T();
    Code:
       0: aload_0
       1: invokespecial #1                  // Method java/lang/Object."<init>":()V
       4: return

  static {};
    Code:
       0: iconst_5
       1: newarray       int
       3: dup
       4: iconst_0
       5: iconst_1
       6: iastore
       7: dup
       8: iconst_1
       9: iconst_2
      10: iastore
      11: dup
      12: iconst_2
      13: iconst_3
      14: iastore
      15: dup
      16: iconst_3
      17: iconst_4
      18: iastore
      19: dup
      20: iconst_4
      21: iconst_5
      22: iastore
      23: putstatic     #2                  // Field a:[I
      26: return
}
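For contrast with the per-element initializer in the javap output above, here is a minimal sketch of the string-backed alternative, where the data is a single string constant that gets expanded once at class initialization. The class name `StringBackedArray` and the helper `expand` are hypothetical, and the example assumes every element fits into one 16-bit char.

```java
final class StringBackedArray {
    // The data lives in the constant pool as one string,
    // instead of several bytecodes per element in the static initializer.
    private static final String DATA = "\1\2\3\4\5";

    public static final int[] a = expand(DATA);

    /** Copies each char of the constant into an int array. */
    static int[] expand(String data) {
        int[] result = new int[data.length()];
        for (int i = 0; i < result.length; i++) {
            result[i] = data.charAt(i);
        }
        return result;
    }
}
```

The static initializer then stays the same size no matter how long the array is, which matters because a single JVM method is limited to 64 KB of bytecode.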

4 replies

parrt Feb 26, 2022
Maintainer

I also just revisited the base64 thing (available now in Java 8), but that operates on bytes, so we would still have to do some kind of encoding to get bytes from ints.
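A small sketch of that extra step, assuming the ints are simply written as 4 big-endian bytes each before being handed to `java.util.Base64`. The class name `Base64Atn` and its methods are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.util.Base64;

final class Base64Atn {
    /** Converts ints to big-endian bytes, then to a base64 string. */
    static String encode(int[] values) {
        ByteBuffer bytes = ByteBuffer.allocate(values.length * Integer.BYTES);
        bytes.asIntBuffer().put(values);
        return Base64.getEncoder().encodeToString(bytes.array());
    }

    /** Reverses encode(). */
    static int[] decode(String text) {
        IntBuffer ints = ByteBuffer.wrap(Base64.getDecoder().decode(text)).asIntBuffer();
        int[] values = new int[ints.remaining()];
        ints.get(values);
        return values;
    }
}
```

Base64 emits 4 output chars per 3 input bytes, so with a fixed 4 bytes per int even a small value costs more than 5 chars; any size win would have to come from a variable-length byte encoding applied first.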

KvanTTT Feb 26, 2022
Author

The info about base64 was written before our discussion in #3505, where we came to the conclusion that it's not a very good idea. Plain strings take fewer bytes and are easier to encode.

KvanTTT Feb 26, 2022
Author

Interesting, it looks like .NET keeps static arrays somewhere else, not inside method bytecode, which is much more efficient: https://sharplab.io/#v2:EYLgdgpgLgZgHgGiiAlgGwD4AEAMACLARgBYBuAWACgqsAmAwgdjwG8q8OGA2PFMKANoBdPAGUoAQygoAxgEEATgokBPPAF48kAO69+w1nkII8tEwGYTxEwFY8AXwqVOedpyI8sxPAFkJfAAoASjcONmdOe1DXCI4PPUERAGEFCCkIAEl+RWUVYOjwlxcsZh0EgxY8LhNGEwAOEwBOE0J8R2ioyk6qIA

class Program {
    static int[] StaticArray = new int[] { 1, 2, 3, 4, 5 };
    
    static void Main()
    {
    }
    
    static int[] CreateIntArray()
    {
        return new int[] { 6, 7, 8, 9, 10 };
    }
}

Bytecode:

.assembly _
{
    .custom instance void [mscorlib]System.Runtime.CompilerServices.CompilationRelaxationsAttribute::.ctor(int32) = (
        01 00 08 00 00 00 00 00
    )
    .custom instance void [mscorlib]System.Runtime.CompilerServices.RuntimeCompatibilityAttribute::.ctor() = (
        01 00 01 00 54 02 16 57 72 61 70 4e 6f 6e 45 78
        63 65 70 74 69 6f 6e 54 68 72 6f 77 73 01
    )
    .custom instance void [mscorlib]System.Diagnostics.DebuggableAttribute::.ctor(valuetype [mscorlib]System.Diagnostics.DebuggableAttribute/DebuggingModes) = (
        01 00 02 00 00 00 00 00
    )
    .permissionset reqmin = (
        2e 01 80 84 53 79 73 74 65 6d 2e 53 65 63 75 72
        69 74 79 2e 50 65 72 6d 69 73 73 69 6f 6e 73 2e
        53 65 63 75 72 69 74 79 50 65 72 6d 69 73 73 69
        6f 6e 41 74 74 72 69 62 75 74 65 2c 20 6d 73 63
        6f 72 6c 69 62 2c 20 56 65 72 73 69 6f 6e 3d 34
        2e 30 2e 30 2e 30 2c 20 43 75 6c 74 75 72 65 3d
        6e 65 75 74 72 61 6c 2c 20 50 75 62 6c 69 63 4b
        65 79 54 6f 6b 65 6e 3d 62 37 37 61 35 63 35 36
        31 39 33 34 65 30 38 39 15 01 54 02 10 53 6b 69
        70 56 65 72 69 66 69 63 61 74 69 6f 6e 01
    )
    .hash algorithm 0x00008004 // SHA1
    .ver 0:0:0:0
}

.class private auto ansi '<Module>'
{
} // end of class <Module>

.class private auto ansi beforefieldinit Program
    extends [mscorlib]System.Object
{
    // Fields
    .field private static int32[] StaticArray

    // Methods
    .method private hidebysig static 
        void Main () cil managed 
    {
        // Method begins at RVA 0x2050
        // Code size 1 (0x1)
        .maxstack 8

        IL_0000: ret
    } // end of method Program::Main

    .method private hidebysig static 
        int32[] CreateIntArray () cil managed 
    {
        // Method begins at RVA 0x2052
        // Code size 18 (0x12)
        .maxstack 8

        IL_0000: ldc.i4.5
        IL_0001: newarr [mscorlib]System.Int32
        IL_0006: dup
        IL_0007: ldtoken field valuetype '<PrivateImplementationDetails>'/'__StaticArrayInitTypeSize=20' '<PrivateImplementationDetails>'::'73E6E64B64CCADF091BE8790DD4A758DB959C64B0FB0C09024A160357331B89E'
        IL_000c: call void [mscorlib]System.Runtime.CompilerServices.RuntimeHelpers::InitializeArray(class [mscorlib]System.Array, valuetype [mscorlib]System.RuntimeFieldHandle)
        IL_0011: ret
    } // end of method Program::CreateIntArray

    .method public hidebysig specialname rtspecialname 
        instance void .ctor () cil managed 
    {
        // Method begins at RVA 0x2065
        // Code size 7 (0x7)
        .maxstack 8

        IL_0000: ldarg.0
        IL_0001: call instance void [mscorlib]System.Object::.ctor()
        IL_0006: ret
    } // end of method Program::.ctor

    .method private hidebysig specialname rtspecialname static 
        void .cctor () cil managed 
    {
        // Method begins at RVA 0x206d
        // Code size 23 (0x17)
        .maxstack 8

        IL_0000: ldc.i4.5
        IL_0001: newarr [mscorlib]System.Int32
        IL_0006: dup
        IL_0007: ldtoken field valuetype '<PrivateImplementationDetails>'/'__StaticArrayInitTypeSize=20' '<PrivateImplementationDetails>'::'4F6ADDC9659D6FB90FE94B6688A79F2A1FA8D36EC43F8F3E1D9B6528C448A384'
        IL_000c: call void [mscorlib]System.Runtime.CompilerServices.RuntimeHelpers::InitializeArray(class [mscorlib]System.Array, valuetype [mscorlib]System.RuntimeFieldHandle)
        IL_0011: stsfld int32[] Program::StaticArray
        IL_0016: ret
    } // end of method Program::.cctor

} // end of class Program

.class private auto ansi sealed '<PrivateImplementationDetails>'
    extends [mscorlib]System.Object
{
    .custom instance void [mscorlib]System.Runtime.CompilerServices.CompilerGeneratedAttribute::.ctor() = (
        01 00 00 00
    )
    // Nested Types
    .class nested private explicit ansi sealed '__StaticArrayInitTypeSize=20'
        extends [mscorlib]System.ValueType
    {
        .pack 1
        .size 20

    } // end of class __StaticArrayInitTypeSize=20


    // Fields
    .field assembly static initonly valuetype '<PrivateImplementationDetails>'/'__StaticArrayInitTypeSize=20' '4F6ADDC9659D6FB90FE94B6688A79F2A1FA8D36EC43F8F3E1D9B6528C448A384' at I_00002830
    .data cil I_00002830 = bytearray (
        01 00 00 00 02 00 00 00 03 00 00 00 04 00 00 00
        05 00 00 00
    )
    .field assembly static initonly valuetype '<PrivateImplementationDetails>'/'__StaticArrayInitTypeSize=20' '73E6E64B64CCADF091BE8790DD4A758DB959C64B0FB0C09024A160357331B89E' at I_00002848
    .data cil I_00002848 = bytearray (
        06 00 00 00 07 00 00 00 08 00 00 00 09 00 00 00
        0a 00 00 00
    )

} // end of class <PrivateImplementationDetails>

parrt Feb 26, 2022
Maintainer

They were able to avoid the problems with Java's class file format because they came later; lucky bastards! Anyway, it means that we should be using static short or integer arrays for C# as well, rather than strings.

jcking · 2022-03-04T17:42:38Z

jcking
Mar 4, 2022
Collaborator

Looks like serialized ATN changes broke compilation for some grammars. This will need to be fixed before 4.10. @KvanTTT

CELLexer.java:136: error: unmappable character (0xB6) for encoding UTF-8
		"!�\n!\3!\6!�\n!\r!\16!�\3\"\3\"\3#\3#\3$\3$\3$\3$\5$�\n$\3%\3%\3%\3&\3"+

8 replies

parrt Mar 5, 2022
Maintainer

Line 377 of Target, return String.valueOf(c);, assumes UTF-8 encoding for the target source, but many setups assume ASCII unless they use Unicode rule names. I'll have to back out this bit.

parrt Mar 5, 2022
Maintainer

Hmm... @KvanTTT, I can't figure out this isPreviousOctal field. I'm gonna remove it and then try to convert the targets to use ints.

KvanTTT Mar 7, 2022
Author

It resolves ambiguities when chars are emitted into string literals. For example, \111 can be read as a single octal escape (decimal 73) or as \11 (decimal 9) followed by the char 1.
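One way such a flag could be used is sketched below in Java. This is a hypothetical illustration of the ambiguity, not the actual code from Target: the class `OctalEscaper`, the method `escape` and the local `previousOctal` are made-up names. The sketch emits printable ASCII as-is and other chars up to 0xFF as the shortest octal escape. When a short escape is directly followed by an octal digit, the digit is escaped too, so the two cannot merge into one longer escape.

```java
final class OctalEscaper {
    /** Escapes s for use inside a Java string literal, using short octal escapes. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        boolean previousOctal = false;   // last item was an octal escape shorter than 3 digits
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean octalDigit = c >= '0' && c <= '7';
            boolean printable = c >= 0x20 && c < 0x7F && c != '"' && c != '\\';
            if (printable && !(previousOctal && octalDigit)) {
                out.append(c);
                previousOctal = false;
            } else if (c <= 0xFF) {
                String octal = Integer.toOctalString(c);
                out.append('\\').append(octal);
                // A 3-digit escape cannot absorb a following digit, shorter ones might.
                previousOctal = octal.length() < 3;
            } else {
                // Chars above 0xFF need a 4-digit hexadecimal Unicode escape.
                out.append(String.format("\\u%04X", (int) c));
                previousOctal = false;
            }
        }
        return out.toString();
    }
}
```

For a tab (decimal 9) followed by the char 1, this produces \11\61 rather than the ambiguous \111. Padding every octal escape to three digits would avoid the flag at the cost of longer output.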

KvanTTT Mar 7, 2022
Author

@jcking could you please share a minimal grammar sample that fails? Or add PR with that.

parrt Mar 7, 2022
Maintainer

He and I worked it out, and I altered the code to step back a bit from your optimization and always use ASCII, to avoid encoding issues on disk. It will not be a problem soon, since we are changing from string to int[] everywhere.
