Replies: 7 comments 17 replies
-
There is a more compact scheme that operates not bytes but half-bytes. It's almost the same but most frequent values like -1, 0, 1 take 0.5 bytes instead of 1. Data size is especially important for targets where the code is being sent over the internet (JavaScript, WASM especially).
Reserved values can be also used for frequent data. Also, Haffman encoding can be used to decrease ATN size more. But it requires a table and it has a more complicated implementation. |
Beta Was this translation helpful? Give feedback.
-
Thanks for documenting this idea. It could prove useful in a future version. |
Beta Was this translation helpful? Give feedback.
-
What if we did something crazy like make an array of ints, compress it, then save 2-byte pairs as chars for Java target? Expensive to decompress at runtime I guess. |
Beta Was this translation helpful? Give feedback.
-
Have we re-tried an array of ints ? |
Beta Was this translation helpful? Give feedback.
-
@KvanTTT I think that @ericvergnaud was talking about 32-bit ints. I would dearly love to get rid of this whole encoding as strings crap. in fact no target should do this unless they have to. C++ and Go have these beautiful little integer arrays. let me go do an experiment but I'm almost certain that they have not changed the class file format such that they can do static arrays without creating massive constructor functions. These functions take like five byte codes per integer to initialize the array... kind of slow anyway. |
Beta Was this translation helpful? Give feedback.
-
Dang. Yep, same same.
|
Beta Was this translation helpful? Give feedback.
-
Looks like serialized ATN changes broke compilation for some grammars. This will need to be fixed before 4.10. @KvanTTT
|
Beta Was this translation helpful? Give feedback.
-
Advanced encoding format
With the current ATN serialization format, any int value takes >= 2 bytes (16 bit). It's not optimal because most values not so big and they don't require all 16 bits. I suggest using varying bytes count per any int value: minimal is 1, maximum is 4. This allows much better compression. The encoding format may look the following (it looks like everything we need for ATN):
This scheme decreases the size of output data since most values are small. On the other hand, it supports big ATN without a 16-bit restriction on any int value. Also, it works fast.
This format is similar to MessagePack binary format but it's simpler since it requires only int type and one value for bad number (-1).
Array of long or base64 string instead of the current string encoding
With the old scheme, serializer and deserialized have to increment or decrement strange value (2). As I understand this is because of frequent value
0xFFFF
becomes0x1
and it takes only 1 byte for storage. But it impairs clarity and causes bugs like #1925. Binary data should not depend on string representation.With the new encoding format, there is no need for such optimization because
0xFFFF
is serialized to 1 byte as well (0b11111111).Since optimization is not a problem, I suggest using more standard string encoding like base64 string because it looks more natural and doesn't require inc/dec hacks. Also, all popular languages support base64 encoding. Also, array of long can be used since it has the most compact string representation compared to smaller int types.
Beta Was this translation helpful? Give feedback.
All reactions