
Private string length

Adam Newbold edited this page Apr 26, 2021 · 1 revision

Q: Can you say what the expected string length of the encoded private string would be, given a private string of N char? Is there a formula?

A: Wow, great question. I can come up with a general formula for a maximum potential length based on bytes of input:

p = (b * 9) - 1

Where p is the maximum length of the encoded private string, in hidden characters, and b is the number of bytes in the unencoded private string.

To make sense of this, it helps to walk through the conversion process. Each byte of raw private input is converted to a binary value, which can take up to 8 binary digits. The reason I say "up to" is that not every byte needs all 8 digits (a full octet). ASCII characters never do: they take at most 7 digits, and many take 6 or fewer, depending on the character. For instance:

  1. The number 7 converts to 110111, which takes 6 digits.
  2. The letter G converts to 1000111, which takes 7 digits.
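The two conversions above can be reproduced with a minimal Python sketch. The helper name `to_binary` is made up for illustration, and this is not Steganographr's own code; it only shows the unpadded binary form of a single ASCII character.

```python
def to_binary(ch: str) -> str:
    """Return the binary digits of a single ASCII character, without zero-padding."""
    return format(ord(ch), "b")


for ch in "7G":
    digits = to_binary(ch)
    print(ch, digits, len(digits))
# 7 110111 6
# G 1000111 7
```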

Because the binary representations vary in length, Steganographr needs to add a special character between each encoded byte so it knows where one binary representation ends and the next begins. That adds one extra character of encoded data between each pair of adjacent binary groups.

An example of multibyte/Unicode input is the character は, which converts to 111000111000000110101111. That is actually three separate bytes in UTF-8, each represented by a full 8-digit octet. Separated with spaces, it looks like 11100011 10000001 10101111.
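To see where those three octets come from, here is a small Python sketch that encodes a character as UTF-8 and prints each byte in binary. The function name `utf8_octets` is hypothetical, and UTF-8 is assumed as the input encoding because it matches the digits shown above.

```python
def utf8_octets(ch: str) -> list:
    """Return the binary digits of each UTF-8 byte of `ch`, without zero-padding."""
    return [format(byte, "b") for byte in ch.encode("utf-8")]


octets = utf8_octets("は")
print(len(octets))       # 3
print(" ".join(octets))  # 11100011 10000001 10101111
```

Every byte of a multibyte UTF-8 sequence has its top bit set, which is why each group here is a full 8 digits even without padding.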

If every converted byte took a full octet, I wouldn't need to worry about a spacer character: I could just smash all of the binary encoded data together, and when converting back I'd break it apart by splitting it at every 8th character. But combining the inputs from the examples above ("7Gは"), you can see how you'd get converted binary data of different lengths, hence the need for the spacer.
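A rough Python sketch of the combined "7Gは" case makes the uneven group lengths visible. It uses a plain space as a stand-in for the spacer and ordinary 0/1 digits in place of the hidden characters; `encode_with_spacers` is an illustrative name, not a Steganographr function.

```python
def encode_with_spacers(text: str, spacer: str = " ") -> str:
    """Join the unpadded binary form of each UTF-8 byte with a spacer."""
    return spacer.join(format(byte, "b") for byte in text.encode("utf-8"))


encoded = encode_with_spacers("7Gは")
print(encoded)       # 110111 1000111 11100011 10000001 10101111
print(len(encoded))  # 41

# Splitting on the spacer recovers the original bytes, whatever their widths.
decoded = bytes(int(group, 2) for group in encoded.split(" ")).decode("utf-8")
print(decoded)       # 7Gは
```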

So, back to the formula: multiplying the number of bytes by 9 accounts for a maximum of 8 binary digits plus 1 spacer character per byte, but there is no spacer after the final original byte (which is why the - 1 part is there at the end).
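As a sanity check on the formula, this sketch compares the upper bound with a measured length for the same input. Both function names are assumptions for illustration, and returning 0 for empty input is my own choice, since the formula itself would give -1 there.

```python
def max_encoded_length(text: str) -> int:
    """Upper bound p = (b * 9) - 1, where b is the UTF-8 byte count."""
    b = len(text.encode("utf-8"))
    return b * 9 - 1 if b else 0


def actual_encoded_length(text: str) -> int:
    """Count the binary digits of every byte plus one spacer between bytes."""
    groups = [format(byte, "b") for byte in text.encode("utf-8")]
    return sum(len(g) for g in groups) + max(len(groups) - 1, 0)


sample = "7Gは"
print(max_encoded_length(sample))     # 44
print(actual_encoded_length(sample))  # 41
```

The gap of 3 comes from the two ASCII characters, which need 6 and 7 digits instead of 8.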

Everything above discussed bytes, but what about characters, which is what you actually asked about? Well, for all of the reasons mentioned above, the answer is still "it depends". In the case of Japanese, though, things are pretty predictable: everything in the Unicode CJK Unified Ideographs block takes 3 bytes in UTF-8 (so, that covers the kanji in everyday use), and all hiragana and katakana also take 3 bytes. Every one of those bytes is a full octet, so you can make an exact formula that gives the length of the encoded message based on the number of Japanese characters:

p = (c * 27) - 1 (where c is the number of Japanese characters, each contributing 3 bytes * 9 = 27)

Putting this to the test:

  • Private input of 1 character, 水: 1 * 27 - 1 = 26 and Steganographr encodes this to 26 hidden characters
  • Private input of 2 characters, です: 2 * 27 - 1 = 53 and Steganographr encodes this to 53 hidden characters
  • Private input of 9 characters, これはテストです。: 9 * 27 - 1 = 242 and Steganographr encodes this to 242 hidden characters
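The same three checks can be reproduced offline with a short Python sketch. It assumes UTF-8 input and the digits-plus-spacer scheme described earlier; `predicted_length` and `measured_length` are made-up names, and nothing here calls Steganographr itself.

```python
def predicted_length(text: str) -> int:
    """27 per character (3 bytes * 9), minus the spacer that never follows the last byte."""
    return len(text) * 27 - 1


def measured_length(text: str) -> int:
    """Length of the unpadded binary groups joined by single-character spacers."""
    return len(" ".join(format(byte, "b") for byte in text.encode("utf-8")))


for sample in ["水", "です", "これはテストです。"]:
    print(sample, predicted_length(sample), measured_length(sample))
# 水 26 26
# です 53 53
# これはテストです。 242 242
```

The prediction only holds when every character is exactly 3 bytes in UTF-8; mixing in ASCII or 4-byte characters will break it.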

The final takeaway here is that longer private messages can generate really large strings. 😄
