Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible sort issue for times approaching 1970 #10

Open
bplatz opened this issue Jun 22, 2016 · 15 comments
Open

Possible sort issue for times approaching 1970 #10

bplatz opened this issue Jun 22, 2016 · 15 comments

Comments

@bplatz
Copy link

bplatz commented Jun 22, 2016

When base-encoding the bigint:

Since the "left" 64 bits are using milliseconds since epoch, and are technically allowed to go up to a value of (Long/MAX_VALUE), and also go down to zero, the character count of the final encoded bigint value can be as many as 22, but also much lower (if I've interpreted the source correctly).

Assuming the goal is always ensuring proper sort order, the value needs to left-padded with "0"s to the max character length (22 if I'm not mistaken).

@maxcountryman
Copy link
Owner

Currently the epoch is constructed from an arbitrary point in time (much later than 1970--somewhere close in time to when the process starts, in most cases). In other words, we aren't using the UNIX epoch anymore.

Adding padding doesn't sound like a bad idea if indeed the BigInt isn't already padded. I'm curious if there's a particular source that describes this behavior?

@bplatz
Copy link
Author

bplatz commented Jun 22, 2016

Just looking at the 64 bit timestamp, currently it would result in 7 base-62 encoded characters.

It won't be until 2082 it will jump to 8 characters long (from the current 7 characters):
(-> (java.time.Instant/parse "2082-01-01T00:00:00Z")
(.toEpochMilli)
(base62-encode)
count)
;;=> 8

I haven't checked it out for the full bytebuffer at 128 bits to see when the next digit appears, but it is possible it is a ways out. So your call.

If someone chooses this flake format and needs to generate past or future ids (for whatever reason) this issue increases.

Your base encoding function generates ids based on number size only, so nothing is padded.

@maxcountryman
Copy link
Owner

Perhaps the caller should pad the input? It might be dangerous to assume the padding should be implicitly padded. What do you think?

@bplatz
Copy link
Author

bplatz commented Jun 22, 2016

I think people will assume the encoded id maintains sort order, so I'd pad it.

I'd let them not pad it if they don't want the sort order guarantee.

The padding (assuming it is with zeros) won't affect decoding the ID back into the original bigint, so the integrity of the id is not affected.

@bplatz
Copy link
Author

bplatz commented Jun 22, 2016

FWIW, here is a pad function I use... I'm sure there is an optimization in there to be had:

(defn pad-left
  "Pad a string s to a length of l with character c."
  [s l c]
  (-> (- l (count s))
      (repeat c)
      (concat [s])
      (#(apply str %))))

@maxcountryman
Copy link
Owner

What length is the correct length to pad to in your opinion? My understanding is BigIntegers don't have a MAX_VALUE and instead are backed by an array of ints and thus their size limitations are correlated with available memory.

@bplatz
Copy link
Author

bplatz commented Jun 22, 2016

Your id is 128 bits, so 22 characters.

(-> (Math/pow 2 128)
    (dec)
    (base62-encode)
    (count))
;;=> 22

@maxcountryman
Copy link
Owner

Ah right, of course! Long day. I suppose after whatever point the BigInteger exceeds that length the IDs no longer work, but in theory that's far enough into the future to not matter.

@bplatz
Copy link
Author

bplatz commented Jun 22, 2016

Yes, very very far into the future, which is why I mentioned before you can safely add precision greater than millis to your first 64 bits, if you wanted and were OK with a change to your ids... but at this point seems you are making some changes anyhow. :)

@maxcountryman
Copy link
Owner

Right, I think increasing the precision is fine, since in theory it doesn't break anything. Moving around the bits on the other hand does. :)

By the way, I think the current character count varies from your estimate: generating IDs locally and coercing them to BigIntegers yields character counts of 32.

user=> (count (str (flake/flake->bigint (flake/generate!))))
32

@bplatz
Copy link
Author

bplatz commented Jun 22, 2016

That is the length of your integer, and you should have no issues with the integer sort order.

When you base-62 encode the integer, it is the resulting string sort order that can become incorrect.

Per your example, this should apply (I didn't test it though):

(count (encode/base62-encode (flake/generate!)))

@maxcountryman
Copy link
Owner

Doesn't it defeat the purpose of encoding if we end up making the encoded ID longer?

@maxcountryman
Copy link
Owner

I guess what I'm not clear about here is how we lose lexicographic ordering if the ID becomes longer (or shorter) relative to the timestamp bits?

I don't think I've understood your original report and possibly it's just too late in the day and I'm missing the obvious. :)

@bplatz
Copy link
Author

bplatz commented Jun 22, 2016

I'm only referring to the base-62 encoded flake id, not the bigint which should sort fine.

The smaller your bigint, the shorter the characters... following is an example.

(base62-encode 10)
;;=> "a"

(base62-encode 100)
;;=> "1C"

(sort ["1C" "a"])
;;=> ("1C" "a") ;;  wrong!!

(sort ["1C" "0a"])
;;=> ("0a" "1C") ;;  right!! - padded with a "0" to be same length.

@maxcountryman
Copy link
Owner

Okay, so your original issue was that if a timestamp occurred around 1970 IDs, specifically BigIntegers, would be short in length and thus sort incorrectly.

Makes sense and yes I agree that it's probably worth padding. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants