Skip to content
This repository has been archived by the owner on Jul 9, 2024. It is now read-only.

Discussion: Unix Timestamp Granularity in UUIDv7 #23

Open
kyzer-davis opened this issue Aug 10, 2021 · 55 comments
Open

Discussion: Unix Timestamp Granularity in UUIDv7 #23

kyzer-davis opened this issue Aug 10, 2021 · 55 comments
Labels
Discussion Further information is requested UUIDv7 All things UUIDv7 related

Comments

@kyzer-davis
Copy link
Contributor

kyzer-davis commented Aug 10, 2021

Question: Too many or too much?

Current Draft-02 implementation:

  • 36 bits to avoid the year 2038 problem.
  • If 32 bit timestamp, pad left/start of 32 bit to create 36 bit value in the UUIDv7
  • if 64 bit timestamp, truncate to last, right-most 36 bits for use in UUIDv7
  • 36 bits provides timestamp values up to the year 4147

Alternatives Proposed:

  • 34 bit timestamp, giving 2 bits back to entropy/node/randomness, valid up to the year 2514 source
@kyzer-davis
Copy link
Contributor Author

kyzer-davis commented Aug 10, 2021

I vote to leave at 36 in the current implementation, I would rather future proof as much as possible with a larger number than reduce the size and truncate a 64-bit timestamp more.

That being said: Depending on how we choose to encode subsec_a based on #24 we may end up with two more additional bits that are always padded. Might as well give them back to the unixts to go to 38 if that is the case.

@bradleypeabody
Copy link
Contributor

Yes, the 36-bit length was definitely done intentionally in order to avoid running out of numbers - but I do agree it's annoying to use.

Another option would be to use a 32-bit unsigned integer, which runs out in the year 2106. I'm not super excited about an end date that is close enough for my grandchildren to curse me for when it runs out on them, but it's an option and would definitely be simpler to use.

I'll write more on this on #24 in a moment.

@nerg4l
Copy link

nerg4l commented Aug 13, 2021

Bits Signedness End year
32 Signed 2038
32 Unsigned 2106
33 Signed 2106
33 Unsigned 2242
34 Signed 2242
34 Unsigned 2514
35 Signed 2514
35 Unsigned 3058
36 Signed 3058
36 Unsigned 4147

I think 32 bits even if unsigned are not enough. 2106 is not that far in the future.

I can also see the benefit of keeping the value unsigned because generating a UUID before 1970-01-01 shouldn't be a real life use case.

I definitely want to avoid introducing a new epoch. The Unix epoch makes implementations much easier. I don't even know why did they decide to use the Gregorian epoch in case of UUIDv1.

For me anything from 2514 looks good.

@bradleypeabody
Copy link
Contributor

bradleypeabody commented Aug 13, 2021

@nerg4l @kyzer-davis If we did the 111 var field from #26 for UUID7, that gives us the first 64 bits clear and free of any other requirements. What if we, despite how clever I thought I was being with this subsecond encoding stuff (I still like the idea but I don't like how much confusion it has created), instead just use a 64-bit nanosecond-precise unix timestamp. I'm not sure if this shoudl replace UUIDv7 or maybe be a separate version (we could do UUID9 ;) ). But regardless, I'm just trying to see if there is a way to simplify this whole thing down so when people read the spec it's like "oh of course, that makes sense" instead of "what the..."

So the new bit layout would be:

   0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    unix_nano_timestamp_u64                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    unix_nano_timestamp_u64                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | var |  ver    |    seq         |        rand                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                            rand                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Where unix_nano_timestamp_u64 is an unsigned 64-bit unix-like timestamp (but instead of seconds since the epoch it's nanoseconds since the epoch), seq is 8 bits for sequence and rand is random. Max time with this is 2554-07-21 23:34:33 UTC

And then we basically just say basically "implementations can parse out the time if the they like, but the other fields are essentially write-only and are a recommendation - don't try to read these because it's not invalid for implementations make changes after the ver-var field. And factually if implementations wanted to encode random stuff into the least significant bytes of the nanosecond-precise timeestamp - I don't think that should even be explicitly forbidden - it's just like "if you do this, other implementations may get the wrong timestamp - use at your own risk".

I think this would be dramatically simpler and easier to understand.

@nerg4l
Copy link

nerg4l commented Aug 14, 2021

@bradleypeabody I think the variable subsecond handling in UUIDv7 was added because not all languages can return unix timestamp with nanosecond precision eg. JavaScript, PHP.

Changing the position of the version bit would affect some wildly used libraries. Here are some examples:

Edit: It is always quite unfortunate when there are already existing implementations.

@fabiolimace
Copy link

fabiolimace commented Aug 14, 2021

Another important thing to keep in mind is how many UUIDs can be generated per unit of time. The more of a type of UUID that can be generated in a given interval, the better.

I did some simple math to get the maximum number of UUIDs version 1 that can be generated in a nanosecond interval. I treated all non-timestamp bits as random bits and obtained the value 461168601842738816 (2**62/10). So I used this number as a reference for comparison with the other UUID types. A fraction called "proportion" is used to express the comparison numerically.

These are the proportions:

  • UUIDv1: 1.00 (reference)
  • COMBv4: 0.04
  • UUIDv7-34b-sec: 0.67
  • UUIDv7-36b-sec: 0.17
  • UUIDv7-44b-msec: 0.66
  • ULID: 2.62 (cheats using 6 extra bits)

I was about to propose an alternative UUIDv7 with 44 bits for timestamp and millisecond precision. But when I did that calculation, I saw how difficult it is to find another layout that beats UUIDv1.

This are the maths:

UUIDv1
Time unit: 100-ns
Time bits: 60
Time limit: 5623 AD (260 / 107 / 60 / 60 / 24 / 365.25 + 1970)
Remaining bits: 62 (128-60-4-2)
UUIDs per nanosecond: 461168601842738816 (262 / 10)
Proportion: 1.00 (used as reference)

COMBv4 (with version and variant bits):
Time unit: ms
Time bits: 48
Time limit: 10889 AD (2
48 / 1000 / 60 / 60 / 24 / 365.25 + 1970)
Remaining bits: 74 (128 - 48 - 4 - 2)
UUIDs per nanosecond: 18889465931478580 (274 / 106)
Proportion: 0.04 (18889465931478580 / 461168601842738816)

UUIDv7-34b-sec (34 bits for seconds)
Time unit: s
Time bits: 34
Time limit: 2514 AD (234 / 60 / 60 / 24 / 365.25 + 1970)
Remaining bits: 88 (128 - 34 - 4 - 2)
UUIDs per nanosecond: 309485009821345088 (2
88 / 109)
Proportion: 0.67 (309485009821345088 / 461168601842738816)

UUIDv7-36b-sec (36 bits for seconds)
Time unit: s
Time bits: 36
Time limit: 4147 AD (2
36 / 60 / 60 / 24 / 365.25 + 1970)
Remaining bits: 86 (128 - 36 - 4 - 2)
UUIDs per nanosecond: 77371252455336272 (286 / 109)
Proportion: 0.17 (77371252455336272 / 461168601842738816)

UUIDv7-44b-msec (44 bits for MILLIS)
Time unit: ms
Time bits: 44
Time limit: 2527 AD (244 / 1000 / 60 / 60 / 24 / 365.25 + 1970)
Remaining bits: 78 (128 - 44 - 4 - 2)
UUIDs per nanosecond: 302231454903657280 (2
78 / 106)
Proportion: 0.66 (302231454903657280 / 461168601842738816)

ULID (no version or variant bits):
Time unit: ms
Time bits: 48
Time limit: 10889 AD (2
48 / 1000 / 60 / 60 / 24 / 365.25 + 1970)
Remaining bits: 80 (128 - 48)
UUIDs per nanosecond: 1208925819614629120 (280 / 106)
Proportion: 2.62 (1208925819614629120 / 461168601842738816)

@edo1
Copy link

edo1 commented Aug 14, 2021

	UUIDs per nanosecond: 461168601842738816 (2**62 / 10)

Wouldn't it be

	UUIDs per nanosecond: 46116860184273879 (2**62 / 100)

?

BTW, your calculator is wrong a bit.
https://play.golang.org/p/B-rzKVAJhNE

@fabiolimace
Copy link

fabiolimace commented Aug 14, 2021

@edo1

Yes. I made a silly mistake in the beginning. All the rest of the calculation is wrong! I will fix this.

Thanks for noticing!

@fabiolimace
Copy link

fabiolimace commented Aug 14, 2021

Now any UUIDv7 that we have talked about is better than UUIDv1. I use Python's Decimal() here to avoid division errors.

  • UUIDv1: 1.00 (reference)
  • COMBv4: 0.41
  • UUIDv7-34b-sec: 6.71
  • UUIDv7-36b-sec: 1.68
  • UUIDv7-44b-msec: 6.55
  • ULID: 26.21 (cheats using 6 extra bits)
  • UUIDv7-40b-10msec: 10.49
  • UUIDv7-48b-100msec: 4.10

Note: in this simple (naive?) calculation, all bits that are not timestamps are considered random.

UUIDv1
	Time unit: 100-ns
	Time bits: 60
	Time limit: 5623 AD (2**60 / 10**7 / 60 / 60 / 24 / 365.25 + 1970)
	Remaining bits: 62 (128-60-4-2)
	UUIDs per nanosecond: 46116860184273879.04 (Decimal(2**62) / Decimal(100))
	Proportion: 1.00 (used as reference)

COMBv4 (with version and variant bits):
	Time unit: ms
	Time bits: 48
	Time limit: 10889 AD (2**48 / 1000 / 60 / 60 / 24 / 365.25 + 1970)
	Remaining bits: 74 (128 - 48 - 4 - 2)
	UUIDs per nanosecond: 18889465931478580.854784 (Decimal(2**74) / Decimal(10**6))
	Proportion: 0.41 (Decimal(18889465931478580.854784) / Decimal(46116860184273879.04))
	
UUIDv7-34b-sec (34 bits for seconds)
	Time unit: s
	Time bits: 34
	Time limit: 2514 AD (2**34 / 60 / 60 / 24 / 365.25 + 1970)
	Remaining bits: 88 (128 - 34 - 4 - 2)
	UUIDs per nanosecond: 309485009821345068.724781056 (Decimal(2**88) / Decimal(10**9))
	Proportion: 6.71 (Decimal(309485009821345068.724781056) / Decimal(46116860184273879.04))
	
UUIDv7-36b-sec (36 bits for seconds)
	Time unit: s
	Time bits: 36
	Time limit: 4147 AD (2**36 / 60 / 60 / 24 / 365.25 + 1970)
	Remaining bits: 86 (128 - 36 - 4 - 2)
	UUIDs per nanosecond: 77371252455336267.181195264 (Decimal(2**86) / Decimal(10**9))
	Proportion: 1.68 (Decimal(77371252455336267.181195264) / Decimal(46116860184273879.04))
	
UUIDv7-44b-msec (44 bits for MILLIS)
	Time unit: ms
	Time bits: 44
	Time limit: 2527 AD (2**44 / 1000 / 60 / 60 / 24 / 365.25 + 1970)
	Remaining bits: 78 (128 - 44 - 4 - 2)
	UUIDs per nanosecond: 302231454903657293.676544 (Decimal(2**78) / Decimal(10**6))
	Proportion: 6.55 (Decimal(302231454903657293.676544) / Decimal(46116860184273879.04))
	
ULID (no version or variant bits):
	Time unit: ms
	Time bits: 48
	Time limit: 10889 AD (2**48 / 1000 / 60 / 60 / 24 / 365.25 + 1970)
	Remaining bits: 80 (128 - 48)
	UUIDs per nanosecond: 1208925819614629174.706176 (Decimal(2**80) / Decimal(10**6))
	Proportion: 26.21 (Decimal(1208925819614629174.706176) / Decimal(46116860184273879.04))
	
UUIDv7-40b-10msec (40 bits for 10-MILLIS)
	Time unit: 10-ms
	Time bits: 40
	Time limit: 2318 AD (2**40 / 10**2 / 60 / 60 / 24 / 365.25 + 1970)
	Remaining bits: 82 (128 - 40 - 4 - 2)
	UUIDs per nanosecond: 483570327845851669.8824704 (Decimal(2**82) / Decimal(10**7))
	Proportion: 10.49 (Decimal(483570327845851669.8824704) / Decimal(46116860184273879.04))
	
UUIDv7-48b-100usec (48 bits for 100-MICROS)
	Time unit: 100-us
	Time bits: 48
	Time limit: 2861 AD (2**48 / 10**4 / 60 / 60 / 24 / 365.25 + 1970)
	Remaining bits: 74 (128 - 48 - 4 - 2)
	UUIDs per nanosecond: 188894659314785808.54784 (Decimal(2**74) / Decimal(10**5))
	Proportion: 4.10 (Decimal(188894659314785808.54784) / Decimal(46116860184273879.04))

Using the General Birthday Formula

The General Birthday Formula applied to the UUID types:

UUID TYPE                    UUIDS PER NANOSECONDS  GENERAL BIRTHDAY FORMULA
UUIDv1             Decimal(2**62) / Decimal(10**2)        252846877.03432938
COMBv4             Decimal(2**74) / Decimal(10**6)        161822001.30197080
UUIDv7-34b-sec     Decimal(2**88) / Decimal(10**9)        655009407.54042970
UUIDv7-36b-sec     Decimal(2**86) / Decimal(10**9)        327504703.77021486
UUIDv7-44b-msec    Decimal(2**78) / Decimal(10**6)        647288005.20788320
ULID               Decimal(2**80) / Decimal(10**6)       1294576010.41576650
UUIDv7-40b-10msec  Decimal(2**82) / Decimal(10**7)        818761759.42553700
UUIDv7-48b-100usec Decimal(2**74) / Decimal(10**5)        511726099.64096063

The General Birthday Formula:

import math
from decimal import Decimal

def birthday_formula(t):
     return Decimal(math.sqrt(-2 * math.log(1-1/2)) * math.sqrt(t))
     
# Example on terminal
>>> uuid1 = Decimal(2**62) / Decimal(10**2)
>>> birthday_formula(uuid1)
Decimal('252846877.0343293845653533935546875')

The formula calculates the number of UUIDs which need to be generated in order to have a 50% probability of at least one collision.

Sources:

Update (2021-08-15)

  • Added the UUIDv7 with 40 bits for 10-ms precision
  • Added the UUIDv7 with 48 bits for 100-us precision
  • Added a table with values calculated from the General Birthday Formula.

@edo1
Copy link

edo1 commented Aug 14, 2021

I vote to leave at 36 in the current implementation

IMO the random part should be as big as possible so keeping 4 leading zero bits is a weird idea.

I think 32 bits even if unsigned are not enough. 2106 is not that far in the future.

Agree. The next option is 33, but the odd number of bits is a bit odd, lol.

UUIDv7-44b-msec (44 bits for MILLIS)

I like it.

UUIDv7-34b-sec (34 bits for seconds)

Is ok too.

@fabiolimace
Copy link

fabiolimace commented Aug 14, 2021

I played around with the 40-bit timestamps suggested by Brad in another issue.

If we use 40 bits with 10-milliseconds or centisecond precision:

  • The timestamp, sequence and random parts are aligned with bytes.
  • The precision is about 20 times shorter than human response times for simple reaction tasks, which are typically on the order of 200 ms. So I think the sort order will be appropriate in chat-like apps where people react to each other's messages.
  • The database will be happy with this precision (right?). If more precision is needed, the next 20 bits can be used (?), simulating a 10-nanosecond precision.
  • The random part is much larger than the 48 bits reserved for the node ID in UUIDv1.
  • The random part needs to be updated 10 times less frequently, consuming less entropy from the underling system, assuming that the random part should change whenever the timestamp changes.

Layout:

|             64 bits             |            64 bits             |
	
|--------------------|----VV------|R-------------------------------|

|    unixts / 100    |  sequence  |            random              |
                       24 - 4 bits
|       40 bits      |  = 20 bits |     64 - 2 bits =  62 bits     |


- : 2 bits
VV: 4 bits for version
R : 2 bits for variant

UUIDv7-40b-10msec (40 bits for 10-MILLIS)
    Time unit: 10-ms
    Time bits: 40
    Time limit: 2318 AD (2**40 / 10**2 / 60 / 60 / 24 / 365.25 + 1970)
    Remaining bits: 82 (128 - 40 - 4 - 2)
    UUIDs per nanosecond: 483570327845851669.8824704 (Decimal(2**82) / Decimal(10**7))
    Proportion: 10.49 (Decimal(483570327845851669.8824704) / Decimal(46116860184273879.04))
                ^^^^^

Let me know what you think and if I made another mistake.

PS: the layout shown here is based on ULID with sequence presented by @sergeyprokhorenko in the issue #27.

@edo1
Copy link

edo1 commented Aug 14, 2021

If we use 40 bits with 10-milliseconds or centisecond precision:

LGTM

The database will be happy with this precision (right?).

  1. Monotonic UUID (centralized generation only)
    The resolution doesn't matter at all.
  2. Non-monotonic UUID (both centralized and distributed generation)
    2.1. Regular (non-clustered) index
    IMO 10-ms resolution should be enough in almost any case.
    On very busy systems, the locality may not be good enough, but the timestamp can be easily extended as you suggest.
    2.2. Clustered index
    IMO there is no silver bullet, but improving timestamp resolution could help.

assuming that the random part should change whenever the timestamp changes.

I don't like this idea. This allows you to predict the next UUID value, which could be a security issue.

@sergeyprokhorenko
Copy link

sergeyprokhorenko commented Aug 15, 2021

I think that human response time is an outdated benchmark. We'd better think about data streems from many IoT sourses. Therefore the more unique are timestamp + sequence the better, because incoming UUIDs are more ordered. In most cases we can concider clock sequence empty or null. So the more different timestamps per second the better.

Therefore we'd better take timestamp precision not more than round trip network request in same data centre (500 microseconds). But we have to pay for high precision by increasing the bit depth of the timestamp. So I think the 1 millisecond timestamp precision is enough.

@fabiolimace
Copy link

I think that human response time is an outdated benchmark. We'd better think about data streems from many IoT sourses.

Yes. It's important to keep IoT in mind.

I updated my comment #23 (comment) to include two new timestamp alternatives:

  • timestamp with 40bits and 10-Milli precision (up to year 2318 AD)
  • timestamp with 48bits and 100-Micro precision (up to year 2861 AD)

Also included a table with values generated using the General Birthday Formula. All the bits that are not from timestamp are considered as random in that calculation.

@edo1
Copy link

edo1 commented Aug 15, 2021

Also included a table with values generated using the General Birthday Formula.

Unsure that your calculation is correct.

IMO it is better to calculate something like annual collision probability when every nanosecond new UUID is generated.

I use this formula from wiki to calculate probability per tick:
image
With large n we can simplify the probability of no collision to exp(-n^2/2d).

For UUIDv7-40b-10msec n=10e6, d=2^82; the probability of no collision per tick is
image

The annual (100*3600*24*365 ticks per year) probability of collision is
image ≈3.2%

For UUID v4 it will be one big tick, n=1e9*3600*24*365, d=2^122;
image ≈0.009% (a lot better!)

@fabiolimace
Copy link

fabiolimace commented Aug 15, 2021

Excellent, @edo1!

Now we have a mathematical tool to help us evaluate all the options and trade offs involved. I don't care if my calculations were wrong if it encouraged someone to do something a lot better.

@edo1
Copy link

edo1 commented Aug 16, 2021

Have made a spreadsheet with some UUID variants.

  • UUID v1, UUID v4, ULID — existing standards;
  • UUID v7 — this draft, 122 effective bits (6 bits reserved: 4 version + 2 variant); several timestamp resolutions are considered;
  • UUID variant 3 — similar to the previous one, but the version field is omitted (as mentioned in Discussion: Redefine variant bit (111) definition #26 (comment)), only 2 bits (variant) are reserved;
  • UUID var3/whole bytes — similar to the previous one, but the length of the timestamp is a multiple of a byte.

In all cases, all effective non-timestamp bits are considered random.

The target is P collisions/year: the annual collision probability when every nanosecond new UUID is generated. Best variants in each group (UUID v7, UUID variant 3, UUID var3/whole bytes) are marked in green.

You could make a copy of the spreadsheet and play around with constants/formulas.
Feel free to comment or ask any questions.

@fabiolimace
Copy link

fabiolimace commented Aug 16, 2021

Great, @edo1 !

We can adjust the life span of the UUIDv7 in the field "years after now", right? It's interesting how the number of years affects the probability of collision. But why don't we use 200 years?

The life span of 120 shows that UUIDv7 with 39 bits and centisecond precision is the best option. But I prefer not using this number of bits to prevent someone storing this type of UUID in a SIGNED integer field. If this occurs, the "last date" becomes 2057, not 2144.

If we adjust the life span to 200 years, the "best" option becomes 40 bits (5 whole bytes), and the last date becomes 2318 (unsigned) or 2144 (signed)(witch is 123 years after now).

Please change the fields "stolen bits" of UUID variant 3 to 3. The bit sequence for this variant is '111'.

@edo1
Copy link

edo1 commented Aug 16, 2021

We can adjust the life span of the UUIDv7 in the field "years after now", right? It's interesting how the number of years affects the probability of collision. But why don't we use 200 years?

You can make your own copy and change anything
image

Or you can download the spreadsheet in the Excel/OpenOffice format and change the numbers locally.

But I prefer not using this number of bits to prevent someone storing this type of UUID in a SIGNED integer field

Any reason to use signed int here?
A 64-bit signed int can be truncated to 39-bit unsigned without any problems.

Please change the fields "stolen bits" of UUID variant 3 to 3. The bit sequence for this variant is '111'.

Oops, thanks for pointing that out

@edo1
Copy link

edo1 commented Aug 16, 2021

Actually, I do not exclude the option of making the epoch even shorter.

last date is probably a bad name. Nothing really bad should happen when an integer overflow occurs.
The probability of collisions will increase with each cycle, but with a long timestamp, we will have about the same increased probability right now.
Timestamp wrapping is not very good in terms of data sortability and locality, but still, the situation will be much better than with a random UUID

And I don't expect a very long lifespan for 128 bit UUIDs. It is very likely that in 50 or 100 years, 256-bit identifiers (or something radically new) will be used.

@sergeyprokhorenko
Copy link

I think we could start a new epoch from 2021 instead of use of Unix time

@sergeyprokhorenko
Copy link

sergeyprokhorenko commented Aug 16, 2021

And I don't expect a very long lifespan for 128 bit UUIDs. It is very likely that in 50 or 100 years, 256-bit identifiers (or something radically new) will be used.

160-bit identifiers would be great right now. By the way, 160-bit identifiers in Crockford's base32 would be the same length as 128-bit identifiers in usual UUID string format. Therefore they are compartible.

@nerg4l
Copy link

nerg4l commented Aug 18, 2021

@edo1

[...] The target is P collisions/year: the annual collision probability when every nanosecond new UUID is generated. [...]

Is the spreadsheet correct? The "when every nanosecond new UUID is generated" part in case of ns precision would mean that every nanosecond the time is incremented and because of that the probability of collision is theoretically 0.

@broofa
Copy link
Contributor

broofa commented Aug 18, 2021

Might as well give them back to the unixts to go to 38 if that is the case.

...

  • 34 bit timestamp, giving 2 bits back to entropy/node/randomness

The way I read RFC4122, it discourages fields that are a non-integral number of octets in length. (I.e. field bit lengths should be a multiple of 4 bits. So 32, 36, 40 are okay. 34 and 38 are not.)

From the RFC:

To minimize confusion about bit assignments within octets, the UUID record definition is defined only in terms of fields that are integral numbers of octets

I know this is just expository text in the RFC, but I believe it's well-founded guidance. Sub-octet field definitions make it difficult for humans to interpret a UUID. Witness how the hex-digit containing the variant field changes based on the clockseq value. 😦

@edo1
Copy link

edo1 commented Aug 19, 2021

Is the spreadsheet correct? The "when every nanosecond new UUID is generated" part in case of ns precision would mean that every nanosecond the time is incremented and because of that the probability of collision is theoretically 0.

You are right. Initially, I assumed that n is large, so the formula does not work correctly in the edge case: n*(n-1)≉n^2 when n=1.
Thank you for pointing this out. Fixed.

Also added the second sheet with graphs, they clearly show collision probability mainly depends on bits count and on epoch length, not on timestamp/random split.

@bradleypeabody
Copy link
Contributor

In https://github.com/uuid6/uuid6-ietf-draft/blob/master/LATEST.md, I'm proposing that we use an unsigned 64-bit nano-seconds since the Unix epoch timestamp (and move the version field out of the way - so it's exactly the first 8 bytes). Max timestamp value is 2554-07-21T23:34:33.999999999Z. The rationale is:

  • Year 2554 is a reasonably long time
  • Easy to understand and implement (clean byte boundaries)
  • Doesn't introduce new Epoch (not worth changing to add 51 more years)

Thoughts?

@kyzer-davis kyzer-davis changed the title Discussion: Unix Timestamp in UUIDv7 Discussion: Unix Timestamp Granularity in UUIDv7 Feb 24, 2022
@kyzer-davis
Copy link
Contributor Author

Draft 03 Timestamp Granularity now features a one-stop-stop section of all the lessons learned from our threads on timestamp granularity.
Please give it a review and let me know if I missed something.

@edo1
Copy link

edo1 commented Apr 3, 2022

Spreadsheet has been updated again to match the final version.

In this category, the situation is clearly worse: the annual collision probability at 1,000,000,000 UUIDs per second is more than 80% (or ≈0.17% at 1,000,000 UUIDs per second).

On the other hand, at 19,500 UUIDs per second it is safe enough (the probability of a collision is no more than 1:1,000,000,000 in 50 years).
Update: there was an error in the formula, correct number in #23 (comment)

@broofa
Copy link
Contributor

broofa commented Apr 4, 2022

In this category, the situation is clearly worse: the annual collision probability at 1,000,000,000 UUIDs per second is more than 80% (or ≈0.17% at 1,000,000 UUIDs per second).

@edo1 Something seems wrong here. Why would the probability of collision increase over time for timestamp-based UUIDs?

For such UUIDs, collisions can only occur for ids with the same timestamp. I.e. collisions are scoped to a specific timestamp interval, thus, it doesn't really matter how many intervals there are. The only factor should be the UUID creation rate, since that's what effects how many UUIDs are generated per interval.

For example, with "UUID v7 final" creating 109 UUIDs / second corresponds to 106 UUIDs / timestamp interval. Doing the birthday problem math for that (106 random 73-bit values) I get a collision probability of 5.3 x 10-11.

I.e. I think "P collisions/year" should be 0.0000000053%, not 82%.

Edit to add: This is one reason why comparing v4 and v1 (or, now, v7) is always a bit of an apples-to-oranges comparison: v4 depends on total # of ids produced, v1 depends on creation rate. (Generally speaking, the former is more collision resistant until you've generated some very-large number of ids.)

@edo1
Copy link

edo1 commented Apr 5, 2022

For example, with "UUID v7 final" creating 109 UUIDs / second corresponds to 106 UUIDs / timestamp interval. Doing the birthday problem math for that (106 random 73-bit values) I get a collision probability of 5.3 x 10-11.

Correct

I.e. I think "P collisions/year" should be 0.0000000053%, not 82%.

No. This is the probability that a collision will occur per millisecond.
Every millisecond we have 0.0000000053% probability that a collision will occur, and Pms ≈ 99.9999999947% that it won't.

Now for a year:
Pyear = (Pms)1000⋅3600⋅24⋅365 ≈ 0.9999999999471000⋅3600⋅24⋅365 ≈ 0.188

The chance of not having a single collision per year is less than 20%.

@martinheidegger
Copy link

The collision chance for the random part is indeed high, but since the timestamp is part of the ID, there will be no ID collision, right?

@edo1
Copy link

edo1 commented Apr 5, 2022

@martinheidegger there will be no collisions of UUIDs with different timestamps, you are right.
But collision of UUIDs with the same timestamp is still possible.
Even a very low collision probability of 0.0000000053% can rise to more than 80% in 100⋅3600⋅24⋅365 ≈ 3⋅1010 iterations (there are 100⋅3600⋅24⋅365 ms in a year)

@martinheidegger
Copy link

Implementing UUIDv7 properly, only UUID's created within the same ms / ns will have the same timestamp?! Why try to calculate it for the timeframe of a year?

@broofa
Copy link
Contributor

broofa commented Apr 5, 2022

This is the probability that a collision will occur per millisecond
...
Now for a year...

Ah, right. Birthday problem logic still applies based on the number of time intervals. 'Forgot about that. (This isn't the first time I've made this mistake. 😝 )

@martinheidegger: I believe @edo1 is correct here. Timestamps do prevent collision across time intervals, but there's a non-zero chance the random portion of ids in the same interval will collide. Add up the odds of all the [astronomic number of] permutations where one or more intervals might contain a collision and you get the 82% 81% number.

All: So what's the consensus opinion on the fact v7 has poorer collision resistance than v1 or other versions? Worth worrying about?

We could shorten the timestamp field to 44-bits, as discussed previously. That brings the probability down to ~10%. But... does that 8x improvement from 80% -> 10% really matter? We're dealing with such extreme use cases here that improvements that are less than an order of magnitude in nature probably won't move the needle much in terms of how we do / don't satisfy user requirements.

@sergeyprokhorenko
Copy link

sergeyprokhorenko commented Apr 5, 2022

If there is only one UUID generator per database table, then the combination of the timestamp with the counter guarantees 100% that collisions will never occur. It's like autoincrement. We don't even need a pseudo-random segment.

If there are multiple generators, then adding the generator ID to the right of the UUID will give the same 100% effect. This is allowed by clause 6.9 of the published Internet-Draft.

All calculations above ignore the number of UUID generators per database table. Therefore, all these calculations are wrong. The statement that v7 has poorer collision resistance than v1 or other versions is false.

The more UUID generators per database table, the higher the chance of a collision. With a small number of UUID generators, counters are more effective at preventing collisions than pseudo-random segments. Counters act like timestamps.

@edo1
Copy link

edo1 commented Apr 5, 2022

If there are multiple generators, then adding the generator ID to the right of the UUID will give the same 100% effect. This is allowed by clause 6.9 of the published Internet-Draft.

Do you mean 6.3?

Node IDs:
With this method, a pseudo-random Node ID value is placed within the UUID layout. This identifier helps ensure the bit-space for a given node is unique, resulting in UUIDs that do not conflict with any other UUID created by another node with a different node id. Implementations that choose to leverage an embedded node id SHOULD utilize UUIDv8.

Or do you mean increasing the UUID length with additional fields? Yes, it can solve this problem. But this breaks the idea of compatibility with RFC4122, which is already implemented in DBMS, e.g. PostgreSQL uuid, MS SQL uniqueidentifier.

@sergeyprokhorenko
Copy link

Do you mean 6.3?

No. I mean 6.9: "DBMS vendors are encouraged to provide functionality to generate and store UUID formats defined by this specification for use as identifiers or left parts of identifiers...". It's the recommended UUIDv7

Or do you mean increasing the UUID length with additional fields?

I mean the extra segments (additional random segment + metadata) to the right of the UUID in the long concatenated identifier.

But this breaks the idea of compatibility with RFC4122

Not always. It depends on the circumstances. The database developer can allocate a field in the database table for the long concatenated identifier, or he/she can break the long identifier into UUID and metadata. Сompatibility is not always good. I previously wanted a 160-bit UUID with metadata inside, but a concatenated identifier is an even more flexible technical solution. UUID is a standardized feature of the UUID generator (of the DBMS), while concatenated identifier is a custom feature of a particular database table.

We should not adapt to the limitations of the current DBMS. On the contrary, DBMS developers will have to adapt to new UUID versions.

@kyzer-davis
Copy link
Contributor Author

kyzer-davis commented Apr 6, 2022

"1,000,000,000 UUIDs per second" collision test

I would assume the crux of the issue is UUIDv7 features MS resolution while UUIDv1 has 100-NS. With this we are paused on the timestamp for longer while the rest of the UUID entropy is doing it's thing. That is, UUIDv7's timestamp bit does not advance as fast as UUIDv1 which means more entropy rolls per timestamp bit tick. Thus more rolls = more collision probability.

The problem that I see with this test is that the last column shows "Safe UUIDs per second" favors UUIDv7 over UUIDv1. This confused me a tiny bit, how can UUIDv7 beat UUIDv1 per second but lose the race over time? Has been fixed.

What does this look like if we say: "generate 1 million UUID per timestamp bit tick" so the actual real-world time does not matter and we are comparing entropy characteristics only. This would be a fair comparison since both timestamps are paused for 1 million random entropy generations. Run that test on both the same N number of times to keep it fair. My theory is that this has to favor UUIDv7 because I have a hard time wrapping my head around how 48 bit timestamp vs 60 bit timestamp where there are less entropy bits comes out to worse entropy characteristics when there are more bits dedicated to that that in UUIDv7 vs UUIDv1.


Thinking about 44 vs 48 @broofa discussed: It is true, we are talking about the extremes. I don't see why we further reduce the timestamp in favor of more entropy which counteracts the point above where we now pause on each tick longer increasing entropy rolls and thus more chances for a collision. Could the other method prove better? e.g. Take Unix TS to 64-bit Nanoseconds and reduce entropy bits? This means the TS will advance faster and we roll entropy less each tick.


Now that all being said, remember that we should be using an embedded counter from Section 6.2 for same-timestamp batch UUIDv7 creation and this is one of the perfect scenarios to illustrate it's usefulness. Here each counter tick within the frozen timestamp ensures we have a unique bit-space for the UUID entropy generation (with bonuses on storability as a secondary characteristics.)

Let's run through a scenario:
In our system we have a need to create up to 1,000,000 UUIDs per timestamp MS tick (even more granular than your per second test).

In this scenario implementers should select an increment method from Draft 03:

  • An embedded counter of 20 bits should be sufficient here but let's use N-1 (21-1) with topmost bit as rollover counter
  • Monotonic random of 74 bits (rand_a, rand_b)

Then select an increment type. (To keep things easy let's say plus one increment type)

Embedded counter using N-1 rollover guard and plus one increment:

First UUID

  • Timestamp MS Tick
  • Generate Random Counter 20 bits in least-significant right-most position. 21st left-most rollover guard 0
  • Generate random

Second UUID

  • Timestamp same as last
  • Increment counter by 1 in least significant, right-most, position.
  • Generate new random

...Repeat for next set of UUID up to the millionth or timestamp moves forward. Incrementing rollover counter where required.

Alternative method for random as counter with plus one increment type:

First UUID

  • Timestamp MS Tick
  • Generate Random

Second UUID

  • Timestamp same as last
  • Increment previous random by 1

...Repeat for next set of UUID up to the millionth or timestamp moves forward.

@edo1
Copy link

edo1 commented Apr 6, 2022

The problem is that the last column shows "Safe UUIDs per second" favors UUIDv7 over UUIDv1. This confused me a tiny bit, how can UUIDv7 beat UUIDv1 per second but lose the race over time?

OMG! There was a foolish error in the formula (the wrong row number was substituted during copy-paste).
Fixed, thank you.

Only ≈3 UUIDs per millisecond can be considered safe (less than 1:1,000,000,000 collision chance in 50 years).

@kyzer-davis
Copy link
Contributor Author

kyzer-davis commented Apr 6, 2022

Thanks @edo1,

One other problem in the spreadsheet: UUIDv7 in Draft 03 has 74 bits of entropy (rand_a (12) and rand_b (62)) and your document has 73. This may be a bad copy paste from the ol' E variant which stole a bit form entropy but is no longer relevant.
https://www.ietf.org/archive/id/draft-peabody-dispatch-new-uuid-format-03.html#name-uuid-version-7

That being said, I think the overall test is moot because it ignores the other characteristics of the UUIDv7 draft which solve the collision problems via methods other than increasing/decreasing timestamp or entropy.

That is, section 6.2 solves bulk UUID generation within a single timestamp via embedded counter logic. If we are doing bulk gen as this test implied then we should be selecting a counter method and increment type. Assuming the counter does not overflow, we should have 0 collisions.

Furthermore, Section 6.3 can be layered onto 6.2 and solve the problem for adding N number of nodes. Assuming everybody has a unique ID; our bitpace does not conflict with anybody else within the application context removing collisions.

@edo1
Copy link

edo1 commented Apr 7, 2022

One other problem in the spreadsheet: UUIDv7 in Draft 03 has 74 bits of entropy (rand_a (12) and rand_b (62)) and your document has 73.

I meant the mechanism from this paragraph:
Implementations utilizing fixed-length counter method MAY also choose to randomly initialize a portion counter rather than the entire counter. For example, a 24 bit counter could have the 23 bits in least-significant, right-most, position randomly initialized. The remaining most significant, left-most counter bits are initialized as zero for the sole purpose of guarding against counter rollovers.

That being said, I think the overall test is moot because it ignores the other characteristics of the UUIDv7 draft which solve the collision problems via methods other than increasing/decreasing timestamp or entropy.
That is, section 6.2 solves bulk UUID generation within a single timestamp via embedded counter logic.

Of course, in the case of the centralized UUID generation, it is easy to avoid collisions.
Much more interesting is the case of distributed UUID generation on autonomic nodes.

Assuming everybody has a unique ID

In many cases, it is impossible/nearly impossible to assign each node a short unique ID.
E. g. a smartphone app collects some data in offline mode and sends it to the server a day after. At some point, 1,000,000 new users may decide to install this application. Or someone might decide to run 1,000,000,000 instances of the application in a simulator.

And doesn't this paragraph mean that UUID v7 cannot use embedded node ID?
Node IDs:
With this method, a pseudo-random Node ID value is placed within the UUID layout. This identifier helps ensure the bit-space for a given node is unique, resulting in UUIDs that do not conflict with any other UUID created by another node with a different node id. Implementations that choose to leverage an embedded node id SHOULD utilize UUIDv8.

@kyzer-davis
Copy link
Contributor Author

@edo1, great discussion.
I do love the data modeling and probability checks! Don't think I am discouraging this type of work.

Personally I would love something we could provide to folks that can be used to test the "required number of bits for fixed-length counter length within their real-world application" e.g "I want to be able to generate 1,000,000 UUIDs per MS timestamp Tick" so on paper using 21 bit fixed-length counter along with N-1 Rollover Guard should be sufficient; but in reality the application compute can only generate 500,000 per MS tick resulting in some wasted counter bits that could have been better allocated to entropy.


My key points on the counters:

  • The N-1 Rollover guard as the reason for 73 vs 74 assumes you are using a counter. From what I can tell, this test is not. So for your purpose you would want to set it to 74.
  • A bit more clarity on N-1 and the rollover guard:
    • That N-1 Rollover guard would mainly be the fixed-length dedicated between timestamp and entropy (timestamp|counter|entropy). Here things are a bit more cramped and we need to ensure counter can be randomly initialized but still have room for the desired number of increments.
    • With random as a counter but this is already somewhat baked in based on how this counter would function. e.g you increment the lowest bit(s) and any higher bits are already rollover counters. For a plus one type A increment we have 73 higher order bits for rollover. For a Type B, random increment (N-<Entropy_Length>) and assuming 20 least sig, right-most bits are rolled as entropy then added to the previous counter; we have 54 rollover bits.

Distributed Node Generation:

  • Yes, true, this does pose a problematic scenario if you use raw timestamp|entropy as the only values. A counter should still be used when doing batch, even if the batches are across disparate nodes to avoid collisions on your own node.
  • You can use Node IDs in v7, this is a SHOULD not a MUST. The logic of steering towards v8 is in flexibility provided by that layout. If you are using node IDs for distributed node generation you may also want to tweak other things for example our case here: lengthen the desired epoch timestamp to 64 bits at NS resolution for less "entropy pause and roll" before the timestamp moves on. Then layer in other desired aspects like counters, node IDs, and entropy length to have your own standard compliant time-based UUID.

@sergeyprokhorenko
Copy link

SHOULD utilize UUIDv8 looks like a ban for version 7. A more unambiguous phrase is desirable.

@kyzer-davis
Copy link
Contributor Author

@sergeyprokhorenko, Unfortunately my verbiage in this document is highly restricted by RFC2119; of which the descriptions are anything but unambiguous.

SHOULD seems to fit the bill for what I am trying to convey but if there is a reason to convert it to MAY that is also fine with me. Both copied from RFC 2119 for reference below:

3. SHOULD This word, or the adjective "RECOMMENDED", mean that there
may exist valid reasons in particular circumstances to ignore a
particular item, but the full implications must be understood and
carefully weighed before choosing a different course.

5. MAY This word, or the adjective "OPTIONAL", mean that an item is
truly optional. One vendor may choose to include the item because a
particular marketplace requires it or because the vendor feels that
it enhances the product while another vendor may omit the same item.
An implementation which does not include a particular option MUST be
prepared to interoperate with another implementation which does
include the option, though perhaps with reduced functionality. In the
same vein an implementation which does include a particular option
MUST be prepared to interoperate with another implementation which
does not include the option (except, of course, for the feature the
option provides.)


The other point I am trying to make here is that UUIDv8 is a totally valid use case for edge scenarios like this test is conveying. UUIDv7 in Draft 03 works for majority of use cases. If an application implementer vets v7 and needs make some changes, there is nothing stopping them from forking v7 into v8, making those changes and calling it a day. UUIDv8 gives them the opportunity to do what they please while having a standard compliant UUID.

@sergeyprokhorenko
Copy link

@kyzer-davis, There is a good example in section 6.9. DBMS and Database Considerations: DBMS vendors are encouraged to...

You can use it for this text: "Implementations that choose to leverage an embedded node id are encouraged to utilize UUIDv8 for more flexible layout. They MAY also choose to concatenate the UUID with the node id (and other elements) on the right into a single identifier".

Italics is my addition.

@edo1
Copy link

edo1 commented Apr 8, 2022

You can use Node IDs in v7, this is a SHOULD not a MUST

You are right, of course. Sorry.

Anyway, I dislike the idea of embedded ID as a generic approach, it is complicated and/or error-prone (requires proper ID length size selection, correct unique ID generation, ID reuse, and so on). It might be used in some specific cases, of course.

Personally I would love something we could provide to folks that can be used to test …

I tried to say it in #60 (comment)
Tunable counter length, method, increment types, guard bits, embedded ID length, etc create a mess.
No one will use a UUID generator with 100500 customizable parameters that requires reading a 500-page manual.

IMO should be 3 main options (golang is used in examples):

  1. Monotonicity is not required at all (fully-random UUID v4);
uuid, err := uuid.NewV4()
  1. Monotonicity is desired for performance reasons (UUID v7 without counter);
uuid, err := uuid.NewV7()
  1. Local monotonicity/collision-free is guaranteed (UUID v7 with counter).
uuidGenerator, err := uuid.NewMonotonyGenerator()

uuid, err := uuidGenerator.NewV7()

For PostgreSQL it could be uuid_generate_v4() (already implemented), uuid_generate_v7_fast() and uuid_generate_v7().

This will handle almost all cases of UUID generation. Only rarely is a more sophisticated approach required.
For now, the standard draft has no explicit recommendations for those main use cases.

@edo1
Copy link

edo1 commented Apr 8, 2022

Added UUID v7 with 74-bit randomness into the spreadsheet, it's the penultimate prone to collisions.

@edo1
Copy link

edo1 commented Apr 8, 2022

UUIDv8 gives them the opportunity to do what they please while having a standard compliant UUID.

Why I prefer UUID v7: UUIDs from different origins are still monotonic/almost monotonic. UUID v8 makes no guarantees at all.

@sofigio
Copy link

sofigio commented Jul 16, 2023

Spreadsheet has been updated again to match the final version.

I read most of the conversation, I'm trying to figure out if I could switch from fully-random (16 random bytes) uuids to v7, however am I reading it right? The safe value is as low as 5000 UUIDs per second?

image

Why did they choose to allocate so much space to the timestamp? It looks very irrational, considering that we already had 8 uuid releases in 20 years, to think that this one will be the ultimate version that will last 8.000 years...

@sergeyprokhorenko
Copy link

sergeyprokhorenko commented Jul 16, 2023

sofigio, I advise you to look at more advanced options of UUIDv7: x4m/pg_uuid_next#1 (comment) . Even without a random part at all (alike autoincrement), they guarantee the absence of collisions and the high speed of generating UUIDs. But they also have a fairly long random part too. All lazy programmers rushed to implement the most primitive UUIDv7 structure.
UUID structure

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Discussion Further information is requested UUIDv7 All things UUIDv7 related
Projects
None yet
Development

No branches or pull requests

9 participants