general requirements for a typical unique ID generator system:
- uniqueness, of course. every ID should be unique and there should be no collisions.
- distributed. multiple systems should seamlessly maintain uniqueness in a distributed environment without collisions.
- low latency. the generation should be quick, able to generate thousands of IDs per second.
- ordering or you can say monotonicity. this is optional but the generations are often required to be time-ordered.
- compact. IDs are often required to fit in under specific limits. for example, 64-bits.
- different configurations. sometimes we need the ID to be all numbers and sometimes, alpha-numeric.
the above are some (not all) of the configurations we can expect in a unique ID generator based on the requirements of the application.
we have several different options for generation of unique IDs:
- database auto-increment sequences.
- multi-server replication with step increments.
- universally unique identifiers (UUID).
- ticket servers.
- twitter's snowflake.
- universally unique lexicographically sortable identifiers (ULID).
- truetime API (google spanner).
- k-sortable unique IDs (KSUID).
there are several others but not so popular.
let us discuss one by one:
Database auto-increment sequences:
most straight-forward. it has auto-increment set and whenever an ID needs to be generated, it will be just +1 of the previous ID. this thing actually comes default with the database, mostly with relational databases.
- usually small integers (32-bit or 64-bit).
- IDs are always monotonically increasing. so, the ID generated at a later time always will be greater than the one generated before. hence time-ordering is true.
- obviously no duplicates. databases can maintain these with concurrency.
so, what's the catch??
- can't handle distributed environment.
- SPOF (single point of failure). when the instance of DB is down, ID generation halts.
- the IDs are generally predictable, as they are just auto-incremented by 1. not a major application-end issue, but a security concern in sensitive scenarios.
used in traditional relational applications such as CRM, financial systems, etc.
Multi-server replications with step increments:
this is like a direct big sister of the above method. adapts the same idea but for a distributed environment. instead of auto-increment by 1, you increment by the number of servers in the distributed environment.
suppose if you have n servers, each server will be assigned with an offset in the range [1, n]. each server generates IDs by incrementing the assigned offset by n for each new ID generation.
for example, with 3 servers:
-
Server 1 IDs:
1, 4, 7, 10 … -
Server 2 IDs:
2, 5, 8, 11 … -
Server 3 IDs:
3, 6, 9, 12 … -
no coordination required between the servers. hence, very less overhead.
-
usually compact (32-bit or 64-bit).
so, what's the catch??
- no global-ordering. no time-ordering. because, the ID generation is completely local and some higher sequence number can be generated in one server before another server even started generating IDs.
- re-balancing problem. adding or removing servers would require updating offsets and step.
- and of course, these are predictable and hence may raise security concerns.
used in early sharded systems.
Universally Unique Identifiers (UUID):
a decentralized unique ID scheme. almost available in every development environment. there are multiple versions in UUID each with it's own improvements over the relative older versions.
it is a 128-bit identifier. usually shown as 36 character alpha-numeric string. (hex + dashes)
variants (not important, you can skip this small part):
v1:
-
timestamp + MAC address + clock sequence.
-
partially ordered by time. (we need to extract time separately in order to maintain time-ordering hence, partially sortable.) v4:
-
most common one. random ID generation. 122 bits.
-
very low collision probability.
-
no time-ordering. v7:
-
unix time (epoch time - 48 bits) + randomness (76 bits).
-
time-ordering is true as we maintain sequence counter when two IDs are generated in the same millisecond.
-
decentralized. machines are independent, they generate their own random IDs and in most cases, uniqueness is maintained.
-
more bit space hence, so many possibilities. hence, very less collision probability.
so, what's the catch??
- 128 bit is a bit more (actually 64 bits more). larger space, slower joins.
- no time-ordering in v4.
- not enough randomness in v1.
v1 leaks machine identity and v7 has no time-ordering but introduces extra overhead of maintaining clock synchronization and other complexities. hence, in general use-cases where ordering doesn't matter and 128-bit is quite okay, systems often use UUID v4.
Ticket servers:
this is like a younger sister to the auto-increment approach. we maintain a single server which maintains the entire unique ID generation. whenever a server in a distributed environment requires an ID, it makes a request to the ticket server and ticket servers increments and passes a new ID.
- IDs are time-ordered.
- usually 64-bit, maintainable and compact.
so, what's the catch??
- of course, single point of failure. if ticket server is down, there the ID generation halts.
- in a larger distributed system, huge number of requests going through a single server leads to contention.
- again, if we want to have multiple ticket servers, that would require additional configurations such as leader-follower pattern, etc.
Twitter's snowflake:
twitter has introduced snowflake ID generation which generates 64-bit IDs. each ID contains multiple parts.
| 1 bit | 41 bits timestamp | 10 bits machine config | 12 bits sequence |
first bit is added for any future use, probably as a sign-bit. the next 41 bits refers to the timestamp. we take an epoch and we will get the timestamp (mostly in ms) relative to the epoch. next 10 bits includes the data-center and the machine configurations (usually 5 bits for each). the last 12 bits refers to the sequence. it refers to the sequence number of the current ID generated in that very millisecond on that machine in that data-center. which means, a machine can generate 4096 IDs in a millisecond.
- if a node can have 4096 IDs per ms, then a node can generate around 4 million IDs per second. high throughput.
- time-ordering is true. as we are maintaining the timestamp as part of the metadata of an ID.
- decentralized. no single point of failure.
- 64-bit integers. quite compact and maintainable.
so, what's the catch??
- clock synchronization should be maintained throughout every node in a distributed environment. or else, the time-ordering fails.
- have to mention that the lifetime is limited. because, the timestamps are still from an epoch and the bit allocation is limited to 41 bits.
- of course, some security concerns because metadata is exposed through the IDs.
Universally Unique Lexicographically Sortable Identifiers (ULID):
they are the sortable alternative to the UUID v4. they are also 128-bit but, they are divided into two parts:
- first 48-bits: timestamp (since an epoch).
- following 80-bits: randomness.
uses crockford's base-32 algorithm to generate 26 length alpha-numeric strings.
- many possibilities in a single millisecond. per millisecond, we have possibilities. highly scalable.
- improved compactness when compared to UUID. 26 characters < 36 characters - avoids dashes.
- lifetime is limited but very large. possibilities, longer than the lifespan of the universe (yes, when compared to snowflake approach, 7 bits make a very large impact).
so, what's the catch??
- no time-ordering. yes, we maintain timestamp, but the IDs generated in the same millisecond still remain random and scattered.
- more overhead in the generation part.
- still not quite compact as it is 128 bits.
TrueTime API (Google Spanner):
truetime api is a functionality used by the google spanner database service. truetime represents time as an interval of uncertainty instead of actual timestamp. it gives external consistency and strict serializability.
spanner uses the truetime api for the timestamp records. it returns the time interval which surely contains the absolute time (exact time at which the call made).
technically, TT.now() returns an interval [earliest, latest]. this surely contains the absolute time.
if then where is the absolute time at which the event has happened. there is also some uncertainty bound maintained that determines the interval, it is usually represented with . we call this interval as uncertainty window or interval. in general, this interval is incredibly narrow mostly limited to a few milliseconds.
time references here are actually GPS and atomic clocks. they are deployed at all google data-centers. why two modes of references, you may ask? because, to maintain distinguished fault tolerances. GPS may be impacted by faulty radio frequencies but atomic clocks won't be impacted by the same.
why interval instead of just timestamps?
this is to maintaining high consistency by leveraging the intervals instead of absolute time. systems wait until the time interval for a particular to elapse before committing the transaction. systems make use of uncertainty window or uncertainty interval in order to maintain high consistency.
for example,
- system A has a deposit transaction of Rs. 10 into an account with balance Rs. 100. so, system A fetches the uncertainty interval from the true time api. it gives an interval
[10:00, 10:10].10:00is the earliest and10:10is the latest. for any record, we will always consider the latest as the timestamp. hence,10:00will be considered as the timestamp. (these are not actual timestamps, actual timestamps are usually millisecond count from an epoch and also differs by a few milliseconds.) - now, at the very time system B has a withdraw request of Rs. 110. now, system B may have gotten an interval overlapping with the interval that is presented to the system A in the above transaction. for instance, time interval is
[10:05, 10:15]. - generally, if we maintain a direct timestamp, this may result in a conflict. if the withdrawal request was committed before the first deposit request then, the withdrawal request may have happened on the lower balance. but, system B won't commit it's status.
- it waits till the latest time of the transaction elapses. meanwhile, system A commit it's transaction as it's latest time elapses increasing the balance. now when system B commits the withdrawal will be successful.
this is a very naive example how the spanner database works utilizing the true time api. this method works extremely in a large distributed systems with a number of datacenters.
so, what's the catch? actually, this approach is considered to be near to perfect but,
- it requires specialized hardware with GPS and atomic clocks with necessary configurations.
- lesser complex than eventual consistent systems.
- even though, the commit wait period in these type of systems is very small, it is something to be mentioned as it doesn't have high throughput as some systems with extremely lower latency.
K-Sortable Unique IDs:
they consist of two parts:
- a 32 bit unsigned UTC timestamp.
- a 128 bit randomly generated payload.
the text representation is always 27 characters encoded in alphanumeric base62 that will lexicographically sort by timestamp.
- better than the uuid v4 as this contains partial time-ordering.
- better than the uuid v1 as this contains far more randomness.
- highly portable because both binary representation and text representation are sortable.
- is less complex as it doesn't require any coordination between the systems because of it's higher randomness. no collisions whatsoever.
- k represents the error which may present in the time-ordering. the actual position of an ID in the sequence may differ by at most k positions.
so, what's the catch??
- takes a very large space. 160 bits.
- has clock dependency. required accurate system clock.
the most common choices for unique ID generator in 2025 are (according to me, just based on the requirements):
- UUID v7 as it is time-ordered and also almost collision free. but, requires higher space, 128-bit and also clock synchronization across machines to maintain ordering.
- UUID v4 still a common choice i guess when you don't require any time-ordering. it has close to none collision.
- snowflake IDs. is 64-bit, quite compact. high throughput. but requires this coordination between machines and also clock synchronization for both ordering and uniqueness.
- true time api for high consistency maintaining near perfect generations. but, it is proprietary to google.
References
- https://code.flickr.net/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/
- https://en.wikipedia.org/wiki/Universally_unique_identifier
- https://sookocheff.com/post/time/truetime/
- https://github.com/segmentio/ksuid
- System Design Interview - Alex Xu