Design a Unique ID Generator

Overview

Design a unique ID generator that works in a distributed environment
A traditional database with an auto_increment primary key is difficult to scale

Questions:

q: What are the requirements for our unique IDs?
- a: must be unique and sortable
q: Do our IDs increment by 1 for each new record?
- a: IDs increment by time (IDs generated earlier are smaller)
q: Do IDs only contain numbers?
- a: Yes
q: What is the ID length requirement?
- a: IDs should fit into 64 bit

We will evaluate the following options:

Auto increment each new ID by the previous_id in current master + k (number of master databases, 2 in this example)
flowchart LR SQL_1[SQL db 1]--"1,3,5,..."-->b[Web servers] SQL_2[SQL db 2]--"2,4,6,..."-->b
Cons:
- Doesn’t respond well to servers being added or removed
- IDs across the system can’t be sorted by time
- Difficult to scale with multiple data centers

UUIDs (128 bit label) are an easy way to create unique IDs in a distributed environment with no coordination needed
Example UUID: 62143e84-86ba-11ed-9163-0242ac1c000c

Pros	Cons
✅ No coordination needed	❌ `UUIDv6 & below` can’t be sorted by time. (`UUIDv7` is sortable by time)
✅ Easily scalable	❌ `128 bit` > `64 bit`
	❌ not a number

Developed by Flikr to generate distributed primary keys
flowchart LR TS[Ticket Server 1]-->WS1[Web server] TS-->WS2[Web server] TS-->WS3[Web server]
Centralized auto_increment feature in a single database server

Pros	Cons
✅ Numeric IDs	❌ Single point of failure (ticket server)
✅ Easy to implement	❌ Multiple ticket server implement would experience synchronization issues

Employ a divide & conquer strategy to split the ID into 5 different segments
flowchart TD a["sign bit (1 bit)"]---b["timestamp (41 b its)"] b---c["data center id (5 bits)"] c---d["machine id (5 bits)"] d---e["sequence number (12 bits)"]
Sign bit
- sign bit for future uses (0 for now)
Timestamp
- 41 bits –> 2^41 - 1 = 2199023255551 milliseconds
- Time starts from Twitter snowflake default epic Nov 04, 2010, 01:42:54 UTC, UNIX timestamp 1288834974657
- 41 bit Timestamp = 1288834974657 + (0 –> 2199023255551)
- This means that there is room for approximately 69 years of datestamps from the Twitter snowflake default epic
  - 2199023255551 / (3600 * 24 * 1000 * 365) = 69.73 years
  - Timestamps will need to be migrated to another method after this time has elapsed
Data center ID
- 2^5 (32) data centers
Machine ID
- 2^5 (32) machines / data center
Sequence Number
- 2^12 = 4096 possible sequence numbers per millisecond starting from 1
- The sequence number is reset to 0 each millisecond

The Twitter snowflake approach assumed that servers had the same clock. In practices, clocks on difference servers need to by synchronized. Network time protocol (NTP) is a popular choice for clock synchronization
If the application is expected to survive for a long time and low concurrency is acceptable, the sequence number bit size can be reduced and the timestamp can be increased
- NTP can usually maintain synchronization within 10s of milliseconds on the open internet and less than 1 ms over a LAN. Asymmetric routes and network congestion can cause error > 100 ms
- NTP sends/received timestamps over UDP on port 123
- NTP does not send timezone or daylight savings time data
The system must be highly available