Commit c3a7966: Create snowflakeid (1 parent 688abf6)
1 file changed: 122 additions & 0 deletions
---
title: "SnowflakeID for Efficient Primary Keys"
linkTitle: "SnowflakeID"
weight: 100
description: >-
  SnowflakeID for Efficient Primary Keys
---
In data warehousing (DWH) environments, the choice of primary key (PK) can significantly impact performance, particularly RAM usage and query speed. This is where [SnowflakeID](https://en.wikipedia.org/wiki/Snowflake_ID) comes into play, providing a robust solution for PK management. Here is a deep dive into why Snowflake IDs are beneficial, with practical implementation examples.

### Why Snowflake ID?

- **Natural IDs Suck**: natural keys derived from business data can lead to complexity and instability. Surrogate keys, on the other hand, are system-generated and stable.
- Surrogate keys simplify joins and indexing, which is crucial for performance in large-scale data warehousing.
- Monotonic (sequential) IDs preserve the insertion order of entries, which is essential for performance tuning and efficient range queries.
- Having both a timestamp and a unique ID in the same column allows fast filtering of rows during SELECT operations, which is particularly useful for time-series data.
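The last point can be illustrated with a short Python sketch (not part of the original article; the 42/22 bit split and the epoch constant are illustrative assumptions). Packing a millisecond timestamp into the high bits makes IDs compare the same way their timestamps do, so a range filter on the primary key is also a time filter:

```python
EPOCH_MS = 1288834974657  # the classic Twitter Snowflake epoch (an assumption here)

def make_snowflake(ts_ms: int, low22: int) -> int:
    """Pack a millisecond timestamp (high 42 bits) and a 22-bit unique value
    (low bits) into one 64-bit integer, the standard Snowflake layout."""
    return ((ts_ms - EPOCH_MS) << 22) | (low22 & 0x3FFFFF)

def snowflake_time_ms(sid: int) -> int:
    """Recover the millisecond timestamp: the inverse of the packing above."""
    return (sid >> 22) + EPOCH_MS

# A later timestamp always wins, even against maximal low bits,
# so sorting by ID is sorting by time.
a = make_snowflake(1_700_000_000_000, 0x3FFFFF)
b = make_snowflake(1_700_000_000_001, 0)
assert a < b
assert snowflake_time_ms(a) == 1_700_000_000_000
```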
### Building Snowflake IDs

There are two primary methods to construct the lower bits of a Snowflake ID:

1. **Hash of important columns**: applying a hash function to significant columns ensures uniqueness and a good distribution.

2. **Row number in the insert batch**: using the row number within data blocks provides a straightforward way to generate unique identifiers.
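Both strategies can be sketched in Python (the helper names are hypothetical, and `blake2b` stands in for a generic 64-bit hash such as cityHash64):

```python
from hashlib import blake2b
from itertools import count

def low_bits_from_hash(key: str, bits: int = 22) -> int:
    """Method 1: fold a 64-bit hash of the important columns down to `bits` bits."""
    h = int.from_bytes(blake2b(key.encode(), digest_size=8).digest(), "big")
    mask = (1 << bits) - 1
    # sum three slices of the hash, then mask back to the target width
    return ((h & mask) + ((h >> bits) & mask) + ((h >> 2 * bits) & mask)) & mask

def row_number_source(bits: int = 22):
    """Method 2: hand out the row number within the insert batch."""
    counter = count()
    mask = (1 << bits) - 1
    return lambda: next(counter) & mask

next_low = row_number_source()
assert next_low() == 0 and next_low() == 1       # sequential within a batch
assert 0 <= low_bits_from_hash("user-42") <= 0x3FFFFF
```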
### Implementation as UDF

Here is how to implement Snowflake IDs using standard SQL functions, with both second and millisecond timestamp precision.

Pack the hash into the lower 22 bits for DateTime64 and the lower 32 bits for DateTime:

```sql
-- DateTime64 (millisecond precision): fold the 64-bit hash into 22 bits
create function toSnowflake64 as (dt, ch) ->
    bitOr(dateTime64ToSnowflakeID(dt),
          bitAnd(bitAnd(ch, 0x3FFFFF) +
                 bitAnd(bitShiftRight(ch, 20), 0x3FFFFF) +
                 bitAnd(bitShiftRight(ch, 40), 0x3FFFFF),
                 0x3FFFFF)
    );

-- DateTime (second precision): fold the 64-bit hash into 32 bits
create function toSnowflake as (dt, ch) ->
    bitOr(dateTimeToSnowflakeID(dt),
          bitAnd(bitAnd(ch, 0xFFFFFFFF) +
                 bitAnd(bitShiftRight(ch, 32), 0xFFFFFFFF),
                 0xFFFFFFFF)
    );

-- round trip: build an ID, then recover the timestamp from it
with cityHash64('asdfsdnfs;n') as ch,
     now64() as dt
select dt,
       hex(toSnowflake64(dt, ch) as sn),
       snowflakeIDToDateTime64(sn);

with cityHash64('asdfsdnfs;n') as ch,
     now() as dt
select dt,
       hex(toSnowflake(dt, ch) as sn),
       snowflakeIDToDateTime(sn);
```
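The folding inside the UDFs can be checked outside ClickHouse. This Python mirror (an illustrative aid, not part of the article) reproduces the same masks and shifts and confirms the result always fits in the reserved low bits, so it never spills into the timestamp bits:

```python
def fold22(ch: int) -> int:
    """Mirror of toSnowflake64's hash folding: slices taken at shifts
    0, 20 and 40 (as in the UDF) are summed, then masked to 22 bits."""
    m = 0x3FFFFF
    return ((ch & m) + ((ch >> 20) & m) + ((ch >> 40) & m)) & m

def fold32(ch: int) -> int:
    """Mirror of toSnowflake's folding of a 64-bit hash into 32 bits."""
    m = 0xFFFFFFFF
    return ((ch & m) + ((ch >> 32) & m)) & m

for ch in (0, 1, 0xDEADBEEFCAFEBABE, 2**64 - 1):
    assert 0 <= fold22(ch) <= 0x3FFFFF
    assert 0 <= fold32(ch) <= 0xFFFFFFFF
```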
### Creating Tables with Snowflake ID

**Using materialized columns and a hash**

```sql
create table XX
(
    id    Int64 materialized toSnowflake(now(), cityHash64(oldID)),
    oldID String,
    data  String
) engine = MergeTree order by id;
```

Note: using User-Defined Functions (UDFs) in CREATE TABLE statements is not always convenient, because they are expanded inline in the table DDL, which makes changing them later inconvenient.
**Using a Null table, a materialized view, and rowNumberInAllBlocks**

A more efficient approach uses a Null table and a materialized view:

```sql
create table XX
(
    id   Int64,
    data String
) engine = MergeTree order by id;

create table Null (data String) engine = Null;

create materialized view _XX to XX as
select toSnowflake(now(), rowNumberInAllBlocks()) as id, data
from Null;
```
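Conceptually, the materialized view stamps every row of an inserted block with an ID built from the insert time and the row's position. A small Python model of that transform (hypothetical names, simplified packing):

```python
def to_snowflake(ts_ms: int, low: int) -> int:
    # simplified packing: timestamp in the high bits, 22 unique bits below
    return (ts_ms << 22) | (low & 0x3FFFFF)

def mv_stamp(batch, ts_ms):
    """Model of the materialized view: each row gets
    toSnowflake(now(), rowNumberInAllBlocks())."""
    return [(to_snowflake(ts_ms, i), data) for i, data in enumerate(batch)]

rows = mv_stamp(["a", "b", "c"], 1_700_000_000_000)
ids = [r[0] for r in rows]
assert ids == sorted(ids) and len(set(ids)) == 3  # unique, monotonic in the batch
```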
### Converting from UUID to SnowflakeID for subsequent events

Suppose your event stream identifies a particular user only by a UUID column. The registration time that can serve as the base for a SnowflakeID is present only in the first 'register' event, not in subsequent events. Generating a SnowflakeID for the register event is easy, but for later events we need to fetch it from another table without disturbing the ingestion process too much. Hash JOINs in materialized views are not recommended, so we need something like a "nested loop join" to get the data fast. ClickHouse still does not support nested loop joins, but a Direct dictionary can work around that.

```sql
CREATE TABLE UUID2ID_store (user_id UUID, id UInt64)
ENGINE = MergeTree() -- EmbeddedRocksDB can be used instead
ORDER BY user_id
SETTINGS index_granularity = 256;

CREATE DICTIONARY UUID2ID_dict (user_id UUID, id UInt64)
PRIMARY KEY user_id
LAYOUT(DIRECT())
SOURCE(CLICKHOUSE(TABLE 'UUID2ID_store'));

CREATE OR REPLACE FUNCTION UUID2ID AS (uuid) -> dictGet('UUID2ID_dict', 'id', uuid);

CREATE MATERIALIZED VIEW _toUUID_store TO UUID2ID_store AS
select user_id, toSnowflake64(event_time, cityHash64(user_id)) as id
from Actions;
```
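The Direct dictionary performs a point lookup into UUID2ID_store on every call. A Python model of the overall flow (illustrative only; names are hypothetical):

```python
store = {}  # stands in for UUID2ID_store

def register(user_id: str, event_time_ms: int, user_hash: int) -> None:
    """The materialized-view path: the 'register' event writes uuid -> SnowflakeID."""
    store[user_id] = (event_time_ms << 22) | (user_hash & 0x3FFFFF)

def uuid2id(user_id: str):
    """The UDF path: a point lookup, like dictGet on a Direct dictionary."""
    return store.get(user_id)

register("u-1", 1_700_000_000_000, 12345)
assert uuid2id("u-1") == (1_700_000_000_000 << 22) | 12345
assert uuid2id("unknown") is None  # no mapping yet for unseen users
```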
**Conclusion**

Snowflake IDs provide an efficient mechanism for generating unique, monotonic primary keys, which are essential for optimizing query performance in data warehousing environments. By combining a timestamp and a unique identifier in a single column, Snowflake IDs enable faster row filtering and stable surrogate key generation. Implementing them with SQL functions and materialized views keeps your data warehouse performant and scalable.
