
Commit b5ed379

update paimon

1 parent f92fa2c

2 files changed: 115 additions & 57 deletions


website/docs/maintenance/tiered-storage/lakehouse-storage.md

Lines changed: 2 additions & 2 deletions
@@ -35,7 +35,7 @@ datalake.paimon.metastore: filesystem
 datalake.paimon.warehouse: /tmp/paimon
 ```
 
-Fluss processes Paimon configurations by removing the `datalake.paimon.` prefix and then use the remaining configuration (without the prefix `datalake.paimon.`) to create the Paimon catalog. Checkout the [Paimon documentation](https://paimon.apache.org/docs/1.3/maintenance/configurations/) for more details on the available configurations.
+Fluss processes Paimon configurations by removing the `datalake.paimon.` prefix and then uses the remaining configuration (without the prefix) to create the Paimon catalog. Check out the [Paimon documentation](https://paimon.apache.org/docs/$PAIMON_VERSION_SHORT$/maintenance/configurations/) for more details on the available configurations.
 
 For example, if you want to configure to use Hive catalog, you can configure like following:
 ```yaml
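The prefix stripping described in the changed paragraph can be illustrated with the quickstart settings shown in the surrounding context; a minimal sketch (the resulting mapping is inferred from the description, not taken from the commit):

```yaml
# Fluss server configuration: every key under datalake.paimon.* is handed to Paimon
datalake.paimon.metastore: filesystem
datalake.paimon.warehouse: /tmp/paimon

# After Fluss strips the datalake.paimon. prefix, the Paimon catalog is created with:
#   metastore: filesystem
#   warehouse: /tmp/paimon
```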
@@ -66,7 +66,7 @@ Then, you must start the datalake tiering service to tier Fluss's data to the la
 - Put [fluss-lake-paimon jar](https://repo1.maven.org/maven2/org/apache/fluss/fluss-lake-paimon/$FLUSS_VERSION$/fluss-lake-paimon-$FLUSS_VERSION$.jar) into `${FLINK_HOME}/lib`
 - Put [paimon-bundle jar](https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-bundle/$PAIMON_VERSION$/paimon-bundle-$PAIMON_VERSION$.jar) into `${FLINK_HOME}/lib`
 - [Download](https://flink.apache.org/downloads/) pre-bundled Hadoop jar `flink-shaded-hadoop-2-uber-*.jar` and put into `${FLINK_HOME}/lib`
-- Put Paimon's [filesystem jar](https://paimon.apache.org/docs/1.3/project/download/) into `${FLINK_HOME}/lib`, if you use s3 to store paimon data, please put `paimon-s3` jar into `${FLINK_HOME}/lib`
+- Put Paimon's [filesystem jar](https://paimon.apache.org/docs/$PAIMON_VERSION_SHORT$/project/download/) into `${FLINK_HOME}/lib`; if you use S3 to store Paimon data, put the `paimon-s3` jar into `${FLINK_HOME}/lib`
 - The other jars that Paimon may require, for example, if you use HiveCatalog, you will need to put hive related jars
 
 
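Once the jars are in place, the tiering service mentioned in the hunk context is submitted as a regular Flink job. A minimal sketch, assuming a local Fluss cluster on `localhost:9123` and the filesystem warehouse from the example above; the exact arguments follow the `datalake.*` naming used in the configuration and should be treated as illustrative:

```shell
# Submit the datalake tiering service to the Flink cluster (paths and addresses are illustrative)
${FLINK_HOME}/bin/flink run \
  /path/to/fluss-flink-tiering-$FLUSS_VERSION$.jar \
  --fluss.bootstrap.servers localhost:9123 \
  --datalake.format paimon \
  --datalake.paimon.metastore filesystem \
  --datalake.paimon.warehouse /tmp/paimon
```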
website/docs/quickstart/lakehouse.md

Lines changed: 113 additions & 55 deletions
@@ -32,12 +32,39 @@ mkdir fluss-quickstart-paimon
 cd fluss-quickstart-paimon
 ```
 
-2. Create a `docker-compose.yml` file with the following content:
+2. Create directories and download required jars:
+
+```shell
+mkdir -p lib opt
+
+# Flink connectors
+wget -O lib/flink-faker-0.5.3.jar https://github.com/knaufk/flink-faker/releases/download/v0.5.3/flink-faker-0.5.3.jar
+wget -O "lib/fluss-flink-1.20-$FLUSS_DOCKER_VERSION$.jar" "https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-1.20/$FLUSS_DOCKER_VERSION$/fluss-flink-1.20-$FLUSS_DOCKER_VERSION$.jar"
+wget -O "lib/paimon-flink-1.20-$PAIMON_VERSION$.jar" "https://repo1.maven.org/maven2/org/apache/paimon/paimon-flink-1.20/$PAIMON_VERSION$/paimon-flink-1.20-$PAIMON_VERSION$.jar"
+
+# Fluss lake plugin
+wget -O "lib/fluss-lake-paimon-$FLUSS_DOCKER_VERSION$.jar" "https://repo1.maven.org/maven2/org/apache/fluss/fluss-lake-paimon/$FLUSS_DOCKER_VERSION$/fluss-lake-paimon-$FLUSS_DOCKER_VERSION$.jar"
+
+# Paimon bundle jar
+wget -O "lib/paimon-bundle-$PAIMON_VERSION$.jar" "https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-bundle/$PAIMON_VERSION$/paimon-bundle-$PAIMON_VERSION$.jar"
+
+# Hadoop bundle jar
+wget -O lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar https://repo.maven.apache.org/maven2/org/apache/flink/flink-shaded-hadoop-2-uber/2.8.3-10.0/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar
+
+# Tiering service
+wget -O "opt/fluss-flink-tiering-$FLUSS_DOCKER_VERSION$.jar" "https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-tiering/$FLUSS_DOCKER_VERSION$/fluss-flink-tiering-$FLUSS_DOCKER_VERSION$.jar"
+```
 
+:::info
+You can add more jars to this `lib` directory based on your requirements:
+- **Cloud storage support**: For AWS S3 integration with Paimon, add the corresponding [paimon-s3](https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-s3/$PAIMON_VERSION$/paimon-s3-$PAIMON_VERSION$.jar)
+- **Other catalog backends**: Add jars needed for alternative Paimon catalog implementations (e.g., Hive, JDBC)
+:::
+
+3. Create a `docker-compose.yml` file with the following content:
 
 ```yaml
 services:
-  #begin Fluss cluster
   coordinator-server:
     image: apache/fluss:$FLUSS_DOCKER_VERSION$
     command: coordinatorServer
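If you want the S3 support mentioned in the added info block, the extra jar can be fetched the same way as the others; a minimal sketch using the URL given in that note:

```shell
# Optional: Paimon S3 filesystem support (only needed if the warehouse lives on S3)
wget -O "lib/paimon-s3-$PAIMON_VERSION$.jar" "https://repo.maven.apache.org/maven2/org/apache/paimon/paimon-s3/$PAIMON_VERSION$/paimon-s3-$PAIMON_VERSION$.jar"
```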
@@ -54,6 +81,7 @@ services:
         datalake.paimon.warehouse: /tmp/paimon
     volumes:
       - shared-tmpfs:/tmp/paimon
+      - shared-tmpfs:/tmp/fluss
   tablet-server:
     image: apache/fluss:$FLUSS_DOCKER_VERSION$
     command: tabletServer
@@ -72,37 +100,50 @@
         datalake.paimon.warehouse: /tmp/paimon
     volumes:
       - shared-tmpfs:/tmp/paimon
+      - shared-tmpfs:/tmp/fluss
   zookeeper:
     restart: always
     image: zookeeper:3.9.2
-  #end
-  #begin Flink cluster
   jobmanager:
-    image: apache/fluss-quickstart-flink:1.20-$FLUSS_DOCKER_VERSION$
+    image: flink:1.20-scala_2.12-java17
     ports:
       - "8083:8081"
-    command: jobmanager
+    entrypoint: ["/bin/bash", "-c"]
+    command: >
+      "sed -i 's/exec $(drop_privs_cmd)//g' /docker-entrypoint.sh &&
+      cp /tmp/jars/*.jar /opt/flink/lib/ 2>/dev/null || true;
+      cp /tmp/opt/*.jar /opt/flink/opt/ 2>/dev/null || true;
+      /docker-entrypoint.sh jobmanager"
     environment:
       - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
     volumes:
       - shared-tmpfs:/tmp/paimon
+      - shared-tmpfs:/tmp/fluss
+      - ./lib:/tmp/jars
+      - ./opt:/tmp/opt
   taskmanager:
-    image: apache/fluss-quickstart-flink:1.20-$FLUSS_DOCKER_VERSION$
+    image: flink:1.20-scala_2.12-java17
     depends_on:
       - jobmanager
-    command: taskmanager
+    entrypoint: ["/bin/bash", "-c"]
+    command: >
+      "sed -i 's/exec $(drop_privs_cmd)//g' /docker-entrypoint.sh &&
+      cp /tmp/jars/*.jar /opt/flink/lib/ 2>/dev/null || true;
+      cp /tmp/opt/*.jar /opt/flink/opt/ 2>/dev/null || true;
+      /docker-entrypoint.sh taskmanager"
     environment:
       - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.numberOfTaskSlots: 10
        taskmanager.memory.process.size: 2048m
-        taskmanager.memory.framework.off-heap.size: 256m
     volumes:
       - shared-tmpfs:/tmp/paimon
-  #end
+      - shared-tmpfs:/tmp/fluss
+      - ./lib:/tmp/jars
+      - ./opt:/tmp/opt
 
 volumes:
   shared-tmpfs:
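The new `entrypoint`/`command` override copies the downloaded jars from the mounted `./lib` and `./opt` directories into Flink's classpath before the normal entrypoint starts the process. Once the containers are up, a quick check like the following should list the Fluss and Paimon jars (exact file names depend on the versions you downloaded):

```shell
# List the jars the jobmanager picked up from ./lib
docker compose exec jobmanager ls /opt/flink/lib
```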
@@ -116,11 +157,7 @@ The Docker Compose environment consists of the following containers:
 - **Fluss Cluster:** a Fluss `CoordinatorServer`, a Fluss `TabletServer` and a `ZooKeeper` server.
 - **Flink Cluster**: a Flink `JobManager` and a Flink `TaskManager` container to execute queries.
 
-**Note:** The `apache/fluss-quickstart-flink` image is based on [flink:1.20.3-java17](https://hub.docker.com/layers/library/flink/1.20-java17/images/sha256:296c7c23fa40a9a3547771b08fc65e25f06bc4cfd3549eee243c99890778cafc) and
-includes the [fluss-flink](engine-flink/getting-started.md), [paimon-flink](https://paimon.apache.org/docs/1.3/flink/quick-start/) and
-[flink-connector-faker](https://flink-packages.org/packages/flink-faker) to simplify this guide.
-
-3. To start all containers, run:
+4. To start all containers, run:
 ```shell
 docker compose up -d
 ```
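To confirm that the five services defined in the compose file (coordinator-server, tablet-server, zookeeper, jobmanager, taskmanager) are running, a quick check:

```shell
# Show the status of the services defined in docker-compose.yml
docker compose ps
```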
@@ -312,23 +349,69 @@ Congratulations, you are all set!
 
 First, use the following command to enter the Flink SQL CLI Container:
 ```shell
-docker compose exec jobmanager ./sql-client
+docker compose exec jobmanager ./bin/sql-client.sh
 ```
 
-**Note**:
-To simplify this guide, three temporary tables have been pre-created with `faker` connector to generate data.
-You can view their schemas by running the following commands:
+To simplify this guide, we will create three temporary tables with the `faker` connector to generate data:
+
+```sql title="Flink SQL"
+CREATE TEMPORARY TABLE source_order (
+    `order_key` BIGINT,
+    `cust_key` INT,
+    `total_price` DECIMAL(15, 2),
+    `order_date` DATE,
+    `order_priority` STRING,
+    `clerk` STRING
+) WITH (
+    'connector' = 'faker',
+    'rows-per-second' = '10',
+    'number-of-rows' = '10000',
+    'fields.order_key.expression' = '#{number.numberBetween ''0'',''100000000''}',
+    'fields.cust_key.expression' = '#{number.numberBetween ''0'',''20''}',
+    'fields.total_price.expression' = '#{number.randomDouble ''3'',''1'',''1000''}',
+    'fields.order_date.expression' = '#{date.past ''100'' ''DAYS''}',
+    'fields.order_priority.expression' = '#{regexify ''(low|medium|high){1}''}',
+    'fields.clerk.expression' = '#{regexify ''(Clerk1|Clerk2|Clerk3|Clerk4){1}''}'
+);
+```
 
 ```sql title="Flink SQL"
-SHOW CREATE TABLE source_customer;
+CREATE TEMPORARY TABLE source_customer (
+    `cust_key` INT,
+    `name` STRING,
+    `phone` STRING,
+    `nation_key` INT NOT NULL,
+    `acctbal` DECIMAL(15, 2),
+    `mktsegment` STRING,
+    PRIMARY KEY (`cust_key`) NOT ENFORCED
+) WITH (
+    'connector' = 'faker',
+    'number-of-rows' = '200',
+    'fields.cust_key.expression' = '#{number.numberBetween ''0'',''20''}',
+    'fields.name.expression' = '#{funnyName.name}',
+    'fields.nation_key.expression' = '#{number.numberBetween ''1'',''5''}',
+    'fields.phone.expression' = '#{phoneNumber.cellPhone}',
+    'fields.acctbal.expression' = '#{number.randomDouble ''3'',''1'',''1000''}',
+    'fields.mktsegment.expression' = '#{regexify ''(AUTOMOBILE|BUILDING|FURNITURE|MACHINERY|HOUSEHOLD){1}''}'
+);
 ```
 
 ```sql title="Flink SQL"
-SHOW CREATE TABLE source_order;
+CREATE TEMPORARY TABLE `source_nation` (
+    `nation_key` INT NOT NULL,
+    `name` STRING,
+    PRIMARY KEY (`nation_key`) NOT ENFORCED
+) WITH (
+    'connector' = 'faker',
+    'number-of-rows' = '100',
+    'fields.nation_key.expression' = '#{number.numberBetween ''1'',''5''}',
+    'fields.name.expression' = '#{regexify ''(CANADA|JORDAN|CHINA|UNITED|INDIA){1}''}'
+);
 ```
 
 ```sql title="Flink SQL"
-SHOW CREATE TABLE source_nation;
+-- drop records silently if a null value would have to be inserted into a NOT NULL column
+SET 'table.exec.sink.not-null-enforcer'='DROP';
 ```
 
 </TabItem>
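After creating the three `faker` tables, a quick sanity check inside the SQL CLI confirms that rows are being generated; a minimal sketch (any of the three source tables works):

```sql title="Flink SQL"
-- peek at a few generated customers; press Q to leave the result view once rows appear
SELECT * FROM source_customer LIMIT 5;
```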
@@ -635,10 +718,6 @@ CREATE TABLE datalake_enriched_orders (
 ```
 
 Next, perform streaming data writing into the **datalake-enabled** table, `datalake_enriched_orders`:
-```sql title="Flink SQL"
--- switch to streaming mode
-SET 'execution.runtime-mode' = 'streaming';
-```
 
 ```sql title="Flink SQL"
 -- insert tuples into datalake_enriched_orders
@@ -674,9 +753,15 @@ The data for the `datalake_enriched_orders` table is stored in Fluss (for real-t
 When querying the `datalake_enriched_orders` table, Fluss uses a union operation that combines data from both Fluss and Paimon to provide a complete result set -- combines **real-time** and **historical** data.
 
 If you wish to query only the data stored in Paimon—offering high-performance access without the overhead of unioning data—you can use the `datalake_enriched_orders$lake` table by appending the `$lake` suffix.
-This approach also enables all the optimizations and features of a Flink Paimon table source, including [system table](https://paimon.apache.org/docs/1.3/concepts/system-tables/) such as `datalake_enriched_orders$lake$snapshots`.
+This approach also enables all the optimizations and features of a Flink Paimon table source, including [system tables](https://paimon.apache.org/docs/$PAIMON_VERSION_SHORT$/concepts/system-tables/) such as `datalake_enriched_orders$lake$snapshots`.
 
 To query the snapshots directly from Paimon, use the following SQL:
+
+```sql title="Flink SQL"
+-- use tableau result mode
+SET 'sql-client.execution.result-mode' = 'tableau';
+```
+
 ```sql title="Flink SQL"
 -- switch to batch mode
 SET 'execution.runtime-mode' = 'batch';
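Querying only the tiered data through the `$lake` suffix described in this hunk looks like the following; a minimal sketch, run in the batch mode set just above:

```sql title="Flink SQL"
-- read only what has been tiered to Paimon, skipping the union with the Fluss log
SELECT COUNT(*) FROM datalake_enriched_orders$lake;
```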
@@ -726,33 +811,7 @@ The result looks like:
 ```
 You can execute the real-time analytics query multiple times, and the results will vary with each run as new data is continuously written to Fluss in real-time.
 
-Finally, you can use the following command to view the files stored in Paimon:
-```shell
-docker compose exec taskmanager tree /tmp/paimon/fluss.db
-```
-
-**Sample Output:**
-```shell
-/tmp/paimon/fluss.db
-└── datalake_enriched_orders
-    ├── bucket-0
-    │   ├── changelog-aef1810f-85b2-4eba-8eb8-9b136dec5bdb-0.orc
-    │   └── data-aef1810f-85b2-4eba-8eb8-9b136dec5bdb-1.orc
-    ├── manifest
-    │   ├── manifest-aaa007e1-81a2-40b3-ba1f-9df4528bc402-0
-    │   ├── manifest-aaa007e1-81a2-40b3-ba1f-9df4528bc402-1
-    │   ├── manifest-list-ceb77e1f-7d17-4160-9e1f-f334918c6e0d-0
-    │   ├── manifest-list-ceb77e1f-7d17-4160-9e1f-f334918c6e0d-1
-    │   └── manifest-list-ceb77e1f-7d17-4160-9e1f-f334918c6e0d-2
-    ├── schema
-    │   └── schema-0
-    └── snapshot
-        ├── EARLIEST
-        ├── LATEST
-        └── snapshot-1
-```
-
-The files adhere to Paimon's standard format, enabling seamless querying with other engines such as [Spark](https://paimon.apache.org/docs/1.3/spark/quick-start/) and [Trino](https://paimon.apache.org/docs/1.3/ecosystem/trino/).
+The files adhere to Paimon's standard format, enabling seamless querying with other engines such as [Spark](https://paimon.apache.org/docs/$PAIMON_VERSION_SHORT$/spark/quick-start/) and [Trino](https://paimon.apache.org/docs/$PAIMON_VERSION_SHORT$/ecosystem/trino/).
 
 </TabItem>
 
@@ -776,7 +835,6 @@ SET 'sql-client.execution.result-mode' = 'tableau';
 SET 'execution.runtime-mode' = 'batch';
 ```
 
-
 ```sql title="Flink SQL"
 -- query snapshots in iceberg
 SELECT snapshot_id, operation FROM datalake_enriched_orders$lake$snapshots;
