7.2.3 Iceberg Catalog
1 Usage Restrictions
- Supports Iceberg V1/V2 table formats.
- Supports Position Delete.
- Supports Equality Delete since version 2.1.3.
- Supports the Parquet file format.
- Supports the ORC file format since version 2.1.3.
2 Creating a Catalog

2.1 Based on Hive Metastore

Creation is essentially the same as for a Hive Catalog; only a simple example is given here. For other examples, see Hive Catalog.
```sql
CREATE CATALOG iceberg PROPERTIES (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
    'hadoop.username' = 'hive',
    'dfs.nameservices'='your-nameservice',
    'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
    'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
    'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
    'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
```
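After the catalog is created, you can switch to it and query Iceberg tables directly. A minimal usage sketch (the database and table names db1/tbl1 are placeholders):

```sql
-- Switch to the new catalog and browse its databases
SWITCH iceberg;
SHOW DATABASES;

-- Or query with a fully qualified name (db1/tbl1 are hypothetical)
SELECT * FROM iceberg.db1.tbl1 LIMIT 10;
```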
2.2 Based on the Iceberg API
This method accesses metadata through the Iceberg API. Hadoop File System, Hive, REST, Glue, DLF, and other services can serve as the Iceberg catalog.
2.2.1 Hadoop Catalog
Tip
Note: the warehouse path must point to the level above the Database path.
Example: if your table path is s3://bucket/path/to/db1/table1, then warehouse should be s3://bucket/path/to/.
Example on HDFS:

```sql
CREATE CATALOG iceberg_hadoop PROPERTIES (
    'type'='iceberg',
    'iceberg.catalog.type' = 'hadoop',
    'warehouse' = 'hdfs://your-host:8020/dir/key'
);
```
Example on HDFS with High Availability enabled:

```sql
CREATE CATALOG iceberg_hadoop_ha PROPERTIES (
    'type'='iceberg',
    'iceberg.catalog.type' = 'hadoop',
    'warehouse' = 'hdfs://your-nameservice/dir/key',
    'dfs.nameservices'='your-nameservice',
    'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
    'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
    'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
    'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
```
Example on object storage (S3):

```sql
CREATE CATALOG iceberg_s3 PROPERTIES (
    'type'='iceberg',
    'iceberg.catalog.type' = 'hadoop',
    'warehouse' = 's3://bucket/dir/key',
    's3.endpoint' = 's3.us-east-1.amazonaws.com',
    's3.access_key' = 'ak',
    's3.secret_key' = 'sk'
);
```
2.2.2 Hive Metastore

```sql
CREATE CATALOG iceberg PROPERTIES (
    'type'='iceberg',
    'iceberg.catalog.type'='hms',
    'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
    'hadoop.username' = 'hive',
    'dfs.nameservices'='your-nameservice',
    'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
    'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
    'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
    'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
```
2.2.3 AWS Glue
When connecting to Glue from a non-EC2 environment, copy the ~/.aws directory from the EC2 environment to the current environment. Alternatively, you can download and configure the AWS CLI tool, which also creates the .aws directory under the current user's home directory. To use this feature, upgrade to a version later than Doris 2.1.7 or 3.0.3.
```sql
-- Using access key and secret key
CREATE CATALOG glue2 PROPERTIES (
    "type"="iceberg",
    "iceberg.catalog.type" = "glue",
    "glue.endpoint" = "https://glue.us-east-1.amazonaws.com/",
    "client.credentials-provider" = "com.amazonaws.glue.catalog.credentials.ConfigAWSProvider",
    "client.credentials-provider.glue.access_key" = "ak",
    "client.credentials-provider.glue.secret_key" = "sk"
);
```
- For details on Iceberg properties, see Iceberg Glue Catalog.
- If client.credentials-provider is not specified, Doris uses the default DefaultAWSCredentialsProviderChain, which reads credentials from system environment variables or the InstanceProfile, as in the sketch below.
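When relying on that default chain, the credential properties can be omitted entirely. A minimal sketch under that assumption (the catalog name glue_default is hypothetical; the endpoint is the placeholder from the example above):

```sql
-- No client.credentials-provider set: Doris falls back to
-- DefaultAWSCredentialsProviderChain (environment variables / InstanceProfile)
CREATE CATALOG glue_default PROPERTIES (
    "type"="iceberg",
    "iceberg.catalog.type" = "glue",
    "glue.endpoint" = "https://glue.us-east-1.amazonaws.com/"
);
```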
2.2.4 Alibaba Cloud DLF

See the Alibaba Cloud DLF Catalog configuration.
2.2.5 REST Catalog

This method requires a pre-deployed REST service, and users must implement the REST interface through which Iceberg metadata is obtained.
```sql
CREATE CATALOG iceberg PROPERTIES (
    'type'='iceberg',
    'iceberg.catalog.type'='rest',
    'uri' = 'http://172.21.0.1:8181'
);
```
If the data is stored on HDFS with High Availability enabled, also add the HDFS HA configuration to the catalog:
```sql
CREATE CATALOG iceberg PROPERTIES (
    'type'='iceberg',
    'iceberg.catalog.type'='rest',
    'uri' = 'http://172.21.0.1:8181',
    'dfs.nameservices'='your-nameservice',
    'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
    'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.1:8020',
    'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.2:8020',
    'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
```
2.2.6 Google Dataproc Metastore

```sql
CREATE CATALOG iceberg PROPERTIES (
    "type"="iceberg",
    "iceberg.catalog.type"="hms",
    "hive.metastore.uris" = "thrift://172.21.0.1:9083",
    "gs.endpoint" = "https://storage.googleapis.com",
    "gs.region" = "us-east-1",
    "gs.access_key" = "ak",
    "gs.secret_key" = "sk",
    "use_path_style" = "true"
);
```
hive.metastore.uris: the endpoint exposed by the Dataproc Metastore service, which can be found on the Metastore management page: Dataproc Metastore Services.
2.3 Iceberg On Object Storage

If the data is stored on S3, the following parameters can be used in properties:
```bash
"s3.access_key" = "ak"
"s3.secret_key" = "sk"
"s3.endpoint" = "s3.us-east-1.amazonaws.com"
"s3.region" = "us-east-1"
```
For data stored on Alibaba Cloud OSS:
```bash
"oss.access_key" = "ak"
"oss.secret_key" = "sk"
"oss.endpoint" = "oss-cn-beijing-internal.aliyuncs.com"
"oss.region" = "oss-cn-beijing"
```
For data stored on Tencent Cloud COS:
```bash
"cos.access_key" = "ak"
"cos.secret_key" = "sk"
"cos.endpoint" = "cos.ap-beijing.myqcloud.com"
"cos.region" = "ap-beijing"
```
For data stored on Huawei Cloud OBS:
```bash
"obs.access_key" = "ak"
"obs.secret_key" = "sk"
"obs.endpoint" = "obs.cn-north-4.myhuaweicloud.com"
"obs.region" = "cn-north-4"
```
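These storage properties are simply appended to the catalog definition. As an illustration, a Hadoop catalog with its warehouse on OSS might be declared as follows (a sketch only; the bucket path, endpoint, and keys are placeholders, and the warehouse must still point to the level above the database path):

```sql
-- Hypothetical combination of a Hadoop catalog with OSS storage properties
CREATE CATALOG iceberg_oss PROPERTIES (
    'type'='iceberg',
    'iceberg.catalog.type' = 'hadoop',
    'warehouse' = 'oss://bucket/path/to/',
    'oss.endpoint' = 'oss-cn-beijing-internal.aliyuncs.com',
    'oss.access_key' = 'ak',
    'oss.secret_key' = 'sk'
);
```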
3 Example
```sql
-- MinIO & Rest Catalog
CREATE CATALOG `iceberg` PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "rest",
    "uri" = "http://10.0.0.1:8181",
    "warehouse" = "s3://bucket",
    "token" = "token123456",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk",
    "s3.endpoint" = "http://10.0.0.1:9000",
    "s3.region" = "us-east-1"
);
```
4 Column Type Mapping

| Iceberg Type | Doris Type |
|---|---|
| boolean | boolean |
| int | int |
| long | bigint |
| float | float |
| double | double |
| decimal(p,s) | decimal(p,s) |
| date | date |
| uuid | string |
| timestamp (Timestamp without timezone) | datetime(6) |
| timestamptz (Timestamp with timezone) | datetime(6) |
| string | string |
| fixed(L) | char(L) |
| binary | string |
| struct | struct (supported since version 2.1.3) |
| map | map (supported since version 2.1.3) |
| list | array |
| time | not supported |
5 Time Travel

Reading a specified snapshot of an Iceberg table is supported.
Every write operation on an Iceberg table produces a new snapshot, and by default read requests only read the latest snapshot.
You can use the FOR TIME AS OF and FOR VERSION AS OF clauses to read historical data based on a snapshot ID or the time at which a snapshot was created. Examples:
```sql
SELECT * FROM iceberg_tbl FOR TIME AS OF "2022-10-07 17:20:37";
SELECT * FROM iceberg_tbl FOR VERSION AS OF 868895038966572;
```
In addition, the iceberg_meta table function can be used to query the snapshot information of a given table, as sketched below.
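A minimal sketch of such a query (the fully qualified table name is a placeholder):

```sql
-- List a table's snapshots; the snapshot_id values returned here
-- can be used with FOR VERSION AS OF
SELECT * FROM iceberg_meta(
    "table" = "iceberg.db1.iceberg_tbl",
    "query_type" = "snapshots"
);
```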