Iceberg
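Limitations
- Iceberg V1 and V2 table formats are supported.
- For the V2 format, only Position Delete is supported; Equality Delete is not supported.
Creating a Catalog
Creating a Catalog Based on Hive Metastore
This is largely the same as a Hive Catalog, so only a simple example is given here. For other examples, see Hive Catalog.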
CREATE CATALOG iceberg PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
'hadoop.username' = 'hive',
'dfs.nameservices'='your-nameservice',
'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
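Once the catalog exists, it can be browsed like any other Doris catalog. The following is a minimal sketch, assuming the catalog above was created under the name iceberg; the database name db1 and the table name table1 are hypothetical placeholders, not names from this document.

```sql
-- Confirm that the `iceberg` catalog was registered.
SHOW CATALOGS;

-- Switch the session to the Iceberg catalog and list its databases.
SWITCH iceberg;
SHOW DATABASES;

-- `db1` and `table1` are hypothetical names; substitute your own.
SELECT * FROM iceberg.db1.table1 LIMIT 10;
```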
Creating a Catalog Based on the Iceberg API
When metadata is accessed through the Iceberg API, services such as Hadoop File System, Hive, REST, Glue, and DLF can serve as the Iceberg Catalog.
Hadoop Catalog
Note: the warehouse path must point to the parent of the Database path. For example, if your table path is s3://bucket/path/to/db1/table1, then warehouse should be s3://bucket/path/to/
CREATE CATALOG iceberg_hadoop PROPERTIES (
'type'='iceberg',
'iceberg.catalog.type' = 'hadoop',
'warehouse' = 'hdfs://your-host:8020/dir/key'
);
CREATE CATALOG iceberg_hadoop_ha PROPERTIES (
'type'='iceberg',
'iceberg.catalog.type' = 'hadoop',
'warehouse' = 'hdfs://your-nameservice/dir/key',
'dfs.nameservices'='your-nameservice',
'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
CREATE CATALOG iceberg_s3 PROPERTIES (
'type'='iceberg',
'iceberg.catalog.type' = 'hadoop',
'warehouse' = 's3://bucket/dir/key',
's3.endpoint' = 's3.us-east-1.amazonaws.com',
's3.access_key' = 'ak',
's3.secret_key' = 'sk'
);
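To make the warehouse rule for Hadoop Catalog concrete, here is a sketch that reuses the table path from the note (s3://bucket/path/to/db1/table1). The catalog name iceberg_s3_example is a hypothetical name chosen for illustration, and the endpoint and keys are the same placeholders used in the example above.

```sql
-- The table files live at s3://bucket/path/to/db1/table1,
-- so warehouse points one level above the database directory db1.
CREATE CATALOG iceberg_s3_example PROPERTIES (
    'type' = 'iceberg',
    'iceberg.catalog.type' = 'hadoop',
    'warehouse' = 's3://bucket/path/to/',
    's3.endpoint' = 's3.us-east-1.amazonaws.com',
    's3.access_key' = 'ak',
    's3.secret_key' = 'sk'
);

-- With that warehouse, the table is addressed as <catalog>.db1.table1.
SELECT * FROM iceberg_s3_example.db1.table1 LIMIT 10;
```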
Hive Metastore
CREATE CATALOG iceberg PROPERTIES (
'type'='iceberg',
'iceberg.catalog.type'='hms',
'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
'hadoop.username' = 'hive',
'dfs.nameservices'='your-nameservice',
'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
AWS Glue
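When connecting to Glue from a non-EC2 environment, copy the ~/.aws directory from an EC2 environment into the current environment. Alternatively, download the AWS CLI tool and configure it; this also creates a .aws directory under the current user's home directory.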
CREATE CATALOG glue PROPERTIES (
"type"="iceberg",
"iceberg.catalog.type" = "glue",
"glue.endpoint" = "https://glue.us-east-1.amazonaws.com",
"glue.access_key" = "ak",
"glue.secret_key" = "sk"
);
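For details on the Iceberg properties, see Iceberg Glue Catalog.
When running inside an AWS service (such as EC2), if the credential properties (glue.access_key and glue.secret_key) are omitted, Doris uses the default DefaultAWSCredentialsProviderChain, which reads the properties configured in system environment variables or in the InstanceProfile.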
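A sketch of the credential-free variant, assuming Doris runs inside an AWS service such as EC2 where credentials are available from environment variables or the instance profile. The catalog name glue_default_credentials is hypothetical; the endpoint is the one from the example above.

```sql
-- No glue.access_key / glue.secret_key here: credentials are expected
-- to be resolved by DefaultAWSCredentialsProviderChain at runtime.
CREATE CATALOG glue_default_credentials PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "glue",
    "glue.endpoint" = "https://glue.us-east-1.amazonaws.com"
);
```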
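Alibaba Cloud DLF
REST Catalog
This approach requires a REST service to be provided in advance; users must implement the REST interface for retrieving Iceberg metadata.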
CREATE CATALOG iceberg PROPERTIES (
'type'='iceberg',
'iceberg.catalog.type'='rest',
'uri' = 'http://172.21.0.1:8181'
);
If the data is stored on HDFS with high availability (HA) enabled, the HDFS HA configuration must also be added to the Catalog:
CREATE CATALOG iceberg PROPERTIES (
'type'='iceberg',
'iceberg.catalog.type'='rest',
'uri' = 'http://172.21.0.1:8181',
'dfs.nameservices'='your-nameservice',
'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.1:8020',
'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.2:8020',
'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
Google Dataproc Metastore
CREATE CATALOG iceberg PROPERTIES (
"type"="iceberg",
"iceberg.catalog.type"="hms",
"hive.metastore.uris" = "thrift://172.21.0.1:9083",
"gs.endpoint" = "https://storage.googleapis.com",
"gs.region" = "us-east-1",
"gs.access_key" = "ak",
"gs.secret_key" = "sk",
"use_path_style" = "true"
);
hive.metastore.uris: the endpoint exposed by the Dataproc Metastore service. It can be obtained from the Metastore management page: Dataproc Metastore Services.
Iceberg On Object Storage
If the data is stored on S3, the following parameters can be used in properties:
"s3.access_key" = "ak"
"s3.secret_key" = "sk"
"s3.endpoint" = "s3.us-east-1.amazonaws.com"
"s3.region" = "us-east-1"
For data stored on Alibaba Cloud OSS:
"oss.access_key" = "ak"
"oss.secret_key" = "sk"
"oss.endpoint" = "oss-cn-beijing-internal.aliyuncs.com"
"oss.region" = "oss-cn-beijing"
For data stored on Tencent Cloud COS:
"cos.access_key" = "ak"
"cos.secret_key" = "sk"
"cos.endpoint" = "cos.ap-beijing.myqcloud.com"
"cos.region" = "ap-beijing"
For data stored on Huawei Cloud OBS:
"obs.access_key" = "ak"
"obs.secret_key" = "sk"
"obs.endpoint" = "obs.cn-north-4.myhuaweicloud.com"
"obs.region" = "cn-north-4"
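The storage parameters above are fragments and need to sit inside a catalog definition. The sketch below assumes they are placed alongside the catalog properties, as the S3 Hadoop Catalog and Dataproc examples earlier do. The catalog name iceberg_oss is hypothetical, and the metastore address is the placeholder used elsewhere on this page.

```sql
-- Hive Metastore-backed Iceberg catalog whose data files are on OSS.
CREATE CATALOG iceberg_oss PROPERTIES (
    'type' = 'iceberg',
    'iceberg.catalog.type' = 'hms',
    'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
    -- Object storage parameters taken from the OSS list above.
    'oss.access_key' = 'ak',
    'oss.secret_key' = 'sk',
    'oss.endpoint' = 'oss-cn-beijing-internal.aliyuncs.com',
    'oss.region' = 'oss-cn-beijing'
);
```

For COS or OBS, the same shape should apply with the cos.* or obs.* parameters substituted.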
Column Type Mapping
Same as Hive Catalog; see the Column Type Mapping section of Hive Catalog.
Time Travel
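Reading a specified Snapshot of an Iceberg table is supported.
Every write operation on an Iceberg table produces a new snapshot.
By default, a read request only reads the latest snapshot.
The FOR TIME AS OF and FOR VERSION AS OF clauses read historical data by the time a snapshot was produced or by snapshot ID, respectively. For example: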
SELECT * FROM iceberg_tbl FOR TIME AS OF "2022-10-07 17:20:37";
SELECT * FROM iceberg_tbl FOR VERSION AS OF 868895038966572;
In addition, the iceberg_meta table function can be used to query the snapshot information of a specified table.
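A sketch of how the two features might be combined: list the snapshots first, then read the table at one of them. The parameter names "table" and "query_type" are not stated in this section and should be checked against the iceberg_meta reference for your Doris version. The catalog name iceberg and the database name db1 are hypothetical.

```sql
-- List the snapshots of a table; the "table" value is catalog.database.table.
SELECT * FROM iceberg_meta(
    "table" = "iceberg.db1.iceberg_tbl",
    "query_type" = "snapshots"
);

-- Read the table as of a snapshot ID taken from the result above
-- (the ID here is the placeholder from the earlier example).
SELECT * FROM iceberg.db1.iceberg_tbl FOR VERSION AS OF 868895038966572;
```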