tonglin0325的个人主页

kudu学习笔记——建表语句

kudu支持的数据类型

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
boolean

8-bit signed integer

16-bit signed integer

32-bit signed integer

64-bit signed integer

date (32-bit days since the Unix epoch)

unixtime_micros (64-bit microseconds since the Unix epoch)

single-precision (32-bit) IEEE-754 floating-point number

double-precision (64-bit) IEEE-754 floating-point number

decimal (see Decimal Type for details)

varchar (see Varchar Type for details)

UTF-8 encoded string (up to 64KB uncompressed)

binary (up to 64KB uncompressed)

参考:Apache Kudu Schema Design

kudu的建表语句

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
CREATE TABLE sales_by_year(					# 列存储
year INT, # 有限固定列,强类型
sale_id INT COLPROPERTIES (encoding=“bitshuffle”), # 每一列均可以设置encoding及压缩方式
amount INT,
PRIMARY KEY (year,sale_id) # 主键索引
)
PARTITION BY HASH (sale_id) PARTITIONS 4,              # 哈希分区
RANGE (year)
(
PARTITION 2014 <= VALUES <= 2016, # 范围分区
PARTITION VALUE = 2017
)
STORED AS KUDU
TBLPROPERTIES (replication=3); # 多副本

或者

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
CREATE TABLE xxxx.xxxx(
uuid STRING,
ds string,
`date` string,
`ts` string,
`col1` int,
`col2` bigint,
`col3` int,
PRIMARY KEY (uuid, ds))
PARTITION BY HASH (uuid, ds) PARTITIONS 9,
RANGE(ds) (
PARTITION VALUE="2020-09-05",
PARTITION VALUE="2020-09-06",
PARTITION VALUE="2020-09-07",
PARTITION VALUE="2020-09-08",
PARTITION VALUE="2020-09-09"
)
STORED AS KUDU;

参考:DTCC2017-Kudu介绍-小米张震-final

 

支持若干种分区方式:

参考:kudu 表设计使用及限制 / Kudu table schema design and limited

1. hash 分区

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
create table kudu_test.real_time_sales_temporary_kuduc (
sdt String,
shopId String,
updateTime String,
serialId String,
sheetId String,
goodsId String,

key_by String,
timeFrame String,
regionId String,
regionName String,
shopName String,
serviceRegionId String,
serviceRegionName String,
shopBelongId String,
shopBelongName String,
primary key(sdt,shopId,updatetime,serialId,sheetid)
)
PARTITION by
hash(sdt,shopId) partitions 32
COMMENT '销售流水表C'
STORED AS kudu;

如果不指定hash分区的字段,比如

1
2
PARTITION BY HASH  PARTITIONS 3

那么默认将会使用主键来进行hash,即

1
2
PARTITION BY HASH (sdt,shopId,updatetime,serialId,sheetid) PARTITIONS 3

2. 范围分区

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
CREATE TABLE cust_behavior_table (
id BIGINT,
sku STRING,
salary STRING,
edu_level INT,
usergender STRING,
group STRING,
city STRING,
postcode STRING,
last_purchase_price FLOAT,
last_purchase_date BIGINT,
category STRING,
rating INT,
fulfilled_date BIGINT,
PRIMARY KEY (id, sku)
)
PARTITION BY RANGE (sku)
(
PARTITION VALUES < &lsquo;g&rsquo;,
PARTITION &lsquo;g&rsquo; <= VALUES < &lsquo;o&rsquo;,
PARTITION &lsquo;o&rsquo; <= VALUES < &lsquo;u&rsquo;,
PARTITION &lsquo;u&rsquo; <= VALUES
) STORED AS KUDU
TBLPROPERTIES(
&lsquo;kudu.table_name&rsquo; = &lsquo;cust_behavior_1 &lsquo;,&rsquo;kudu.master_addresses&rsquo; = &lsquo;hadoop5:7051&rsquo;);

3. 混合分区(hash + range)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
CREATE TABLE cust_behavior_1 (
id BIGINT,
sku STRING,
salary STRING,
edu_level INT,
usergender STRING,
group STRING,
city STRING,
postcode STRING,
last_purchase_price FLOAT,
last_purchase_date BIGINT,
category STRING,
rating INT,
fulfilled_date BIGINT,
PRIMARY KEY (id, sku)
)
PARTITION BY
HASH (id) PARTITIONS 4,
RANGE (sku)
(
PARTITION VALUES < &lsquo;g&rsquo;,
PARTITION &lsquo;g&rsquo; <= VALUES < &lsquo;o&rsquo;,
PARTITION &lsquo;o&rsquo; <= VALUES < &lsquo;u&rsquo;,
PARTITION &lsquo;u&rsquo; <= VALUES
) STORED AS KUDU
TBLPROPERTIES(
&lsquo;kudu.table_name&rsquo; = &lsquo;cust_behavior_1 &lsquo;,&rsquo;kudu.master_addresses&rsquo; = &lsquo;hadoop5:7051&rsquo;);