tonglin0325's homepage

elephant-bird study notes

elephant-bird is an open-source project from Twitter, hosted at https://github.com/twitter/elephant-bird

It is Twitter's library of LZO-, Thrift-, and Protocol Buffer-related Hadoop InputFormats, OutputFormats, Writables, Pig load functions, Hive SerDes, HBase secondary indexing, and more.

mvn clean install -U -Dprotobuf.version=2.5.0 -DskipTests=true

Running mvn package requires signing, so a GPG key has to be generated first:

gpg --gen-key

Apache Thrift and Protocol Buffers also need to be installed.

For installing Thrift, see:

https://www.cnblogs.com/tonglin0325/p/10190050.html

For installing Protocol Buffers, see:

https://www.cnblogs.com/tonglin0325/p/13685527.html
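Before starting the build, it can help to confirm that the prerequisites mentioned above are on the PATH. The following is only a minimal sketch: the function name check_tools is made up for illustration, and it assumes the tools are installed under their usual executable names (mvn, gpg, thrift, protoc).

```shell
# Hypothetical helper: report which of the given tools are on PATH.
# Returns 0 if all were found, 1 if at least one is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Assumed executable names for the prerequisites named in these notes.
check_tools mvn gpg thrift protoc || echo "install the missing tools before building"
```

This only checks that the executables exist; it does not check that their versions match what the build expects (for example the protobuf 2.5.0 passed to mvn above).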

 

Type mapping when creating a Hive table with elephant-bird

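The table's columns do not need to be declared explicitly; they are derived automatically by deserializing the class named in 'serialization.class'='com.xxx.xxx.xxx'.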

Reference: https://github.com/twitter/elephant-bird/wiki/How-to-use-Elephant-Bird-with-Hive
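As an illustration of a DDL with no column list, here is a minimal sketch that writes one to a file. It is not the author's original statement: it reuses the SerDe, input/output formats, and the placeholder table name, class, and location that appear in the show create table output in these notes, and the file name create_thrift_table.hql is an arbitrary choice.

```shell
# Write the DDL to a file. The quoted 'EOF' stops the shell from expanding
# anything inside the heredoc.
cat > create_thrift_table.hql <<'EOF'
-- No column list: the columns come from the Thrift class in serialization.class.
-- xxxx, com.xxx.xxx.xxx and hdfs://xxxxxxx are placeholders, not real values.
CREATE EXTERNAL TABLE xxxx
COMMENT 'test_all_type'
PARTITIONED BY (ds string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer'
WITH SERDEPROPERTIES (
  'serialization.class'='com.xxx.xxx.xxx',
  'serialization.format'='org.apache.thrift.protocol.TCompactProtocol')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION 'hdfs://xxxxxxx';
EOF
```

Once the placeholders are replaced and the jar containing the Thrift class is available to Hive, the file can be run with hive -f create_thrift_table.hql.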

Output of show create table after the table has been created:

CREATE EXTERNAL TABLE `xxxx`(
`ts` string COMMENT 'from deserializer',
`schema` string COMMENT 'from deserializer',
`test_string` string COMMENT 'from deserializer',
`test_long` bigint COMMENT 'from deserializer',
`test_int` int COMMENT 'from deserializer',
`test_short` smallint COMMENT 'from deserializer',
`test_double` double COMMENT 'from deserializer',
`test_byte` tinyint COMMENT 'from deserializer',
`test_bool` boolean COMMENT 'from deserializer',
`test_list` array<string> COMMENT 'from deserializer',
`test_set` array<bigint> COMMENT 'from deserializer',
`test_map` map<string,int> COMMENT 'from deserializer')
COMMENT 'test_all_type'
PARTITIONED BY (
`ds` string COMMENT '日期分区')
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer'
WITH SERDEPROPERTIES (
'serialization.class'='com.xxx.xxx.xxx',
'serialization.format'='org.apache.thrift.protocol.TCompactProtocol')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION
'hdfs://xxxxxxx'
TBLPROPERTIES (