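elephant-bird is an open-source project from Twitter, hosted at https://github.com/twitter/elephant-bird
It is Twitter's library of LZO, Thrift, and Protocol Buffer related Hadoop InputFormats, OutputFormats, Writables, Pig load functions, Hive SerDes, HBase secondary indexes, and similar utilities.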
To build the project with Maven:
    mvn clean install -U -Dprotobuf.version=2.5.0 -DskipTests=true
Running mvn package requires artifact signing, and Apache Thrift and Protocol Buffers must be installed beforehand.
For installing Thrift, see: https://www.cnblogs.com/tonglin0325/p/10190050.html
For installing Protocol Buffers, see: https://www.cnblogs.com/tonglin0325/p/13685527.html
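Since the build depends on Thrift and Protocol Buffers being installed, it can help to check for them before starting Maven. The following is only a minimal sketch: the function names `missing_tools` and `build_elephant_bird` are made up for illustration, and it assumes the compilers are on PATH under their usual names `thrift` and `protoc`.

```shell
#!/usr/bin/env bash
# Sketch only: check for the build prerequisites before invoking Maven.
# The function names here are hypothetical helpers, not part of elephant-bird.

# Print the name of every given tool that cannot be found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

# Run the build command from this post only if all prerequisites are present.
build_elephant_bird() {
  local missing
  missing=$(missing_tools mvn thrift protoc)
  if [ -n "$missing" ]; then
    echo "missing build prerequisites:" $missing >&2
    return 1
  fi
  # Show the installed compiler version; presumably it should agree with
  # the -Dprotobuf.version value passed to Maven below.
  protoc --version
  mvn clean install -U -Dprotobuf.version=2.5.0 -DskipTests=true
}
```

Calling `build_elephant_bird` from the root of the elephant-bird checkout either reports what is missing or starts the build.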
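Type mapping when creating a Hive table with elephant-bird
The table schema does not need to be declared explicitly; the columns are derived automatically from the class named in 'serialization.class'='com.xxx.xxx.xxx'.
Reference: https://github.com/twitter/elephant-bird/wiki/How-to-use-Elephant-Bird-with-Hive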
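As an illustration of a column-free definition, the sketch below prints a CREATE TABLE statement with no column list. It is reconstructed from the show create table output in this post, not copied from the elephant-bird wiki, so treat it as an assumption about what the original DDL looked like. The function name `thrift_table_ddl` and its three arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: print a CREATE TABLE statement without a column list, so that
# Hive takes the columns from the deserializer.
# Arguments (all placeholders): table name, fully qualified Thrift class,
# table location.
thrift_table_ddl() {
  local table="$1"
  local thrift_class="$2"
  local location="$3"
  # Backticks are escaped so the here-document emits them literally.
  cat <<EOF
CREATE EXTERNAL TABLE \`${table}\`
PARTITIONED BY (\`ds\` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer'
WITH SERDEPROPERTIES (
  'serialization.class'='${thrift_class}',
  'serialization.format'='org.apache.thrift.protocol.TCompactProtocol')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION '${location}';
EOF
}
```

The statement could then be submitted with something like `hive -e "$(thrift_table_ddl xxxx com.xxx.xxx.xxx 'hdfs://xxxxxxx')"`, assuming the jar containing the Thrift class is already on Hive's classpath.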
Output of show create table after the table has been created:
    CREATE EXTERNAL TABLE `xxxx`(
      `ts` string COMMENT 'from deserializer',
      `schema` string COMMENT 'from deserializer',
      `test_string` string COMMENT 'from deserializer',
      `test_long` bigint COMMENT 'from deserializer',
      `test_int` int COMMENT 'from deserializer',
      `test_short` smallint COMMENT 'from deserializer',
      `test_double` double COMMENT 'from deserializer',
      `test_byte` tinyint COMMENT 'from deserializer',
      `test_bool` boolean COMMENT 'from deserializer',
      `test_list` array<string> COMMENT 'from deserializer',
      `test_set` array<bigint> COMMENT 'from deserializer',
      `test_map` map<string,int> COMMENT 'from deserializer')
    COMMENT 'test_all_type'
    PARTITIONED BY (
      `ds` string COMMENT '日期分区')
    ROW FORMAT SERDE
      'org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer'
    WITH SERDEPROPERTIES (
      'serialization.class'='com.xxx.xxx.xxx',
      'serialization.format'='org.apache.thrift.protocol.TCompactProtocol')
    STORED AS INPUTFORMAT
      'org.apache.hadoop.mapred.SequenceFileInputFormat'
    OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
    LOCATION
      'hdfs://xxxxxxx'
    TBLPROPERTIES (